Note
Click here to download the full example code
03. Monodepth2 training on KITTI dataset
This is a tutorial on training MonoDepth2 on the KITTI dataset using the Gluon CV toolkit. Readers should have a basic understanding of deep learning and be familiar with the Gluon API. New users may first go through the 60 Minute Gluon Crash Course. You can either start training right away or dive deep into the details.
Start Training Now
Hint
Feel free to skip the tutorial, as the training script is complete and can be launched directly.
Training command for mono + stereo mode:
python train.py --model_zoo monodepth2_resnet18_kitti_mono_stereo_640x192 --model_zoo_pose monodepth2_resnet18_posenet_kitti_mono_stereo_640x192 --pretrained_base --frame_ids 0 -1 1 --use_stereo --log_dir ./tmp/mono_stereo/ --png --gpu 0 --batch_size 8
Training command for mono mode:
python train.py --model_zoo monodepth2_resnet18_kitti_mono_640x192 --model_zoo_pose monodepth2_resnet18_posenet_kitti_mono_640x192 --pretrained_base --log_dir ./tmp/mono/ --png --gpu 0 --batch_size 12
Training command for stereo mode:
python train.py --model_zoo monodepth2_resnet18_kitti_stereo_640x192 --pretrained_base --split eigen_full --frame_ids 0 --use_stereo --log_dir ./tmp/stereo/ --png --gpu 0 --batch_size 12
For more training command options, please run python train.py -h. Please check out the Model Zoo for training commands to reproduce the pretrained models.
Dive into Deep
import numpy as np
import mxnet as mx
from mxnet import gluon, autograd
import gluoncv
Digging into Self-Supervised Monocular Depth Prediction

(figure credit: Godard et al.)
Self-Supervised Monocular Depth Estimation (Monodepth2) [Godard19] builds a simple depth model and trains it in a self-supervised manner by exploiting spatial geometry constraints. The key idea of Monodepth2 is a novel reprojection loss that includes (1) a minimum reprojection loss designed to robustly handle occlusions, (2) a full-resolution multi-scale sampling method that reduces visual artifacts, and (3) an auto-masking loss that ignores training pixels which violate the camera motion assumptions.
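The first idea, the per-pixel minimum reprojection loss, can be illustrated with a small sketch (a simplified illustration, not the actual trainer code; the shapes and values below are made up):

import mxnet as mx

# Per-pixel loss maps computed against two different source frames,
# each with shape (batch, 1, height, width). Values here are random placeholders.
losses_per_frame = [mx.nd.random.uniform(shape=(2, 1, 192, 640)) for _ in range(2)]

# Instead of averaging over source frames, take the element-wise minimum,
# which is more robust to occlusions.
stacked = mx.nd.concat(*losses_per_frame, dim=1)   # (batch, num_frames, height, width)
min_loss = mx.nd.min(stacked, axis=1)              # per-pixel minimum over frames
print(min_loss.shape)                              # (2, 192, 640)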
Monodepth2 Model
Monodepth2 uses a simple U-Net architecture that combines multi-scale features with different receptive field sizes. It pools the feature maps to different sizes and concatenates them after upsampling. There are two decoders, one for depth estimation and one for camera pose estimation.
The encoder module is a ResNet, which accepts a single RGB image as the input of the depth model. For the pose model, the pose encoder is modified to accept a pair of frames (i.e. six channels) as input. Hence the convolutional weight of the first layer of the pose encoder has shape 6 × 64 × 3 × 3 instead of the ResNet default of 3 × 64 × 3 × 3. When using the pretrained weights for the pose encoder, the first pretrained filter tensor is duplicated along the channel dimension to make a filter of shape 6 × 64 × 3 × 3. All the weights in this newly expanded filter are divided by 2, so that the convolution produces outputs in the same numerical range as the original single-image ResNet.
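This weight expansion can be sketched in isolation (a minimal sketch; the actual logic lives in ResnetEncoder below, and note that in MXNet's (out_channels, in_channels, kH, kW) layout the ResNet18 conv1 weight has shape (64, 3, 7, 7)):

import mxnet as mx

num_input_images = 2
# Stand-in for the pretrained first-layer weight of a ResNet (random values here).
conv1_weight = mx.nd.random.uniform(shape=(64, 3, 7, 7))

# Tile the filter along the input-channel axis and rescale, so that the output
# of the convolution keeps the same numerical range as the single-image network.
expanded = mx.nd.concat(*([conv1_weight] * num_input_images), dim=1) / num_input_images
print(expanded.shape)   # (64, 6, 7, 7)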
The encoder module is defined as:
class ResnetEncoder(nn.HybridBlock):
    def __init__(self, backbone, pretrained, num_input_images=1,
                 root=os.path.join(os.path.expanduser('~'), '.mxnet/models'),
                 ctx=cpu(), **kwargs):
        super(ResnetEncoder, self).__init__()

        self.num_ch_enc = np.array([64, 64, 128, 256, 512])

        resnets = {'resnet18': resnet18_v1b,
                   'resnet34': resnet34_v1b,
                   'resnet50': resnet50_v1s,
                   'resnet101': resnet101_v1s,
                   'resnet152': resnet152_v1s}

        num_layers = {'resnet18': 18,
                      'resnet34': 34,
                      'resnet50': 50,
                      'resnet101': 101,
                      'resnet152': 152}

        if backbone not in resnets:
            raise ValueError("{} is not a valid resnet".format(backbone))

        if num_input_images > 1:
            self.encoder = resnets[backbone](pretrained=False, ctx=ctx, **kwargs)
            if pretrained:
                filename = os.path.join(
                    root, 'resnet%d_v%db_multiple_inputs.params' % (num_layers[backbone], 1))
                if not os.path.isfile(filename):
                    from ..model_store import get_model_file
                    loaded = mx.nd.load(get_model_file('resnet%d_v%db' % (num_layers[backbone], 1),
                                                       tag=pretrained, root=root))
                    loaded['conv1.weight'] = mx.nd.concat(
                        *([loaded['conv1.weight']] * num_input_images), dim=1) / num_input_images
                    mx.nd.save(filename, loaded)
                self.encoder.load_parameters(filename, ctx=ctx)
                from ...data import ImageNet1kAttr
                attrib = ImageNet1kAttr()
                self.encoder.synset = attrib.synset
                self.encoder.classes = attrib.classes
                self.encoder.classes_long = attrib.classes_long
        else:
            self.encoder = resnets[backbone](pretrained=pretrained, ctx=ctx, **kwargs)

        if backbone not in ('resnet18', 'resnet34'):
            self.num_ch_enc[1:] *= 4

    def hybrid_forward(self, F, input_image):
        self.features = []
        x = (input_image - 0.45) / 0.225
        x = self.encoder.conv1(x)
        x = self.encoder.bn1(x)
        self.features.append(self.encoder.relu(x))
        self.features.append(self.encoder.layer1(self.encoder.maxpool(self.features[-1])))
        self.features.append(self.encoder.layer2(self.features[-1]))
        self.features.append(self.encoder.layer3(self.features[-1]))
        self.features.append(self.encoder.layer4(self.features[-1]))

        return self.features
The decoder module is a fully convolutional network with skip connections; it exploits feature maps at different scales and concatenates them after upsampling. The last layer uses a sigmoid activation, which constrains the output to the range [0, 1]; this means the depth decoder outputs a normalized disparity map.
It is defined as:
class DepthDecoder(nn.HybridBlock):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1,
                 use_skips=True):
        super(DepthDecoder, self).__init__()

        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales

        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder
        with self.name_scope():
            self.convs = OrderedDict()
            for i in range(4, -1, -1):
                # upconv_0
                num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]
                num_ch_out = self.num_ch_dec[i]
                self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out)

                # upconv_1
                num_ch_in = self.num_ch_dec[i]
                if self.use_skips and i > 0:
                    num_ch_in += self.num_ch_enc[i - 1]
                num_ch_out = self.num_ch_dec[i]
                self.convs[("upconv", i, 1)] = ConvBlock(num_ch_in, num_ch_out)

            for s in self.scales:
                self.convs[("dispconv", s)] = Conv3x3(
                    self.num_ch_dec[s], self.num_output_channels)

            # register blocks
            for k in self.convs:
                self.register_child(self.convs[k])
            self.decoder = nn.HybridSequential()
            self.decoder.add(*list(self.convs.values()))

            self.sigmoid = nn.Activation('sigmoid')

    def hybrid_forward(self, F, input_features):
        self.outputs = []

        # decoder
        x = input_features[-1]
        for i in range(4, -1, -1):
            x = self.convs[("upconv", i, 0)](x)
            x = [F.UpSampling(x, scale=2, sample_type='nearest')]
            if self.use_skips and i > 0:
                x += [input_features[i - 1]]
            x = F.concat(*x, dim=1)
            x = self.convs[("upconv", i, 1)](x)
            if i in self.scales:
                self.outputs.append(self.sigmoid(self.convs[("dispconv", i)](x)))

        return self.outputs
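Because the decoder outputs a normalized disparity in [0, 1], it has to be converted into depth before it can be used for reprojection. Below is a minimal sketch of this conversion, following the standard Monodepth2 convention (the library ships its own disp_to_depth helper, which is used later in generate_images_pred; the default depth range here is the usual KITTI setting):

def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Convert the sigmoid output (normalized disparity) into depth.

    Disparity is linearly rescaled to [1/max_depth, 1/min_depth] and then inverted.
    """
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    depth = 1.0 / scaled_disp
    return scaled_disp, depth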
The PoseNet decoder module is a fully convolutional network. It predicts the rotation using an axis-angle representation and scales the rotation and translation outputs by 0.01.
It is defined as:
class PoseDecoder(nn.HybridBlock):
    def __init__(self, num_ch_enc, num_input_features, num_frames_to_predict_for=2, stride=1):
        super(PoseDecoder, self).__init__()

        self.num_ch_enc = num_ch_enc
        self.num_input_features = num_input_features

        if num_frames_to_predict_for is None:
            num_frames_to_predict_for = num_input_features - 1
        self.num_frames_to_predict_for = num_frames_to_predict_for

        self.convs = OrderedDict()
        self.convs[("squeeze")] = nn.Conv2D(
            in_channels=self.num_ch_enc[-1], channels=256, kernel_size=1)
        self.convs[("pose", 0)] = nn.Conv2D(
            in_channels=num_input_features * 256, channels=256,
            kernel_size=3, strides=stride, padding=1)
        self.convs[("pose", 1)] = nn.Conv2D(
            in_channels=256, channels=256, kernel_size=3, strides=stride, padding=1)
        self.convs[("pose", 2)] = nn.Conv2D(
            in_channels=256, channels=6 * num_frames_to_predict_for, kernel_size=1)

        # register blocks
        for k in self.convs:
            self.register_child(self.convs[k])
        self.net = nn.HybridSequential()
        self.net.add(*list(self.convs.values()))

    def hybrid_forward(self, F, input_features):
        last_features = [f[-1] for f in input_features]

        cat_features = [F.relu(self.convs["squeeze"](f)) for f in last_features]
        cat_features = F.concat(*cat_features, dim=1)

        out = cat_features
        for i in range(3):
            out = self.convs[("pose", i)](out)
            if i != 2:
                out = F.relu(out)

        out = out.mean(3).mean(2)
        out = 0.01 * out.reshape(-1, self.num_frames_to_predict_for, 1, 6)

        axisangle = out[..., :3]
        translation = out[..., 3:]

        return axisangle, translation
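The axis-angle rotation and translation predicted by the pose decoder are later turned into a 4 × 4 transformation matrix (see transformation_from_parameters in trainer.py). A NumPy sketch of this conversion for a single pose, for illustration only:

import numpy as np

def pose_to_matrix(axisangle, translation):
    """Sketch: build a 4x4 transform from one axis-angle rotation and one translation."""
    angle = np.linalg.norm(axisangle)
    axis = axisangle / (angle + 1e-7)
    x, y, z = axis
    c, s = np.cos(angle), np.sin(angle)
    C = 1.0 - c
    # Rodrigues' rotation formula
    R = np.array([[x * x * C + c,     x * y * C - z * s, x * z * C + y * s],
                  [y * x * C + z * s, y * y * C + c,     y * z * C - x * s],
                  [z * x * C - y * s, z * y * C + x * s, z * z * C + c]])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = translation
    return T

print(pose_to_matrix(np.array([0.0, 0.0, 0.01]), np.array([0.05, 0.0, 0.0])))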
The Monodepth2 model is provided in gluoncv.model_zoo.MonoDepth2, and the PoseNet is provided in gluoncv.model_zoo.MonoDepth2PoseNet. To get a Monodepth2 model using ResNet18 as the base network:
model = gluoncv.model_zoo.get_monodepth2(backbone='resnet18')
print(model)
Output:
MonoDepth2(
(encoder): ResnetEncoder(
(encoder): ResNetV1b(
(conv1): Conv2D(3 -> 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu): Activation(relu)
(maxpool): MaxPool2D(size=(3, 3), stride=(2, 2), padding=(1, 1), ceil_mode=False, global_pool=False, pool_type=max, layout=NCHW)
(layer1): HybridSequential(
(0): BasicBlockV1b(
(conv1): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu1): Activation(relu)
(conv2): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu2): Activation(relu)
)
(1): BasicBlockV1b(
(conv1): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu1): Activation(relu)
(conv2): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu2): Activation(relu)
)
)
(layer2): HybridSequential(
(0): BasicBlockV1b(
(conv1): Conv2D(64 -> 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(relu1): Activation(relu)
(conv2): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(relu2): Activation(relu)
(downsample): HybridSequential(
(0): Conv2D(64 -> 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)
)
(1): BasicBlockV1b(
(conv1): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(relu1): Activation(relu)
(conv2): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(relu2): Activation(relu)
)
)
(layer3): HybridSequential(
(0): BasicBlockV1b(
(conv1): Conv2D(128 -> 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
(relu1): Activation(relu)
(conv2): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
(relu2): Activation(relu)
(downsample): HybridSequential(
(0): Conv2D(128 -> 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
)
)
(1): BasicBlockV1b(
(conv1): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
(relu1): Activation(relu)
(conv2): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
(relu2): Activation(relu)
)
)
(layer4): HybridSequential(
(0): BasicBlockV1b(
(conv1): Conv2D(256 -> 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(relu1): Activation(relu)
(conv2): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(relu2): Activation(relu)
(downsample): HybridSequential(
(0): Conv2D(256 -> 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
)
)
(1): BasicBlockV1b(
(conv1): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(relu1): Activation(relu)
(conv2): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(relu2): Activation(relu)
)
)
(avgpool): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True, global_pool=True, pool_type=avg, layout=NCHW)
(flat): Flatten
(fc): Dense(512 -> 1000, linear)
)
)
(decoder): DepthDecoder(
(decoder): HybridSequential(
(0): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(512 -> 256, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(1): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(512 -> 256, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(2): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(256 -> 128, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(3): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(256 -> 128, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(4): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(128 -> 64, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(5): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(128 -> 64, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(6): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(64 -> 32, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(7): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(96 -> 32, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(8): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(32 -> 16, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(9): ConvBlock(
(conv): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(16 -> 16, kernel_size=(3, 3), stride=(1, 1))
)
(nonlin): ELU(
)
)
(10): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(16 -> 1, kernel_size=(3, 3), stride=(1, 1))
)
(11): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(32 -> 1, kernel_size=(3, 3), stride=(1, 1))
)
(12): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(64 -> 1, kernel_size=(3, 3), stride=(1, 1))
)
(13): Conv3x3(
(pad): ReflectionPad2D(
)
(conv): Conv2D(128 -> 1, kernel_size=(3, 3), stride=(1, 1))
)
)
(sigmoid): Activation(sigmoid)
)
)
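As a quick sanity check, one can push a dummy image through the network and look at the multi-scale disparity outputs (this assumes that calling the model returns the decoder's list of disparity maps, as suggested by DepthDecoder above; the exact output format may differ between model variants):

# Dummy batch of one 640x192 RGB image at the KITTI training resolution.
dummy = mx.nd.random.uniform(shape=(1, 3, 192, 640))
disparities = model(dummy)
for disp in disparities:
    print(disp.shape)   # one (1, 1, H, W) disparity map per output scale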
To get a PoseNet using ResNet18 as the base network:
posenet = gluoncv.model_zoo.get_monodepth2posenet(backbone='resnet18')
print(posenet)
Output:
MonoDepth2PoseNet(
(encoder): ResnetEncoder(
(encoder): ResNetV1b(
(conv1): Conv2D(6 -> 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu): Activation(relu)
(maxpool): MaxPool2D(size=(3, 3), stride=(2, 2), padding=(1, 1), ceil_mode=False, global_pool=False, pool_type=max, layout=NCHW)
(layer1): HybridSequential(
(0): BasicBlockV1b(
(conv1): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu1): Activation(relu)
(conv2): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu2): Activation(relu)
)
(1): BasicBlockV1b(
(conv1): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu1): Activation(relu)
(conv2): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(relu2): Activation(relu)
)
)
(layer2): HybridSequential(
(0): BasicBlockV1b(
(conv1): Conv2D(64 -> 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(relu1): Activation(relu)
(conv2): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(relu2): Activation(relu)
(downsample): HybridSequential(
(0): Conv2D(64 -> 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)
)
(1): BasicBlockV1b(
(conv1): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(relu1): Activation(relu)
(conv2): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(relu2): Activation(relu)
)
)
(layer3): HybridSequential(
(0): BasicBlockV1b(
(conv1): Conv2D(128 -> 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
(relu1): Activation(relu)
(conv2): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
(relu2): Activation(relu)
(downsample): HybridSequential(
(0): Conv2D(128 -> 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
)
)
(1): BasicBlockV1b(
(conv1): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
(relu1): Activation(relu)
(conv2): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
(relu2): Activation(relu)
)
)
(layer4): HybridSequential(
(0): BasicBlockV1b(
(conv1): Conv2D(256 -> 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(relu1): Activation(relu)
(conv2): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(relu2): Activation(relu)
(downsample): HybridSequential(
(0): Conv2D(256 -> 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
)
)
(1): BasicBlockV1b(
(conv1): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(relu1): Activation(relu)
(conv2): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(relu2): Activation(relu)
)
)
(avgpool): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True, global_pool=True, pool_type=avg, layout=NCHW)
(flat): Flatten
(fc): Dense(512 -> 1000, linear)
)
)
(decoder): PoseDecoder(
(net): HybridSequential(
(0): Conv2D(512 -> 256, kernel_size=(1, 1), stride=(1, 1))
(1): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2D(256 -> 12, kernel_size=(1, 1), stride=(1, 1))
)
)
)
Dataset and Data Augmentation
Prepare the KITTI RAW Dataset
Here we give an example of training monodepth2 on the KITTI RAW dataset [Godard19]. First, we need to prepare the dataset. The official implementation of monodepth2 does not use all the data of the KITTI RAW dataset; here we use the same dataset and split method as [Godard19]. You need to download the split zip file and extract it into $(HOME)/.mxnet/datasets/kitti/. Then follow the commands below to get the dataset:
cd ~
mkdir -p .mxnet/datasets/kitti
cd .mxnet/datasets/kitti
wget https://github.com/KuangHaofei/GluonCV_Test/raw/master/monodepthv2/tutorials/splits.zip
unzip splits.zip
wget -i splits/kitti_archives_to_download.txt -P kitti_data/
cd kitti_data
unzip "*.zip"
Hint
You need 175GB of free disk space to download and extract this dataset. An SSD is recommended for faster speed. The time it takes to prepare the dataset depends on your internet connection and disk speed. For example, it takes around 2 hours on an AWS EC2 instance with EBS.
We provide self-supervised depth estimation datasets in gluoncv.data.
For example, we can easily get the KITTI RAW Stereo dataset:
import os
from gluoncv.data.kitti import readlines, dict_batchify_fn

train_filenames = os.path.join(
    os.path.expanduser("~"), '.mxnet/datasets/kitti/splits/eigen_full/train_files.txt')
train_filenames = readlines(train_filenames)
train_dataset = gluoncv.data.KITTIRAWDataset(
    filenames=train_filenames, height=192, width=640,
    frame_idxs=[0, -1, 1, "s"], num_scales=4, is_train=True, img_ext='.png')
print('Training images:', len(train_dataset))
# set batch_size = 12 for toy example
batch_size = 12
train_loader = gluon.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, batchify_fn=dict_batchify_fn,
    num_workers=12, pin_memory=True, last_batch='discard')
Here, the frame_idxs argument decides the input frames. It is a list whose first element must be 0, denoting the source frame. The other elements denote the target frames: a number means the relative frame id in the image sequence, and "s" means the opposite side of the stereo pair of the source image.
Data Augmentation
We follow the standard data augmentation routine to transform the input images. Here, we only apply RandomFlip with 50% probability to the input images.
Randomly pick one example for visualization:
import random
from datetime import datetime
random.seed(datetime.now().timestamp())
idx = random.randint(0, len(train_dataset) - 1)
data = train_dataset[idx]
input_img = data[("color", 0, 0)]
input_stereo_img = data[("color", 's', 0)]
input_gt = data['depth_gt']
input_img = np.transpose((input_img.asnumpy() * 255).astype(np.uint8), (1, 2, 0))
input_stereo_img = np.transpose((input_stereo_img.asnumpy() * 255).astype(np.uint8), (1, 2, 0))
input_gt = np.transpose((input_gt.asnumpy()).astype(np.uint8), (1, 2, 0))
from PIL import Image
input_img = Image.fromarray(input_img)
input_stereo_img = Image.fromarray(input_stereo_img)
input_gt = Image.fromarray(input_gt[:, :, 0])
input_img.save("input_img.png")
input_stereo_img.save("input_stereo_img.png")
input_gt.save("input_gt.png")
Plot the stereo image pair and the ground-truth depth of the left image:
from matplotlib import pyplot as plt
input_img = Image.open('input_img.png').convert('RGB')
input_stereo_img = Image.open('input_stereo_img.png').convert('RGB')
input_gt = Image.open('input_gt.png')
fig = plt.figure()
# subplot 1 for left image
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.75)
fig.add_subplot(3, 1, 1)
plt.title("left image")
plt.imshow(input_img)
# subplot 2 for right images
fig.add_subplot(3, 1, 2)
plt.title("right image")
plt.imshow(input_stereo_img)
# subplot 3 for the ground truth
fig.add_subplot(3, 1, 3)
plt.title("ground truth of left input (the reprojection of LiDAR data)")
plt.imshow(input_gt)
# display
plt.show()
The dataloader provides a dictionary that includes raw images, augmented images, camera intrinsics, camera extrinsics (for stereo), and ground-truth depth maps (for validation).
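To see exactly what a batch contains, one can pull a single batch from the loader and print its entries (an exploratory sketch, assuming dict_batchify_fn returns a plain dictionary of NDArrays):

# Fetch one batch and list its keys and tensor shapes.
for batch in train_loader:
    for key, value in batch.items():
        print(key, value.shape)
    break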
Training Details
Predicting camera poses
When training the network in mono or mono + stereo mode, we have to obtain the predicted camera poses through the PoseNet.
The pose prediction is defined as follows (please check out the full trainer.py for the complete implementation):
def predict_poses(self, inputs):
    outputs = {}
    pose_feats = {f_i: inputs["color_aug", f_i, 0] for f_i in self.opt.frame_ids}

    for f_i in self.opt.frame_ids[1:]:
        if f_i != "s":
            # To maintain ordering we always pass frames in temporal order
            if f_i < 0:
                pose_inputs = [pose_feats[f_i], pose_feats[0]]
            else:
                pose_inputs = [pose_feats[0], pose_feats[f_i]]

            axisangle, translation = self.posenet(mx.nd.concat(*pose_inputs, dim=1))
            outputs[("axisangle", 0, f_i)] = axisangle
            outputs[("translation", 0, f_i)] = translation

            # Invert the matrix if the frame id is negative
            outputs[("cam_T_cam", 0, f_i)] = transformation_from_parameters(
                axisangle[:, 0], translation[:, 0], invert=(f_i < 0))

    return outputs
Image Reconstruction
To train the network in a self-supervised manner, we have to reconstruct the source image from the target images according to the predicted depth and pose (or the camera extrinsics when using stereo pairs). Then the reprojection photometric loss between the reconstructed source image and the real source image is computed.
The whole process consists of three steps:
back-project every point of the target image into 3D space according to the depth and the camera intrinsics;
project the 3D points onto the image plane according to the camera extrinsics (pose) and intrinsics;
sample pixels from the source image at the projected points to reconstruct a new image (a Spatial Transformer Network (STN) is exploited so that the sampling is differentiable).
The back-projection (2D to 3D) is defined as:
class BackprojectDepth(nn.HybridBlock):
    """Layer to transform a depth image into a point cloud
    """
    def __init__(self, batch_size, height, width, ctx=mx.cpu()):
        super(BackprojectDepth, self).__init__()

        self.batch_size = batch_size
        self.height = height
        self.width = width
        self.ctx = ctx

        meshgrid = np.meshgrid(range(self.width), range(self.height), indexing='xy')
        id_coords = np.stack(meshgrid, axis=0).astype(np.float32)
        id_coords = mx.nd.array(id_coords).as_in_context(self.ctx)

        pix_coords = mx.nd.expand_dims(mx.nd.stack(*[id_coords[0].reshape(-1),
                                                     id_coords[1].reshape(-1)], axis=0),
                                       axis=0)
        pix_coords = pix_coords.repeat(repeats=batch_size, axis=0)
        pix_coords = pix_coords.as_in_context(self.ctx)

        with self.name_scope():
            self.id_coords = self.params.get('id_coords', shape=id_coords.shape,
                                             init=mx.init.Zero(), grad_req='null')
            self.id_coords.initialize(ctx=self.ctx)
            self.id_coords.set_data(mx.nd.array(id_coords))

            self.ones = self.params.get('ones',
                                        shape=(self.batch_size, 1, self.height * self.width),
                                        init=mx.init.One(), grad_req='null')
            self.ones.initialize(ctx=self.ctx)

            self.pix_coords = self.params.get('pix_coords',
                                              shape=(self.batch_size, 3, self.height * self.width),
                                              init=mx.init.Zero(), grad_req='null')
            self.pix_coords.initialize(ctx=self.ctx)
            self.pix_coords.set_data(mx.nd.concat(pix_coords, self.ones.data(), dim=1))

    def hybrid_forward(self, F, depth, inv_K, **kwargs):
        cam_points = F.batch_dot(inv_K[:, :3, :3], self.pix_coords.data())
        cam_points = depth.reshape(self.batch_size, 1, -1) * cam_points
        cam_points = F.concat(cam_points, self.ones.data(), dim=1)

        return cam_points
The projection (3D to 2D) is defined as:
class Project3D(nn.HybridBlock):
    """Layer which projects 3D points into a camera with intrinsics K and at position T
    """
    def __init__(self, batch_size, height, width, eps=1e-7):
        super(Project3D, self).__init__()

        self.batch_size = batch_size
        self.height = height
        self.width = width
        self.eps = eps

    def hybrid_forward(self, F, points, K, T):
        P = F.batch_dot(K, T)[:, :3, :]

        cam_points = F.batch_dot(P, points)

        cam_pix = cam_points[:, :2, :] / (cam_points[:, 2, :].expand_dims(1) + self.eps)
        cam_pix = cam_pix.reshape(self.batch_size, 2, self.height, self.width)

        x_src = cam_pix[:, 0, :, :] / (self.width - 1)
        y_src = cam_pix[:, 1, :, :] / (self.height - 1)
        pix_coords = F.concat(x_src.expand_dims(1), y_src.expand_dims(1), dim=1)
        pix_coords = (pix_coords - 0.5) * 2

        return pix_coords
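Putting the two layers together, warping a source image onto the target view can be sketched on dummy data (the shapes and intrinsics below are made up for illustration, and the sketch assumes the BackprojectDepth and Project3D classes above are available in scope):

batch_size, height, width = 1, 192, 640
backproject = BackprojectDepth(batch_size, height, width)
project = Project3D(batch_size, height, width)

depth = mx.nd.ones((batch_size, 1, height, width))                # dummy depth map
K = mx.nd.array([[[100., 0., 320., 0.],
                  [0., 100., 96., 0.],
                  [0., 0., 1., 0.],
                  [0., 0., 0., 1.]]])                              # made-up 4x4 intrinsics
inv_K = mx.nd.array(np.linalg.inv(K.asnumpy()))
T = mx.nd.array(np.eye(4)).expand_dims(axis=0)                     # identity camera motion
source_img = mx.nd.random.uniform(shape=(batch_size, 3, height, width))

cam_points = backproject(depth, inv_K)                             # step 1: 2D -> 3D
pix_coords = project(cam_points, K, T)                             # step 2: 3D -> 2D
warped = mx.nd.BilinearSampler(data=source_img, grid=pix_coords)   # step 3: differentiable sampling
print(warped.shape)                                                # (1, 3, 192, 640)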
The image reconstruction function is defined as follows (please check out the full trainer.py for the complete implementation):
def generate_images_pred(self, inputs, outputs):
    for scale in self.opt.scales:
        disp = outputs[("disp", scale)]
        if self.opt.v1_multiscale:
            source_scale = scale
        else:
            disp = mx.nd.contrib.BilinearResize2D(disp,
                                                  height=self.opt.height,
                                                  width=self.opt.width)
            source_scale = 0

        _, depth = disp_to_depth(disp, self.opt.min_depth, self.opt.max_depth)
        outputs[("depth", 0, scale)] = depth

        for i, frame_id in enumerate(self.opt.frame_ids[1:]):
            if frame_id == "s":
                T = inputs["stereo_T"]
            else:
                T = outputs[("cam_T_cam", 0, frame_id)]

            cam_points = self.backproject_depth[source_scale](depth,
                                                              inputs[("inv_K", source_scale)])
            pix_coords = self.project_3d[source_scale](cam_points,
                                                       inputs[("K", source_scale)],
                                                       T)

            outputs[("sample", frame_id, scale)] = pix_coords
            outputs[("color", frame_id, scale)] = mx.nd.BilinearSampler(
                data=inputs[("color", frame_id, source_scale)],
                grid=outputs[("sample", frame_id, scale)],
                name='sampler')

            if not self.opt.disable_automasking:
                outputs[("color_identity", frame_id, scale)] = \
                    inputs[("color", frame_id, source_scale)]
Training Losses
We apply a standard reprojection loss to train Monodepth2. As described in Monodepth2 [Godard19], the reprojection loss includes three parts: a multi-scale reprojection photometric loss (a combination of L1 loss and SSIM loss), an auto-masking loss, and the edge-aware smoothness loss from Monodepth [Godard17].
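The photometric term used inside compute_losses (compute_reprojection_loss in trainer.py) mixes SSIM and L1 with a 0.85/0.15 weighting. A simplified sketch of that component is shown below; ssim_fn is a placeholder for an SSIM-based per-pixel dissimilarity, and the library's own implementation should be preferred:

def compute_reprojection_loss_sketch(pred, target, ssim_fn=None):
    """Sketch: photometric reprojection loss as a weighted mix of SSIM and L1."""
    abs_diff = mx.nd.abs(target - pred)
    l1_loss = abs_diff.mean(axis=1, keepdims=True)
    if ssim_fn is None:
        # fall back to pure L1 when no SSIM implementation is supplied
        return l1_loss
    ssim_loss = ssim_fn(pred, target).mean(axis=1, keepdims=True)
    return 0.85 * ssim_loss + 0.15 * l1_loss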
The loss computation is defined as follows (please check out the full trainer.py for the complete implementation):
def compute_losses(self, inputs, outputs):
    """Compute the reprojection and smoothness losses for a minibatch
    """
    losses = {}
    total_loss = 0

    for scale in self.opt.scales:
        loss = 0
        reprojection_losses = []

        if self.opt.v1_multiscale:
            source_scale = scale
        else:
            source_scale = 0

        disp = outputs[("disp", scale)]
        color = inputs[("color", 0, scale)]
        target = inputs[("color", 0, source_scale)]

        for frame_id in self.opt.frame_ids[1:]:
            pred = outputs[("color", frame_id, scale)]
            reprojection_losses.append(self.compute_reprojection_loss(pred, target))
        reprojection_losses = mx.nd.concat(*reprojection_losses, dim=1)

        if not self.opt.disable_automasking:
            identity_reprojection_losses = []
            for frame_id in self.opt.frame_ids[1:]:
                pred = inputs[("color", frame_id, source_scale)]
                identity_reprojection_losses.append(
                    self.compute_reprojection_loss(pred, target))
            identity_reprojection_losses = mx.nd.concat(*identity_reprojection_losses, dim=1)

            if self.opt.avg_reprojection:
                identity_reprojection_loss = \
                    identity_reprojection_losses.mean(axis=1, keepdims=True)
            else:
                # save both images, and do min all at once below
                identity_reprojection_loss = identity_reprojection_losses

        if self.opt.avg_reprojection:
            reprojection_loss = reprojection_losses.mean(axis=1, keepdims=True)
        else:
            reprojection_loss = reprojection_losses

        if not self.opt.disable_automasking:
            # add random numbers to break ties
            identity_reprojection_loss = \
                identity_reprojection_loss + \
                mx.nd.random.randn(*identity_reprojection_loss.shape).as_in_context(
                    identity_reprojection_loss.context) * 0.00001

            combined = mx.nd.concat(identity_reprojection_loss, reprojection_loss, dim=1)
        else:
            combined = reprojection_loss

        if combined.shape[1] == 1:
            to_optimise = combined
        else:
            to_optimise = mx.nd.min(data=combined, axis=1)
            idxs = mx.nd.argmin(data=combined, axis=1)

        if not self.opt.disable_automasking:
            outputs["identity_selection/{}".format(scale)] = (
                idxs > identity_reprojection_loss.shape[1] - 1).astype('float')

        loss += to_optimise.mean()

        mean_disp = disp.mean(axis=2, keepdims=True).mean(axis=3, keepdims=True)
        norm_disp = disp / (mean_disp + 1e-7)
        smooth_loss = get_smooth_loss(norm_disp, color)

        loss = loss + self.opt.disparity_smoothness * smooth_loss / (2 ** scale)
        total_loss = total_loss + loss
        losses["loss/{}".format(scale)] = loss

    total_loss = total_loss / self.num_scales
    losses["loss"] = total_loss
    return losses
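The edge-aware smoothness term referenced above (get_smooth_loss) can be sketched as follows: disparity gradients are penalized, but the penalty is down-weighted where the image itself has strong edges. This is a simplified version following the Monodepth formulation; the library's own helper should be used in practice.

def get_smooth_loss_sketch(disp, img):
    """Sketch of the edge-aware smoothness loss on a normalized disparity map."""
    grad_disp_x = mx.nd.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    grad_disp_y = mx.nd.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    grad_img_x = mx.nd.abs(img[:, :, :, :-1] - img[:, :, :, 1:]).mean(axis=1, keepdims=True)
    grad_img_y = mx.nd.abs(img[:, :, :-1, :] - img[:, :, 1:, :]).mean(axis=1, keepdims=True)
    # suppress the smoothness penalty across image edges
    grad_disp_x = grad_disp_x * mx.nd.exp(-grad_img_x)
    grad_disp_y = grad_disp_y * mx.nd.exp(-grad_img_y)
    return grad_disp_x.mean() + grad_disp_y.mean()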
Learning Rate and Scheduling
Here we follow the standard strategy of monodepth2: the network is trained for 20 epochs using Adam. We use a 'step' learning rate scheduler for Monodepth2 training, provided in gluoncv.utils.LRScheduler. The learning rate is 1e-4 for the first 15 epochs and is then dropped to 1e-5 for the remaining epochs.
The scheduler setup is defined as:
lr_scheduler = gluoncv.utils.LRSequential([
    gluoncv.utils.LRScheduler(
        'step', base_lr=1e-4, nepochs=20, iters_per_epoch=len(train_dataset), step_epoch=[15])
])
optimizer_params = {'lr_scheduler': lr_scheduler,
'learning_rate': 1e-4}
Create the Adam optimizers
An example of creating the depth and pose optimizers is shown below:
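A minimal sketch, assuming model and posenet are the networks created above and optimizer_params is the dictionary defined a moment ago (see trainer.py for the exact setup):

# One trainer for the depth network and one for the pose network, both using Adam.
depth_optimizer = gluon.Trainer(model.collect_params(), 'adam', optimizer_params)
pose_optimizer = gluon.Trainer(posenet.collect_params(), 'adam', optimizer_params)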
Training Loop
Please check out the full trainer.py for the complete implementation. Here is an example of the training loop:
def train(self):
    """Run the entire training pipeline
    """
    self.logger.info('Starting Epoch: %d' % self.opt.start_epoch)
    self.logger.info('Total Epochs: %d' % self.opt.num_epochs)

    self.epoch = 0
    for self.epoch in range(self.opt.start_epoch, self.opt.num_epochs):
        self.run_epoch()
        self.val()

    # save final model
    self.save_model("final")
    self.save_model("best")

def run_epoch(self):
    """Run a single epoch of training and validation
    """
    print("Training")
    tbar = tqdm(self.train_loader)
    train_loss = 0.0
    for batch_idx, inputs in enumerate(tbar):
        with autograd.record(True):
            outputs, losses = self.process_batch(inputs)
            mx.nd.waitall()
        autograd.backward(losses['loss'])
        self.depth_optimizer.step(self.opt.batch_size, ignore_stale_grad=True)
        if self.use_pose_net:
            self.pose_optimizer.step(self.opt.batch_size, ignore_stale_grad=True)

        train_loss += losses['loss'].asscalar()
        tbar.set_description('Epoch %d, training loss %.3f' %
                             (self.epoch, train_loss / (batch_idx + 1)))

        if batch_idx % self.opt.log_frequency == 0:
            self.logger.info('Epoch %d iteration %04d/%04d: training loss %.3f' %
                             (self.epoch, batch_idx, len(self.train_loader),
                              train_loss / (batch_idx + 1)))
        mx.nd.waitall()
You can start training right away.
References
- Godard17
Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. "Unsupervised Monocular Depth Estimation with Left-Right Consistency." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
- Godard19
Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel Brostow. "Digging Into Self-Supervised Monocular Depth Estimation." Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2019.
Total running time of the script: (0 minutes 0.262 seconds)