7. Fine-tuning SOTA Video Models on Your Own Dataset

This is a step-by-step video action recognition tutorial using the GluonCV toolkit. Readers should have a basic knowledge of deep learning and be familiar with the Gluon API. New users may first go through A 60-minute Gluon Crash Course. You can Start Training Now or dive into the details below.

Fine-tuning is an important way to obtain good video models on your own data when you don't have large annotated datasets or the computational resources to train a model from scratch. In this tutorial, we provide a simple, unified solution. The only thing you need to prepare is a text file containing the information of your videos (e.g., the video path); we will take care of the rest. You can start fine-tuning from many popular pre-trained models (e.g., I3D, I3D-nonlocal, SlowFast) using a single command line.

Start Training Now

Note

Feel free to skip this tutorial because the training script is self-complete and ready to launch.

Download the full Python script: train_recognizer.py

For more training command options, please run python train_recognizer.py -h. Please check the Model Zoo for training commands that reproduce the pre-trained models.

First, let's import the necessary libraries into Python.

from __future__ import division

import argparse, time, logging, os, sys, math

import numpy as np
import mxnet as mx
import gluoncv as gcv
from mxnet import gluon, nd, init, context
from mxnet import autograd as ag
from mxnet.gluon import nn
from mxnet.gluon.data.vision import transforms

from gluoncv.data.transforms import video
from gluoncv.data import VideoClsCustom
from gluoncv.model_zoo import get_model
from gluoncv.utils import makedirs, LRSequential, LRScheduler, split_and_load, TrainingHistory

Custom DataLoader

We provide a general dataloader for you to use on your own dataset. Your data can be stored in any hierarchy, and can be in either video format or already decoded into frames. The only thing you need to prepare is a text file, train.txt.

If your data is stored in image format (already decoded into frames), your train.txt should look like:

video_001 200 0
video_002 300 0
video_003 100 1
video_004 400 2
......
video_100 200 10

There are three items in each line, separated by spaces. The first item is the path to your training video, e.g., video_001. It should be a folder containing the frames of video_001.mp4. The second item is the number of frames in each video, e.g., 200. The third item is the label of the video, e.g., 0.

If your data is stored in video format, your train.txt should look like:

video_001.mp4 200 0
video_002.mp4 300 0
video_003.mp4 100 1
video_004.mp4 400 2
......
video_100.mp4 200 10

Similarly, there are three items in each line, separated by spaces. The first item is the path to your training video, e.g., video_001.mp4. The second item is the number of frames in each video, but you can put any number here because our video loader will compute the number of frames again automatically during training. The third item is the label of that video, e.g., 0.
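As a side note, the text file itself is easy to generate with a short script. Below is a minimal sketch, assuming your videos are stored as data_root/<class_name>/<video>.mp4 and that sorted class folder names map to integer labels; both the layout and the placeholder frame count are illustrative assumptions. For the frame format, you would instead write the folder path and the actual number of frame images it contains.

import os

# Minimal sketch: write train.txt for videos laid out as
# data_root/<class_name>/<video>.mp4 (an assumed layout, adjust to your data).
data_root = 'data'  # hypothetical directory
class_names = sorted(os.listdir(data_root))

with open('train.txt', 'w') as f:
    for label, cls in enumerate(class_names):
        cls_dir = os.path.join(data_root, cls)
        for video in sorted(os.listdir(cls_dir)):
            if video.endswith('.mp4'):
                # The frame count (second field) is just a placeholder here,
                # since the video loader recounts frames during training.
                f.write('%s 100 %d\n' % (os.path.join(cls_dir, video), label))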

Once you prepare the train.txt, you are good to go. Just use our general dataloader VideoClsCustom to load your data.

In this tutorial, we will use the UCF101 dataset as an example. For your own dataset, simply replace the values of root and setting with your data directory and your prepared text file. Let's first define some basics.

# Number of GPUs to use and the corresponding compute contexts
num_gpus = 1
ctx = [mx.gpu(i) for i in range(num_gpus)]
# Standard video augmentation: multi-scale cropping to 224x224 plus
# ImageNet mean/std normalization
transform_train = video.VideoGroupTrainTransform(size=(224, 224), scale_ratios=[1.0, 0.8], mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
# Batch size per GPU and number of data-loading workers
per_device_batch_size = 5
num_workers = 0
batch_size = per_device_batch_size * num_gpus

train_dataset = VideoClsCustom(root=os.path.expanduser('~/.mxnet/datasets/ucf101/rawframes'),
                               setting=os.path.expanduser('~/.mxnet/datasets/ucf101/ucfTrainTestlist/ucf101_train_split_1_rawframes.txt'),
                               train=True,
                               new_length=32,
                               transform=transform_train)
print('Load %d training samples.' % len(train_dataset))
train_data = gluon.data.DataLoader(train_dataset, batch_size=batch_size,
                                   shuffle=True, num_workers=num_workers)

Out:

Load 9537 training samples.
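Before moving on, it can be worth pulling a single batch from the dataloader to verify the tensor layout; the training loop later folds any extra leading dimension into the batch axis. This optional sanity check makes no assumptions beyond the loader defined above.

# Optional sanity check: fetch one batch and print its shapes.
for batch in train_data:
    data, label = batch[0], batch[1]
    print('data shape:', data.shape)    # a batch of video clips
    print('label shape:', label.shape)  # one integer label per clip
    break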

Custom Network

You can always define your own network architecture. Here, we want to show how to fine-tune on a pre-trained model. Since the I3D model is a very popular network, we will use the I3D model with a ResNet50 backbone trained on the Kinetics400 dataset (i.e., i3d_resnet50_v1_kinetics400) as an example.

For simple fine-tuning, people usually just replace the last classification (fully-connected) layer with one matching the number of classes in your dataset, without changing anything else. In GluonCV, you can get your customized model with only one line of code.

net = get_model(name='i3d_resnet50_v1_custom', nclass=101)
net.collect_params().reset_ctx(ctx)
print(net)

Out:

conv14_weight is done with shape:  (64, 3, 5, 7, 7)
batchnorm5_gamma is done with shape:  (64,)
batchnorm5_beta is done with shape:  (64,)
batchnorm5_running_mean is done with shape:  (64,)
batchnorm5_running_var is done with shape:  (64,)
layer1_0_conv0_weight is done with shape:  (64, 64, 3, 1, 1)
layer1_0_batchnorm0_gamma is done with shape:  (64,)
layer1_0_batchnorm0_beta is done with shape:  (64,)
layer1_0_batchnorm0_running_mean is done with shape:  (64,)
layer1_0_batchnorm0_running_var is done with shape:  (64,)
layer1_0_conv1_weight is done with shape:  (64, 64, 1, 3, 3)
layer1_0_batchnorm1_gamma is done with shape:  (64,)
layer1_0_batchnorm1_beta is done with shape:  (64,)
layer1_0_batchnorm1_running_mean is done with shape:  (64,)
layer1_0_batchnorm1_running_var is done with shape:  (64,)
layer1_0_conv2_weight is done with shape:  (256, 64, 1, 1, 1)
layer1_0_batchnorm2_gamma is done with shape:  (256,)
layer1_0_batchnorm2_beta is done with shape:  (256,)
layer1_0_batchnorm2_running_mean is done with shape:  (256,)
layer1_0_batchnorm2_running_var is done with shape:  (256,)
layer1_downsample_conv0_weight is done with shape:  (256, 64, 1, 1, 1)
layer1_downsample_batchnorm0_gamma is done with shape:  (256,)
layer1_downsample_batchnorm0_beta is done with shape:  (256,)
layer1_downsample_batchnorm0_running_mean is done with shape:  (256,)
layer1_downsample_batchnorm0_running_var is done with shape:  (256,)
layer1_1_conv0_weight is done with shape:  (64, 256, 3, 1, 1)
layer1_1_batchnorm0_gamma is done with shape:  (64,)
layer1_1_batchnorm0_beta is done with shape:  (64,)
layer1_1_batchnorm0_running_mean is done with shape:  (64,)
layer1_1_batchnorm0_running_var is done with shape:  (64,)
layer1_1_conv1_weight is done with shape:  (64, 64, 1, 3, 3)
layer1_1_batchnorm1_gamma is done with shape:  (64,)
layer1_1_batchnorm1_beta is done with shape:  (64,)
layer1_1_batchnorm1_running_mean is done with shape:  (64,)
layer1_1_batchnorm1_running_var is done with shape:  (64,)
layer1_1_conv2_weight is done with shape:  (256, 64, 1, 1, 1)
layer1_1_batchnorm2_gamma is done with shape:  (256,)
layer1_1_batchnorm2_beta is done with shape:  (256,)
layer1_1_batchnorm2_running_mean is done with shape:  (256,)
layer1_1_batchnorm2_running_var is done with shape:  (256,)
layer1_2_conv0_weight is done with shape:  (64, 256, 3, 1, 1)
layer1_2_batchnorm0_gamma is done with shape:  (64,)
layer1_2_batchnorm0_beta is done with shape:  (64,)
layer1_2_batchnorm0_running_mean is done with shape:  (64,)
layer1_2_batchnorm0_running_var is done with shape:  (64,)
layer1_2_conv1_weight is done with shape:  (64, 64, 1, 3, 3)
layer1_2_batchnorm1_gamma is done with shape:  (64,)
layer1_2_batchnorm1_beta is done with shape:  (64,)
layer1_2_batchnorm1_running_mean is done with shape:  (64,)
layer1_2_batchnorm1_running_var is done with shape:  (64,)
layer1_2_conv2_weight is done with shape:  (256, 64, 1, 1, 1)
layer1_2_batchnorm2_gamma is done with shape:  (256,)
layer1_2_batchnorm2_beta is done with shape:  (256,)
layer1_2_batchnorm2_running_mean is done with shape:  (256,)
layer1_2_batchnorm2_running_var is done with shape:  (256,)
layer2_0_conv0_weight is done with shape:  (128, 256, 3, 1, 1)
layer2_0_batchnorm0_gamma is done with shape:  (128,)
layer2_0_batchnorm0_beta is done with shape:  (128,)
layer2_0_batchnorm0_running_mean is done with shape:  (128,)
layer2_0_batchnorm0_running_var is done with shape:  (128,)
layer2_0_conv1_weight is done with shape:  (128, 128, 1, 3, 3)
layer2_0_batchnorm1_gamma is done with shape:  (128,)
layer2_0_batchnorm1_beta is done with shape:  (128,)
layer2_0_batchnorm1_running_mean is done with shape:  (128,)
layer2_0_batchnorm1_running_var is done with shape:  (128,)
layer2_0_conv2_weight is done with shape:  (512, 128, 1, 1, 1)
layer2_0_batchnorm2_gamma is done with shape:  (512,)
layer2_0_batchnorm2_beta is done with shape:  (512,)
layer2_0_batchnorm2_running_mean is done with shape:  (512,)
layer2_0_batchnorm2_running_var is done with shape:  (512,)
layer2_downsample_conv0_weight is done with shape:  (512, 256, 1, 1, 1)
layer2_downsample_batchnorm0_gamma is done with shape:  (512,)
layer2_downsample_batchnorm0_beta is done with shape:  (512,)
layer2_downsample_batchnorm0_running_mean is done with shape:  (512,)
layer2_downsample_batchnorm0_running_var is done with shape:  (512,)
layer2_1_conv0_weight is done with shape:  (128, 512, 1, 1, 1)
layer2_1_batchnorm0_gamma is done with shape:  (128,)
layer2_1_batchnorm0_beta is done with shape:  (128,)
layer2_1_batchnorm0_running_mean is done with shape:  (128,)
layer2_1_batchnorm0_running_var is done with shape:  (128,)
layer2_1_conv1_weight is done with shape:  (128, 128, 1, 3, 3)
layer2_1_batchnorm1_gamma is done with shape:  (128,)
layer2_1_batchnorm1_beta is done with shape:  (128,)
layer2_1_batchnorm1_running_mean is done with shape:  (128,)
layer2_1_batchnorm1_running_var is done with shape:  (128,)
layer2_1_conv2_weight is done with shape:  (512, 128, 1, 1, 1)
layer2_1_batchnorm2_gamma is done with shape:  (512,)
layer2_1_batchnorm2_beta is done with shape:  (512,)
layer2_1_batchnorm2_running_mean is done with shape:  (512,)
layer2_1_batchnorm2_running_var is done with shape:  (512,)
layer2_2_conv0_weight is done with shape:  (128, 512, 3, 1, 1)
layer2_2_batchnorm0_gamma is done with shape:  (128,)
layer2_2_batchnorm0_beta is done with shape:  (128,)
layer2_2_batchnorm0_running_mean is done with shape:  (128,)
layer2_2_batchnorm0_running_var is done with shape:  (128,)
layer2_2_conv1_weight is done with shape:  (128, 128, 1, 3, 3)
layer2_2_batchnorm1_gamma is done with shape:  (128,)
layer2_2_batchnorm1_beta is done with shape:  (128,)
layer2_2_batchnorm1_running_mean is done with shape:  (128,)
layer2_2_batchnorm1_running_var is done with shape:  (128,)
layer2_2_conv2_weight is done with shape:  (512, 128, 1, 1, 1)
layer2_2_batchnorm2_gamma is done with shape:  (512,)
layer2_2_batchnorm2_beta is done with shape:  (512,)
layer2_2_batchnorm2_running_mean is done with shape:  (512,)
layer2_2_batchnorm2_running_var is done with shape:  (512,)
layer2_3_conv0_weight is done with shape:  (128, 512, 1, 1, 1)
layer2_3_batchnorm0_gamma is done with shape:  (128,)
layer2_3_batchnorm0_beta is done with shape:  (128,)
layer2_3_batchnorm0_running_mean is done with shape:  (128,)
layer2_3_batchnorm0_running_var is done with shape:  (128,)
layer2_3_conv1_weight is done with shape:  (128, 128, 1, 3, 3)
layer2_3_batchnorm1_gamma is done with shape:  (128,)
layer2_3_batchnorm1_beta is done with shape:  (128,)
layer2_3_batchnorm1_running_mean is done with shape:  (128,)
layer2_3_batchnorm1_running_var is done with shape:  (128,)
layer2_3_conv2_weight is done with shape:  (512, 128, 1, 1, 1)
layer2_3_batchnorm2_gamma is done with shape:  (512,)
layer2_3_batchnorm2_beta is done with shape:  (512,)
layer2_3_batchnorm2_running_mean is done with shape:  (512,)
layer2_3_batchnorm2_running_var is done with shape:  (512,)
layer3_0_conv0_weight is done with shape:  (256, 512, 3, 1, 1)
layer3_0_batchnorm0_gamma is done with shape:  (256,)
layer3_0_batchnorm0_beta is done with shape:  (256,)
layer3_0_batchnorm0_running_mean is done with shape:  (256,)
layer3_0_batchnorm0_running_var is done with shape:  (256,)
layer3_0_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_0_batchnorm1_gamma is done with shape:  (256,)
layer3_0_batchnorm1_beta is done with shape:  (256,)
layer3_0_batchnorm1_running_mean is done with shape:  (256,)
layer3_0_batchnorm1_running_var is done with shape:  (256,)
layer3_0_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_0_batchnorm2_gamma is done with shape:  (1024,)
layer3_0_batchnorm2_beta is done with shape:  (1024,)
layer3_0_batchnorm2_running_mean is done with shape:  (1024,)
layer3_0_batchnorm2_running_var is done with shape:  (1024,)
layer3_downsample_conv0_weight is done with shape:  (1024, 512, 1, 1, 1)
layer3_downsample_batchnorm0_gamma is done with shape:  (1024,)
layer3_downsample_batchnorm0_beta is done with shape:  (1024,)
layer3_downsample_batchnorm0_running_mean is done with shape:  (1024,)
layer3_downsample_batchnorm0_running_var is done with shape:  (1024,)
layer3_1_conv0_weight is done with shape:  (256, 1024, 1, 1, 1)
layer3_1_batchnorm0_gamma is done with shape:  (256,)
layer3_1_batchnorm0_beta is done with shape:  (256,)
layer3_1_batchnorm0_running_mean is done with shape:  (256,)
layer3_1_batchnorm0_running_var is done with shape:  (256,)
layer3_1_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_1_batchnorm1_gamma is done with shape:  (256,)
layer3_1_batchnorm1_beta is done with shape:  (256,)
layer3_1_batchnorm1_running_mean is done with shape:  (256,)
layer3_1_batchnorm1_running_var is done with shape:  (256,)
layer3_1_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_1_batchnorm2_gamma is done with shape:  (1024,)
layer3_1_batchnorm2_beta is done with shape:  (1024,)
layer3_1_batchnorm2_running_mean is done with shape:  (1024,)
layer3_1_batchnorm2_running_var is done with shape:  (1024,)
layer3_2_conv0_weight is done with shape:  (256, 1024, 3, 1, 1)
layer3_2_batchnorm0_gamma is done with shape:  (256,)
layer3_2_batchnorm0_beta is done with shape:  (256,)
layer3_2_batchnorm0_running_mean is done with shape:  (256,)
layer3_2_batchnorm0_running_var is done with shape:  (256,)
layer3_2_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_2_batchnorm1_gamma is done with shape:  (256,)
layer3_2_batchnorm1_beta is done with shape:  (256,)
layer3_2_batchnorm1_running_mean is done with shape:  (256,)
layer3_2_batchnorm1_running_var is done with shape:  (256,)
layer3_2_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_2_batchnorm2_gamma is done with shape:  (1024,)
layer3_2_batchnorm2_beta is done with shape:  (1024,)
layer3_2_batchnorm2_running_mean is done with shape:  (1024,)
layer3_2_batchnorm2_running_var is done with shape:  (1024,)
layer3_3_conv0_weight is done with shape:  (256, 1024, 1, 1, 1)
layer3_3_batchnorm0_gamma is done with shape:  (256,)
layer3_3_batchnorm0_beta is done with shape:  (256,)
layer3_3_batchnorm0_running_mean is done with shape:  (256,)
layer3_3_batchnorm0_running_var is done with shape:  (256,)
layer3_3_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_3_batchnorm1_gamma is done with shape:  (256,)
layer3_3_batchnorm1_beta is done with shape:  (256,)
layer3_3_batchnorm1_running_mean is done with shape:  (256,)
layer3_3_batchnorm1_running_var is done with shape:  (256,)
layer3_3_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_3_batchnorm2_gamma is done with shape:  (1024,)
layer3_3_batchnorm2_beta is done with shape:  (1024,)
layer3_3_batchnorm2_running_mean is done with shape:  (1024,)
layer3_3_batchnorm2_running_var is done with shape:  (1024,)
layer3_4_conv0_weight is done with shape:  (256, 1024, 3, 1, 1)
layer3_4_batchnorm0_gamma is done with shape:  (256,)
layer3_4_batchnorm0_beta is done with shape:  (256,)
layer3_4_batchnorm0_running_mean is done with shape:  (256,)
layer3_4_batchnorm0_running_var is done with shape:  (256,)
layer3_4_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_4_batchnorm1_gamma is done with shape:  (256,)
layer3_4_batchnorm1_beta is done with shape:  (256,)
layer3_4_batchnorm1_running_mean is done with shape:  (256,)
layer3_4_batchnorm1_running_var is done with shape:  (256,)
layer3_4_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_4_batchnorm2_gamma is done with shape:  (1024,)
layer3_4_batchnorm2_beta is done with shape:  (1024,)
layer3_4_batchnorm2_running_mean is done with shape:  (1024,)
layer3_4_batchnorm2_running_var is done with shape:  (1024,)
layer3_5_conv0_weight is done with shape:  (256, 1024, 1, 1, 1)
layer3_5_batchnorm0_gamma is done with shape:  (256,)
layer3_5_batchnorm0_beta is done with shape:  (256,)
layer3_5_batchnorm0_running_mean is done with shape:  (256,)
layer3_5_batchnorm0_running_var is done with shape:  (256,)
layer3_5_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_5_batchnorm1_gamma is done with shape:  (256,)
layer3_5_batchnorm1_beta is done with shape:  (256,)
layer3_5_batchnorm1_running_mean is done with shape:  (256,)
layer3_5_batchnorm1_running_var is done with shape:  (256,)
layer3_5_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_5_batchnorm2_gamma is done with shape:  (1024,)
layer3_5_batchnorm2_beta is done with shape:  (1024,)
layer3_5_batchnorm2_running_mean is done with shape:  (1024,)
layer3_5_batchnorm2_running_var is done with shape:  (1024,)
layer4_0_conv0_weight is done with shape:  (512, 1024, 1, 1, 1)
layer4_0_batchnorm0_gamma is done with shape:  (512,)
layer4_0_batchnorm0_beta is done with shape:  (512,)
layer4_0_batchnorm0_running_mean is done with shape:  (512,)
layer4_0_batchnorm0_running_var is done with shape:  (512,)
layer4_0_conv1_weight is done with shape:  (512, 512, 1, 3, 3)
layer4_0_batchnorm1_gamma is done with shape:  (512,)
layer4_0_batchnorm1_beta is done with shape:  (512,)
layer4_0_batchnorm1_running_mean is done with shape:  (512,)
layer4_0_batchnorm1_running_var is done with shape:  (512,)
layer4_0_conv2_weight is done with shape:  (2048, 512, 1, 1, 1)
layer4_0_batchnorm2_gamma is done with shape:  (2048,)
layer4_0_batchnorm2_beta is done with shape:  (2048,)
layer4_0_batchnorm2_running_mean is done with shape:  (2048,)
layer4_0_batchnorm2_running_var is done with shape:  (2048,)
layer4_downsample_conv0_weight is done with shape:  (2048, 1024, 1, 1, 1)
layer4_downsample_batchnorm0_gamma is done with shape:  (2048,)
layer4_downsample_batchnorm0_beta is done with shape:  (2048,)
layer4_downsample_batchnorm0_running_mean is done with shape:  (2048,)
layer4_downsample_batchnorm0_running_var is done with shape:  (2048,)
layer4_1_conv0_weight is done with shape:  (512, 2048, 3, 1, 1)
layer4_1_batchnorm0_gamma is done with shape:  (512,)
layer4_1_batchnorm0_beta is done with shape:  (512,)
layer4_1_batchnorm0_running_mean is done with shape:  (512,)
layer4_1_batchnorm0_running_var is done with shape:  (512,)
layer4_1_conv1_weight is done with shape:  (512, 512, 1, 3, 3)
layer4_1_batchnorm1_gamma is done with shape:  (512,)
layer4_1_batchnorm1_beta is done with shape:  (512,)
layer4_1_batchnorm1_running_mean is done with shape:  (512,)
layer4_1_batchnorm1_running_var is done with shape:  (512,)
layer4_1_conv2_weight is done with shape:  (2048, 512, 1, 1, 1)
layer4_1_batchnorm2_gamma is done with shape:  (2048,)
layer4_1_batchnorm2_beta is done with shape:  (2048,)
layer4_1_batchnorm2_running_mean is done with shape:  (2048,)
layer4_1_batchnorm2_running_var is done with shape:  (2048,)
layer4_2_conv0_weight is done with shape:  (512, 2048, 1, 1, 1)
layer4_2_batchnorm0_gamma is done with shape:  (512,)
layer4_2_batchnorm0_beta is done with shape:  (512,)
layer4_2_batchnorm0_running_mean is done with shape:  (512,)
layer4_2_batchnorm0_running_var is done with shape:  (512,)
layer4_2_conv1_weight is done with shape:  (512, 512, 1, 3, 3)
layer4_2_batchnorm1_gamma is done with shape:  (512,)
layer4_2_batchnorm1_beta is done with shape:  (512,)
layer4_2_batchnorm1_running_mean is done with shape:  (512,)
layer4_2_batchnorm1_running_var is done with shape:  (512,)
layer4_2_conv2_weight is done with shape:  (2048, 512, 1, 1, 1)
layer4_2_batchnorm2_gamma is done with shape:  (2048,)
layer4_2_batchnorm2_beta is done with shape:  (2048,)
layer4_2_batchnorm2_running_mean is done with shape:  (2048,)
layer4_2_batchnorm2_running_var is done with shape:  (2048,)
dense2_weight is skipped with shape:  (101, 2048)
dense2_bias is skipped with shape:  (101,)
Downloading /root/.mxnet/models/i3d_resnet50_v1_kinetics400-568a722e.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/i3d_resnet50_v1_kinetics400-568a722e.zip...

I3D_ResNetV1(
  (first_stage): HybridSequential(
    (0): Conv3D(3 -> 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3), bias=False)
    (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
    (2): Activation(relu)
    (3): MaxPool3D(size=(1, 3, 3), stride=(2, 2, 2), padding=(0, 1, 1), ceil_mode=False, global_pool=False, pool_type=max, layout=NCDHW)
  )
  (pool2): MaxPool3D(size=(2, 1, 1), stride=(2, 1, 1), padding=(0, 0, 0), ceil_mode=False, global_pool=False, pool_type=max, layout=NCDHW)
  (res_layers): HybridSequential(
    (0): HybridSequential(
      (0): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(64 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (2): Activation(relu)
          (3): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (5): Activation(relu)
          (6): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        )
        (conv1): Conv3D(64 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (conv3): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (relu): Activation(relu)
        (downsample): HybridSequential(
          (0): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
        )
      )
      (1): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(256 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (2): Activation(relu)
          (3): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (5): Activation(relu)
          (6): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        )
        (conv1): Conv3D(256 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (conv3): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (relu): Activation(relu)
      )
      (2): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(256 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (2): Activation(relu)
          (3): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (5): Activation(relu)
          (6): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        )
        (conv1): Conv3D(256 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (conv3): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (relu): Activation(relu)
      )
    )
    (1): HybridSequential(
      (0): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(256 -> 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (2): Activation(relu)
          (3): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (5): Activation(relu)
          (6): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        )
        (conv1): Conv3D(256 -> 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (conv3): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (relu): Activation(relu)
        (downsample): HybridSequential(
          (0): Conv3D(256 -> 512, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
        )
      )
      (1): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(512 -> 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (2): Activation(relu)
          (3): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (5): Activation(relu)
          (6): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        )
        (conv1): Conv3D(512 -> 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (conv3): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (relu): Activation(relu)
      )
      (2): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(512 -> 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (2): Activation(relu)
          (3): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (5): Activation(relu)
          (6): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        )
        (conv1): Conv3D(512 -> 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (conv3): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (relu): Activation(relu)
      )
      (3): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(512 -> 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (2): Activation(relu)
          (3): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (5): Activation(relu)
          (6): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        )
        (conv1): Conv3D(512 -> 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (conv3): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (relu): Activation(relu)
      )
    )
    (2): HybridSequential(
      (0): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(512 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(512 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
        (downsample): HybridSequential(
          (0): Conv3D(512 -> 1024, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
        )
      )
      (1): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
      (2): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
      (3): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
      (4): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
      (5): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
    )
    (3): HybridSequential(
      (0): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (2): Activation(relu)
          (3): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (5): Activation(relu)
          (6): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        )
        (conv1): Conv3D(1024 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (conv3): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        (relu): Activation(relu)
        (downsample): HybridSequential(
          (0): Conv3D(1024 -> 2048, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=2048)
        )
      )
      (1): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(2048 -> 512, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (2): Activation(relu)
          (3): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (5): Activation(relu)
          (6): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        )
        (conv1): Conv3D(2048 -> 512, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (conv3): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        (relu): Activation(relu)
      )
      (2): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(2048 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (2): Activation(relu)
          (3): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (5): Activation(relu)
          (6): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        )
        (conv1): Conv3D(2048 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (conv3): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        (relu): Activation(relu)
      )
    )
  )
  (st_avg): GlobalAvgPool3D(size=(1, 1, 1), stride=(1, 1, 1), padding=(0, 0, 0), ceil_mode=True, global_pool=True, pool_type=avg, layout=NCDHW)
  (head): HybridSequential(
    (0): Dropout(p = 0.8, axes=())
    (1): Dense(2048 -> 101, linear)
  )
  (fc): Dense(2048 -> 101, linear)
)
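If your dataset is very small, you may additionally want to freeze the pre-trained backbone and update only the newly initialized classification head. The tutorial itself fine-tunes all layers, so treat the following as an optional sketch; in particular, the name-matching rule for the final dense layer is an assumption you should check against the printout above.

# Optional: freeze all pre-trained parameters so that only the new
# classification head receives gradient updates.
for param_name, param in net.collect_params().items():
    if 'dense' not in param_name:  # assumed naming of the final layer, verify first
        param.grad_req = 'null'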

We also provide other customized network architectures for you to use on your own dataset. Simply change the dataset part of any pre-trained model name to custom, e.g., slowfast_4x16_resnet50_kinetics400 to slowfast_4x16_resnet50_custom.
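For example, to start from a SlowFast network pre-trained on Kinetics400 instead, only the model name changes:

# Same one-liner as before, now with a SlowFast backbone.
net = get_model(name='slowfast_4x16_resnet50_custom', nclass=101)
net.collect_params().reset_ctx(ctx)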

Once you have the dataloader and network for your own dataset, the rest is the same as in previous tutorials: just define the optimizer, loss and metric, then start training.

Optimizer, Loss and Metric

# Learning rate decay factor
lr_decay = 0.1
# Epochs where learning rate decays
lr_decay_epoch = [40, 80, 100]

# Stochastic gradient descent
optimizer = 'sgd'
# Set parameters
optimizer_params = {'learning_rate': 0.001, 'wd': 0.0001, 'momentum': 0.9}

# Define our trainer for net
trainer = gluon.Trainer(net.collect_params(), optimizer, optimizer_params)

In order to optimize our model, we need a loss function. For classification tasks, we usually use softmax cross-entropy as the loss function.

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

For simplicity, we use accuracy as the metric to monitor the training process. Besides, we record metric values and will print them at the end of training.

train_metric = mx.metric.Accuracy()
train_history = TrainingHistory(['training-acc'])

Training

After all the preparations, we can finally start training! Following is the script.

Note

In order to finish this tutorial quickly, we only fine-tune for 3 epochs on UCF101, with 100 iterations per epoch. In your experiments, you can set the hyper-parameters according to your dataset.

epochs = 3  # see the note above; adjust for your own dataset
lr_decay_count = 0

for epoch in range(epochs):
    tic = time.time()
    train_metric.reset()
    train_loss = 0

    # Learning rate decay
    if epoch == lr_decay_epoch[lr_decay_count]:
        trainer.set_learning_rate(trainer.learning_rate*lr_decay)
        lr_decay_count += 1

    # Loop through each batch of training data
    for i, batch in enumerate(train_data):
        # Extract data and label
        data = split_and_load(batch[0], ctx_list=ctx, batch_axis=0)
        label = split_and_load(batch[1], ctx_list=ctx, batch_axis=0)

        # AutoGrad
        with ag.record():
            output = []
            for _, X in enumerate(data):
                X = X.reshape((-1,) + X.shape[2:])
                pred = net(X)
                output.append(pred)
            loss = [loss_fn(yhat, y) for yhat, y in zip(output, label)]

        # Backpropagation
        for l in loss:
            l.backward()

        # Optimize
        trainer.step(batch_size)

        # Update metrics
        train_loss += sum([l.mean().asscalar() for l in loss])
        train_metric.update(label, output)

        if i == 100:
            break

    name, acc = train_metric.get()

    # Update history and print metrics
    train_history.update([acc])
    print('[Epoch %d] train=%f loss=%f time: %f' %
        (epoch, acc, train_loss / (i+1), time.time()-tic))

# We can plot the metric scores with:
train_history.plot()
[Figure: finetune custom (training accuracy curve)]

We can see that the training accuracy increases quickly. Actually, if you look back at tutorial 4 (Dive Deep into Training I3D Models on Kinetics400) and compare the training curves, you will see that fine-tuning can achieve much better results in a much shorter time. Try fine-tuning other SOTA video models on your own dataset and see how it goes.
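Once you are happy with the fine-tuned model, don't forget to save it. Below is a minimal sketch; the file name is just an example.

# Save the fine-tuned weights; the file name is arbitrary.
net.save_parameters('i3d_resnet50_v1_ucf101_finetuned.params')

# Later, rebuild the same architecture and load the weights back.
net2 = get_model(name='i3d_resnet50_v1_custom', nclass=101)
net2.load_parameters('i3d_resnet50_v1_ucf101_finetuned.params', ctx=ctx)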
