1. Getting Started with Pre-trained TSN Models on UCF101
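UCF101 is an action recognition dataset of realistic action videos collected from YouTube. It contains 13,320 short clips from 101 action categories and is one of the most widely used datasets in the research community for benchmarking state-of-the-art video action recognition models.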
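TSN (Temporal Segment Network) is a widely adopted video classification method. It was proposed to incorporate temporal information from an entire video. The idea is straightforward: divide the video evenly into several segments, process each segment individually, obtain a segmental consensus from the per-segment results, and then make the final prediction. TSN is more a general algorithm than a specific network architecture; it works with both 2D and 3D neural networks.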
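The segment-and-consensus idea can be illustrated with a minimal sketch in plain Python. It is not GluonCV's implementation: the helper names `segment_centers` and `average_consensus`, the choice of the center frame of each segment, and the toy scores are all assumptions made for illustration.

```python
def segment_centers(num_frames, num_segments):
    """Split [0, num_frames) into equal segments and return one index per segment.

    Assumption: we pick the center frame of each segment. Other choices,
    such as a random frame inside each segment, are equally possible.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def average_consensus(segment_scores):
    """Average per-segment class scores into one video-level score vector."""
    num_segments = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [
        sum(scores[c] for scores in segment_scores) / num_segments
        for c in range(num_classes)
    ]


# A 90-frame video split into 3 segments of 30 frames each.
indices = segment_centers(num_frames=90, num_segments=3)
print(indices)  # [15, 45, 75]

# Hypothetical scores for 3 segments over 2 classes.
toy_scores = [[2.0, 0.0], [1.0, 1.0], [3.0, 2.0]]
print(average_consensus(toy_scores))  # [2.0, 1.0]
```

Averaging is only one possible consensus function; it is used here because it is the simplest to read.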
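In this tutorial, we demonstrate how to load a pre-trained TSN model from gluoncv-model-zoo and classify video frames from the Internet or your local disk into one of the 101 action classes.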
Step by Step
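We will show two examples here. For simplicity, we first try out a pre-trained UCF101 model on a single video frame. This is effectively an image action recognition problem.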
First, please follow the installation guide to install MXNet and GluonCV if you haven't done so yet.
import matplotlib.pyplot as plt
import numpy as np
import mxnet as mx
from mxnet import gluon, nd, image
from mxnet.gluon.data.vision import transforms
from gluoncv.data.transforms import video
from gluoncv import utils
from gluoncv.model_zoo import get_model
Then, we download and show the example image:
url = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/ThrowDiscus.png'
im_fname = utils.download(url)
img = image.imread(im_fname)
plt.imshow(img.asnumpy())
plt.show()

Out:
Downloading ThrowDiscus.png from https://github.com/bryanyzhu/tiny-ucf101/raw/master/ThrowDiscus.png...
In case you don't recognize it, the image shows a man throwing a discus. :)
Now we define transformations for the image.
transform_fn = transforms.Compose([
    video.VideoCenterCrop(size=224),
    video.VideoToTensor(),
    video.VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
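This transformation function does three things: it center-crops the image to 224x224, transposes it to num_channels*height*width, and normalizes it with the mean and standard deviation calculated across all ImageNet images.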
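To make the three steps concrete, here is a minimal NumPy sketch of what such a pipeline computes for a single frame. It is an illustration, not the GluonCV implementation: the function name `center_crop_to_tensor` is hypothetical, and it assumes that pixel values are scaled from [0, 255] to [0, 1] before normalization, which is what the ImageNet mean and standard deviation values imply.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def center_crop_to_tensor(frame, size=224):
    """Center-crop an H x W x C uint8 frame and return a normalized C x H x W array."""
    height, width, _ = frame.shape
    if height < size or width < size:
        raise ValueError("frame is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    crop = frame[top:top + size, left:left + size, :]
    # Assumption: scale to [0, 1] before subtracting the mean.
    scaled = crop.astype(np.float32) / 255.0
    # Mean and std broadcast over the last (channel) axis.
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    # H x W x C -> C x H x W
    return np.transpose(normalized, (2, 0, 1))


# A synthetic 240 x 320 RGB frame stands in for a real video frame.
dummy_frame = np.full((240, 320, 3), 128, dtype=np.uint8)
tensor = center_crop_to_tensor(dummy_frame)
print(tensor.shape)  # (3, 224, 224)
```

Because normalized values can be negative or larger than 1, plotting them directly produces distorted colors.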
What does the transformed image look like?
img_list = transform_fn([img.asnumpy()])
plt.imshow(np.transpose(img_list[0], (1,2,0)))
plt.show()

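Can't recognize anything? Don't panic! Neither can I. The transformation makes the image more "model-friendly" rather than "human-friendly".
Next, we load a pre-trained VGG16 model. The VGG16 model was trained using TSN with three segments.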
net = get_model('vgg16_ucf101', nclass=101, pretrained=True)
Out:
Downloading /root/.mxnet/models/vgg16-e660d456.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/vgg16-e660d456.zip...
Downloading /root/.mxnet/models/vgg16_ucf101-d6dc1bba.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/vgg16_ucf101-d6dc1bba.zip...
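Note that if you want to use InceptionV3 series models, resize the image so that both dimensions are larger than 299 (e.g., 340x450) and change the input size from 224 to 299 in the transform function.
Finally, we prepare the image and feed it into the model.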
pred = net(nd.array(img_list[0]).expand_dims(axis=0))
classes = net.classes
topK = 5
ind = nd.topk(pred, k=topK)[0].astype('int')
print('The input video frame is classified to be')
for i in range(topK):
    print('\t[%s], with probability %.3f.'%
          (classes[ind[i].asscalar()], nd.softmax(pred)[0][ind[i]].asscalar()))
Out:
The input video frame is classified to be
[ThrowDiscus], with probability 0.998.
[HorseRace], with probability 0.001.
[VolleyballSpiking], with probability 0.001.
[Hammering], with probability 0.000.
[TennisSwing], with probability 0.000.
We can see that our pre-trained model predicts this video frame to be the ThrowDiscus action with high confidence.
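The probabilities printed above come from applying softmax to the raw network outputs and then picking the largest entries. The following plain-Python sketch shows that computation on made-up scores; the helper names `softmax` and `top_k` and the toy values are illustrative assumptions, not part of the MXNet API.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Return the indices of the k largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


toy_classes = ["ThrowDiscus", "HorseRace", "Hammering", "TennisSwing"]
toy_scores = [9.0, 2.0, 1.0, 0.5]  # made-up raw outputs

probs = softmax(toy_scores)
for i in top_k(probs, k=2):
    print("[%s], with probability %.3f." % (toy_classes[i], probs[i]))
```

A large gap between the top raw score and the rest is what produces a probability close to 1 for the top class.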
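The next example shows how to perform video action recognition, i.e., how to use the same pre-trained model on an entire video.
First, we download the video and sample frames at a rate of one frame per second (every 25th frame, assuming the video runs at 25 frames per second).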

from gluoncv.utils import try_import_cv2
cv2 = try_import_cv2()

url = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/v_Basketball_g01_c01.avi'
video_fname = utils.download(url)

cap = cv2.VideoCapture(video_fname)
cnt = 0
video_frames = []
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    cnt += 1
    # Keep every 25th frame (about one frame per second at 25 fps).
    if cnt % 25 == 0:
        # OpenCV decodes frames as BGR; convert to RGB to match the ImageNet statistics.
        video_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

cap.release()
print('We evenly extract %d frames from the video %s.' % (len(video_frames), video_fname))
Out:
Downloading v_Basketball_g01_c01.avi from https://github.com/bryanyzhu/tiny-ucf101/raw/master/v_Basketball_g01_c01.avi...
We evenly extract 0 frames from the video v_Basketball_g01_c01.avi.
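The loop above keeps every 25th frame, which corresponds to one frame per second only if the video runs at 25 frames per second. A minimal sketch of how the stride could be derived from the frame rate instead is shown below; `sampled_indices` is a hypothetical helper, and the frame counts in the example are made up. A result of zero frames, as in the output above, means that no frame reached the sampling condition, for example because the video is shorter than one stride or could not be decoded.

```python
def sampled_indices(total_frames, fps, seconds_between_samples=1.0):
    """Return the 0-based indices of the frames to keep.

    Mirrors the counter logic of the sampling loop: the counter is
    incremented before the check, so the first kept frame is the
    stride-th frame of the video (0-based index stride - 1).
    """
    stride = max(1, int(round(fps * seconds_between_samples)))
    return list(range(stride - 1, total_frames, stride))


print(sampled_indices(total_frames=80, fps=25))  # [24, 49, 74]
print(sampled_indices(total_frames=20, fps=25))  # []
```

With OpenCV, the frame rate can usually be read with `cap.get(cv2.CAP_PROP_FPS)`, although some containers report it as 0, so a fallback value is advisable.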
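Now we transform each video frame and feed it into the model. At the end, we average the predictions from the sampled frames to get a reasonable video-level prediction.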
if video_frames:
    video_frames_transformed = transform_fn(video_frames)
    final_pred = 0
    for frame_img in video_frames_transformed:
        pred = net(nd.array(frame_img).expand_dims(axis=0))
        final_pred += pred
    # Average the raw predictions over all sampled frames.
    final_pred /= len(video_frames)

    classes = net.classes
    topK = 5
    ind = nd.topk(final_pred, k=topK)[0].astype('int')
    print('The input video is classified to be')
    for i in range(topK):
        print('\t[%s], with probability %.3f.'%
              (classes[ind[i].asscalar()], nd.softmax(final_pred)[0][ind[i]].asscalar()))
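If at least one frame was extracted, the pre-trained model classifies this video as the Basketball action with high confidence. Note that there are many ways to sample video frames and obtain a final video-level prediction.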
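As one illustration of those alternatives, the sketch below contrasts two ways of combining per-frame outputs: averaging raw scores before softmax, as the code above does, and averaging per-frame probabilities. The helper names and the toy scores are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vectors(vectors):
    """Element-wise mean of equally sized lists."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Made-up raw scores for 3 frames over 3 classes.
frame_scores = [
    [4.0, 1.0, 0.0],
    [0.5, 1.0, 0.0],
    [3.0, 2.5, 0.0],
]

# Option 1: average the raw scores, then apply softmax once.
probs_from_mean_scores = softmax(mean_vectors(frame_scores))

# Option 2: apply softmax per frame, then average the probabilities.
probs_from_mean_probs = mean_vectors([softmax(s) for s in frame_scores])

print(probs_from_mean_scores)
print(probs_from_mean_probs)
```

Both options yield a valid probability vector, but the confidence values generally differ; which one works better depends on the model and the data.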