3. Inference with Quantized Models
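This tutorial shows how to run inference with quantized GluonCV models on Intel Xeon processors to gain higher performance.
The following example requires GluonCV>=0.5 and MXNet-mkl>=1.6.0b20191010. If necessary, follow our installation guide to install or upgrade GluonCV and the nightly build of MXNet.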
Introduction
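GluonCV provides several quantized models that improve performance and reduce deployment costs for computer vision inference tasks. In production, lower precision (INT8) has two main benefits. First, low-precision instructions such as Intel Vector Neural Network Instructions (VNNI) accelerate computation. Second, a lower-precision data type saves memory bandwidth, gives better cache locality, and saves power. On hardware with Intel Deep Learning Boost (VNNI) enabled, such as the latest AWS EC2 C5 instances, this feature delivers up to a 4X speedup with less than a 0.5% drop in accuracy.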
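To make the idea of INT8 precision concrete, here is a minimal sketch of symmetric quantization in plain Python. It assumes a single per-tensor threshold and a signed range of [-127, 127]; the function names and sample values are illustrative and are not part of GluonCV or MXNet. An INT8 code occupies 1 byte where an FP32 value occupies 4, which is where the memory-bandwidth saving comes from.

```python
def quantize_int8(values, threshold):
    """Map floats in [-threshold, threshold] to integer codes in [-127, 127]."""
    scale = 127.0 / threshold
    codes = []
    for v in values:
        q = int(round(v * scale))
        codes.append(max(-127, min(127, q)))  # clip anything beyond the threshold
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [q / scale for q in codes]


# Hypothetical activation values, used only for illustration.
activations = [0.02, -1.5, 0.75, 3.0, -2.25]
threshold = max(abs(v) for v in activations)
codes, scale = quantize_int8(activations, threshold)
restored = dequantize(codes, scale)

# The round-trip error is bounded by half a quantization step.
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print(codes, max_error)
```

The choice of `threshold` controls the trade-off: a larger threshold covers outliers but makes each quantization step coarser.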
Please check out verify_pretrained.py for ImageNet inference, eval_ssd.py for SSD inference, test.py for segmentation inference, validate.py for pose estimation inference, and test_recognizer.py for video action recognition.
Performance
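GluonCV supports several quantized classification, detection, and segmentation models. For throughput, the goal is to reach maximum machine efficiency by batching inference requests together and obtaining the results in a single iteration. From the bar chart, it is clear that the fusion and quantization approach improved the throughput of the selected models by 2.68X to 7.24X. The CPU performance below was collected with dummy input on an AWS EC2 C5.12xlarge instance with 24 physical cores and Intel(R) VNNI enabled.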
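As a rough illustration of how a throughput figure and an INT8/FP32 speedup can be computed, the sketch below times a batched workload and reports images per second. The helper names and the stand-in workload are hypothetical; a real measurement would call the model's forward pass and wait for the result before stopping the timer.

```python
import time


def measure_throughput(run_batch, batch_size, num_iterations=10, warmup=2):
    """Return images per second for a callable that processes one batch."""
    for _ in range(warmup):  # warm-up runs are excluded from the timing
        run_batch()
    start = time.perf_counter()
    for _ in range(num_iterations):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * num_iterations / elapsed


def speedup(int8_throughput, fp32_throughput):
    """Ratio reported in the 'Speedup (INT8/FP32)' column."""
    return int8_throughput / fp32_throughput


def fake_forward():
    # Stand-in for a model forward pass on one batch (assumption).
    return sum(i * i for i in range(20000))


images_per_second = measure_throughput(fake_forward, batch_size=128)
print("throughput: %.1f images/s" % images_per_second)
```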

Model | Dataset | Batch Size | Speedup (INT8/FP32) | FP32 Accuracy | INT8 Accuracy |
---|---|---|---|---|---|
ResNet50 V1 | ImageNet | 128 | 7.24 | 77.21%/93.55% | 76.08%/93.04% |
MobileNet 1.0 | ImageNet | 128 | 7.00 | 73.28%/91.22% | 71.94%/90.47% |
SSD-VGG 300* | VOC | 224 | 5.96 | 77.4 | 77.38 |
SSD-VGG 512* | VOC | 224 | 5.55 | 78.41 | 78.38 |
SSD-resnet50_v1 512* | VOC | 224 | 5.03 | 80.21 | 80.25 |
SSD-mobilenet1.0 512* | VOC | 224 | 3.22 | 75.42 | 74.70 |
FCN_resnet101 | VOC | 1 | 4.82 | 97.97% | 98.00% |
PSP_resnet101 | VOC | 1 | 2.68 | 98.46% | 98.45% |
Deeplab_resnet101 | VOC | 1 | 3.20 | 98.36% | 98.34% |
FCN_resnet101 | COCO | 1 | 5.05 | 91.28% | 90.96% |
PSP_resnet101 | COCO | 1 | 2.69 | 91.82% | 91.88% |
Deeplab_resnet101 | COCO | 1 | 3.27 | 91.86% | 91.98% |
simple_pose_resnet18_v1b | COCO Keypoint | 128 | 2.55 | 66.3 | 65.9 |
simple_pose_resnet50_v1b | COCO Keypoint | 128 | 3.50 | 71.0 | 70.6 |
simple_pose_resnet50_v1d | COCO Keypoint | 128 | 5.89 | 71.6 | 71.4 |
simple_pose_resnet101_v1b | COCO Keypoint | 128 | 4.07 | 72.4 | 72.2 |
simple_pose_resnet101_v1d | COCO Keypoint | 128 | 5.97 | 73.0 | 72.7 |
vgg16_ucf101 | UCF101 | 64 | 4.46 | 81.86 | 81.41 |
inceptionv3_ucf101 | UCF101 | 64 | 5.16 | 86.92 | 86.55 |
resnet18_v1b_kinetics400 | Kinetics400 | 64 | 5.24 | 63.29 | 63.14 |
resnet50_v1b_kinetics400 | Kinetics400 | 64 | 6.78 | 68.08 | 68.15 |
inceptionv3_kinetics400 | Kinetics400 | 64 | 5.29 | 67.93 | 67.92 |
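* Quantized SSD models are evaluated with nms_thresh=0.45 and nms_topk=200. For segmentation models, the accuracy metric is pixAcc; for pose estimation models, the accuracy metric is OKS AP without flip. Quantized 2D video action recognition models are calibrated with num-segments=3 (7 for resnet-based models).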
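For readers unfamiliar with the two SSD evaluation settings, the sketch below shows what an IoU threshold and a top-k limit do in a generic greedy non-maximum suppression. It is a plain-Python illustration under the assumption that top-k is applied to the score-sorted boxes before suppression; it is not GluonCV's implementation, and the boxes and scores are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, nms_thresh=0.45, nms_topk=200):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Consider only the nms_topk highest-scoring candidates.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = order[:nms_topk]
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

With these inputs, the second box overlaps the first with an IoU of 81/119, which exceeds 0.45, so it is suppressed.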
Demo usage for SSD
# set omp to use all physical cores of one socket
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export CPUs=`lscpu | grep 'Core(s) per socket' | awk '{print $4}'`
export OMP_NUM_THREADS=${CPUs}
# with Pascal VOC validation dataset saved on disk
python eval_ssd.py --network=mobilenet1.0 --quantized --data-shape=512 --batch-size=224 --dataset=voc --benchmark
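When the benchmark is driven from a Python launcher rather than a shell, the same two variables can be prepared as sketched below. The helper names and the sample `lscpu` text are hypothetical. The sketch assumes the variables are set before `mxnet` is imported, since OpenMP settings are normally read when the runtime initializes.

```python
import os


def cores_per_socket(lscpu_output):
    """Extract the 'Core(s) per socket' value from lscpu text output."""
    for line in lscpu_output.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "Core(s) per socket":
            return int(value.strip())
    raise ValueError("'Core(s) per socket' not found in lscpu output")


def threading_env(lscpu_output):
    """Build the environment variables used by the shell commands above."""
    return {
        "KMP_AFFINITY": "granularity=fine,noduplicates,compact,1,0",
        "OMP_NUM_THREADS": str(cores_per_socket(lscpu_output)),
    }


# Sample text for illustration; on Linux the real text could come from
# subprocess.run(["lscpu"], capture_output=True, text=True).stdout
sample = (
    "Architecture:        x86_64\n"
    "Thread(s) per core:  2\n"
    "Core(s) per socket:  24\n"
    "Socket(s):           2\n"
)
env = threading_env(sample)
os.environ.update(env)  # do this before importing mxnet
print(env)
```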
Usage
SYNOPSIS
python eval_ssd.py [-h] [--network NETWORK] [--deploy]
[--model-prefix MODEL_PREFIX] [--quantized]
[--data-shape DATA_SHAPE] [--batch-size BATCH_SIZE]
[--benchmark BENCHMARK] [--num-iterations NUM_ITERATIONS]
[--dataset DATASET] [--num-workers NUM_WORKERS]
[--num-gpus NUM_GPUS] [--pretrained PRETRAINED]
[--save-prefix SAVE_PREFIX] [--calibration CALIBRATION]
[--num-calib-batches NUM_CALIB_BATCHES]
[--quantized-dtype {auto,int8,uint8}]
[--calib-mode CALIB_MODE]
OPTIONS
-h, --help show this help message and exit
--network NETWORK base network name
--deploy whether load static model for deployment
--model-prefix MODEL_PREFIX
load static model as hybridblock.
--quantized use int8 pretrained model
--data-shape DATA_SHAPE
input data shape
--batch-size BATCH_SIZE
eval mini-batch size
--benchmark BENCHMARK run dummy-data based benchmarking
--num-iterations NUM_ITERATIONS number of benchmarking iterations.
--dataset DATASET eval dataset.
--num-workers NUM_WORKERS, -j NUM_WORKERS
number of data workers
--num-gpus NUM_GPUS number of gpus to use.
--pretrained PRETRAINED
load weights from previously saved parameters.
--save-prefix SAVE_PREFIX
saving parameter prefix
--calibration quantize model
--num-calib-batches NUM_CALIB_BATCHES
number of batches for calibration
--quantized-dtype {auto,int8,uint8}
quantization destination data type for input data
--calib-mode CALIB_MODE
calibration mode used for generating calibration table
for the quantized symbol; supports 1. none: no
calibration will be used. The thresholds for
quantization will be calculated on the fly. This will
result in inference speed slowdown and loss of
accuracy in general. 2. naive: simply take min and max
values of layer outputs as thresholds for
quantization. In general, the inference accuracy
worsens with more examples used in calibration. It is
recommended to use `entropy` mode as it produces more
accurate inference results. 3. entropy: calculate KL
divergence of the fp32 output and quantized output for
optimal thresholds. This mode is expected to produce
the best inference accuracy of all three kinds of
quantized models if the calibration dataset is
representative enough of the inference dataset.
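The difference between the `naive` and `entropy` modes can be illustrated with a simplified, pure-Python sketch. It is not MXNet's implementation: the bin count, the number of quantized levels, the smoothing floor, and the synthetic data are all assumptions chosen to keep the example small. The naive threshold follows the single largest value, while the entropy search picks the clipping threshold whose coarsened histogram has the lowest KL divergence from the original one.

```python
import math
import random


def naive_threshold(values):
    """'naive' mode: the largest absolute value seen becomes the threshold."""
    return max(abs(min(values)), abs(max(values)))


def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) for two unnormalized histograms of equal length."""
    p_sum, q_sum = sum(p), sum(q)
    if p_sum == 0 or q_sum == 0:
        return float("inf")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            p_n = p_i / p_sum
            q_n = max(q_i / q_sum, eps)  # floor avoids log(0)
            total += p_n * math.log(p_n / q_n)
    return total


def entropy_threshold(values, num_bins=256, num_levels=32):
    """Simplified 'entropy' mode: search for the threshold with minimal KL."""
    max_abs = naive_threshold(values)
    if max_abs == 0:
        return 0.0
    width = max_abs / num_bins
    hist = [0] * num_bins
    for v in values:
        hist[min(int(abs(v) / width), num_bins - 1)] += 1

    best_kl, best = float("inf"), max_abs
    for i in range(num_levels, num_bins + 1):
        # Reference: clip at bin i and fold the outliers into the last bin.
        p = hist[:i]
        p[-1] += sum(hist[i:])
        # Candidate: merge the i bins into num_levels coarse levels, then
        # spread each level's count evenly over its non-empty fine bins.
        q = [0.0] * i
        for level in range(num_levels):
            lo, hi = level * i // num_levels, (level + 1) * i // num_levels
            nonzero = [k for k in range(lo, hi) if p[k] > 0]
            if nonzero:
                share = sum(hist[lo:hi]) / len(nonzero)
                for k in nonzero:
                    q[k] = share
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best = kl, i * width
    return best


# Synthetic layer outputs: mostly small values plus one large outlier.
random.seed(0)
outputs = [random.gauss(0.0, 1.0) for _ in range(5000)] + [25.0]
naive = naive_threshold(outputs)
entropy = entropy_threshold(outputs)
print(naive, entropy)
```

In this toy data the single outlier sets the naive threshold, which spreads the available integer codes over a range that is mostly empty. That is one plausible reason why naive calibration can degrade as more examples (and hence more outliers) are included.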
Calibration Tool
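GluonCV also provides a calibration tool that lets users quantize their models to int8 with their own dataset. Currently, the calibration tool only supports hybridized Gluon models. Below is an example of quantizing an SSD model to int8.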
# Calibration
python eval_ssd.py --network=mobilenet1.0 --data-shape=512 --batch-size=224 --dataset=voc --calibration --num-calib-batches=5 --calib-mode=naive
# INT8 Inference
python eval_ssd.py --network=mobilenet1.0 --data-shape=512 --batch-size=224 --deploy --model-prefix=./model/ssd_512_mobilenet1.0_voc-quantized-naive
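The first command launches naive calibration to quantize the ssd_mobilenet1.0 model to int8 using a subset (5 batches) of the given dataset. Users can tune the int8 accuracy by setting different calibration configurations. After calibration, the quantized model and parameters are saved on disk. The second command then loads the quantized model as a SymbolBlock for inference.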
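Loading the saved model from Python might look like the sketch below. It assumes the files follow the usual Gluon export naming (`<prefix>-symbol.json` and `<prefix>-0000.params`) and that the network's input is named `data`; both are assumptions, and the helper names are hypothetical. `mxnet` is imported inside the loader so that the path helper can be used without it.

```python
def exported_model_files(model_prefix, epoch=0):
    """Return the (symbol, params) file names that share `model_prefix`."""
    symbol_file = "{}-symbol.json".format(model_prefix)
    param_file = "{}-{:04d}.params".format(model_prefix, epoch)
    return symbol_file, param_file


def load_quantized_model(model_prefix, input_name="data"):
    """Load an exported quantized model as a SymbolBlock (requires MXNet)."""
    import mxnet as mx  # lazy import: only needed when actually loading

    symbol_file, param_file = exported_model_files(model_prefix)
    return mx.gluon.SymbolBlock.imports(symbol_file, [input_name], param_file)


prefix = "./model/ssd_512_mobilenet1.0_voc-quantized-naive"
symbol_file, param_file = exported_model_files(prefix)
print(symbol_file)
print(param_file)
# net = load_quantized_model(prefix)  # run this where MXNet and the files exist
```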
Users can also quantize their own hybridized Gluon models with the quantize_net API. Below are some descriptions.
API
CODE
import logging
import mxnet as mx
from mxnet.contrib.quantization import quantize_net
quantized_net = quantize_net(network, quantized_dtype='auto',
exclude_layers=None, exclude_layers_match=None,
calib_data=None, data_shapes=None,
calib_mode='naive', num_calib_examples=None,
ctx=mx.cpu(), logger=logging)
Parameters
network : Gluon HybridBlock
Defines the structure of a neural network for FP32 data types.
quantized_dtype : str
The quantized destination type for input data. Currently supports 'int8',
'uint8' and 'auto'.
'auto' means automatically select output type according to calibration result.
Default value is 'auto', as in the signature above.
exclude_layers : list of strings
A list of strings representing the names of the symbols that users want to exclude
from being quantized.
exclude_layers_match : list of strings
A list of wildcard strings matching the names of the symbols that users want to exclude
from being quantized.
calib_data : mx.io.DataIter or gluon.DataLoader
An iterable data loading object.
data_shapes : list
List of DataDesc, required if calib_data is not provided
calib_mode : str
If calib_mode='none', no calibration will be used and the thresholds for
requantization after the corresponding layers will be calculated at runtime by
calling min and max operators. The quantized models generated in this
mode are normally 10-20% slower than those with calibrations during inference.
If calib_mode='naive', the min and max values of the layer outputs from a calibration
dataset will be directly taken as the thresholds for quantization.
If calib_mode='entropy', the thresholds for quantization will be
derived such that the KL divergence between the distributions of FP32 layer outputs and
quantized layer outputs is minimized based upon the calibration dataset.
calib_layer : function
Given a layer's output name in string, return True or False for deciding whether to
calibrate this layer. If yes, the statistics of the layer's output will be collected;
otherwise, no information of the layer's output will be collected. If not provided,
all the layers' outputs that need requantization will be collected.
num_calib_examples : int or None
The maximum number of examples that user would like to use for calibration.
If not provided, the whole calibration dataset will be used.
ctx : Context
Defines the device on which forward propagation is run over the calibration
dataset to collect layer output statistics. Currently, only a single context is supported,
and only CPU with the MKL-DNN backend.
logger : Object
A logging object for printing information during the process of quantization.
Returns
network : Gluon SymbolBlock
Defines the structure of a neural network for INT8 data types.
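Putting the parameters together, a call on a small network might look like the following sketch. The toy network, the random calibration data, and the helper names are hypothetical, and the example assumes an MXNet build with the MKL-DNN backend that matches the signature documented above. The argument check reflects the descriptions above (three calibration modes, with calibration data needed for `naive` and `entropy`); it is an illustration, not part of the MXNet API.

```python
import logging

CALIB_MODES = ("none", "naive", "entropy")


def check_calibration_args(calib_mode, calib_data):
    """Fail early on argument combinations that cannot be calibrated."""
    if calib_mode not in CALIB_MODES:
        raise ValueError("calib_mode must be one of %s" % (CALIB_MODES,))
    if calib_mode != "none" and calib_data is None:
        raise ValueError("calib_mode=%r needs calib_data" % calib_mode)
    return calib_mode


def quantize_toy_network(calib_mode="naive", num_calib_examples=32):
    """Quantize a tiny hybridized network (requires MXNet with MKL-DNN)."""
    import mxnet as mx  # lazy import so the check above works without MXNet
    from mxnet.contrib.quantization import quantize_net

    # Hypothetical FP32 network standing in for a real model.
    net = mx.gluon.nn.HybridSequential()
    net.add(mx.gluon.nn.Conv2D(8, kernel_size=3, activation="relu"),
            mx.gluon.nn.GlobalAvgPool2D(),
            mx.gluon.nn.Dense(10))
    net.initialize()
    net.hybridize()

    # Random tensors standing in for a representative calibration dataset.
    data = mx.nd.random.uniform(shape=(64, 3, 32, 32))
    label = mx.nd.zeros((64,))
    calib_data = mx.gluon.data.DataLoader(
        mx.gluon.data.ArrayDataset(data, label), batch_size=16)

    check_calibration_args(calib_mode, calib_data)
    return quantize_net(net, quantized_dtype="auto",
                        calib_data=calib_data, calib_mode=calib_mode,
                        num_calib_examples=num_calib_examples,
                        ctx=mx.cpu(), logger=logging)


# quantized = quantize_toy_network()  # run this where MXNet is installed
```

In practice the calibration data should come from the same distribution as the inference data, since the thresholds are derived from the layer outputs it produces.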