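A quick aside: this post grew out of a real project in which I was responsible for deploying various models in a Qt application. I went through model deployment for Libtorch, PyTorch, and TensorFlow in turn. My understanding is not deep, but everything runs end to end.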

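Overview: using a DenseNet121 trained in TensorFlow to classify CIFAR10 as the running example, this post walks through TensorRT-accelerated model deployment in a C++ environment.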

0. Environment

Name	Version
TensorRT	TensorRT-7.2.3.4.Windows10.x86_64.cuda-11.1.cudnn8.1
tensorflow-gpu	2.9.1
C++ Compiler	MSVC/14.29.30133
CUDA	11.1
cuDNN	8.4.1
libtorch	libtorch-1.8.2+cu111
pytorch	torch1.12.0+cu113
tf2onnx	1.11.1
opencv	opencv-3.4.13
keras	2.9.0
h5py	3.9.0
Windows	Windows 10 Home (Chinese edition) 19044.1889

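The overall deployment workflow (shown as a figure in the original post) is: train and save the Keras model (.h5), freeze it to a .pb file, convert that to .onnx, build a TensorRT engine (.trt), then load the engine and run inference in C++.

For reference, see: Speeding Up Deep Learning Inference Using TensorFlow, ONNX, and TensorRT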

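1. Training and saving the model

Following this blog post, train a CIFAR10 model based on the DenseNet network from keras.applications and save it in .hdf5 format.

We add a resize layer in front of the standard DenseNet121 so that the network accepts the 32x32x3 images of the CIFAR10 dataset. The code is as follows: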

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import keras as K
from keras import datasets, layers, models

def preprocess_data(X, Y):
    """pre-processes the data"""
    X_p = K.applications.densenet.preprocess_input(X)
    """one hot encode target values"""
    Y_p = K.utils.to_categorical(Y, 10)
    return X_p, Y_p

"""load dataset"""
(trainX, trainy), (testX, testy) = K.datasets.cifar10.load_data()
x_train, y_train = preprocess_data(trainX, trainy)
x_test, y_test = preprocess_data(testX, testy)

""" USE DenseNet121"""
OldModel = K.applications.DenseNet121(include_top=False,input_tensor=None,weights='imagenet')
for layer in OldModel.layers[:149]:
    layer.trainable = False
for layer in OldModel.layers[149:]:
    layer.trainable = True

model = K.models.Sequential()

"""a lambda layer that scales up the data to the correct size"""
model.add(K.layers.Lambda(lambda x:K.backend.resize_images(x,height_factor=7,width_factor=7,data_format='channels_last')))

model.add(OldModel)
model.add(K.layers.Flatten())
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(256, activation='relu'))
model.add(K.layers.Dropout(0.7))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(128, activation='relu'))
model.add(K.layers.Dropout(0.5))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(64, activation='relu'))
model.add(K.layers.Dropout(0.3))
model.add(K.layers.Dense(10, activation='softmax'))
"""callbacks"""
# cbacks =  K.callbacks.CallbackList()
# cbacks.append(K.callbacks.ModelCheckpoint(filepath='cifar10.h5',monitor='val_accuracy',save_best_only=True))
# cbacks.append(K.callbacks.EarlyStopping(monitor='val_accuracy',patience=2))

model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
"""train"""
model.fit(x=x_train,y=y_train,batch_size=128,epochs=5,validation_data=(x_test, y_test))
model.summary()

model.save('cifar10.h5')

In practice, if you run the conversions below on the cifar10.h5 model produced by this training code, building the trt engine file fails with:

[07/28/2022-12:54:39] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
[07/28/2022-12:54:39] [E] Parsing model failed
[07/28/2022-12:54:39] [E] Engine creation failed
[07/28/2022-12:54:39] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec

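This comes from a current TensorRT limitation (see #974 (comment)): this version does not support the resize_images operation used in the model. NonZero is another unsupported op ("op is not supported in TRT yet").

The keras.backend.resize_images call in the training code above is exported as a Resize using nearest mode + half_pixel + round_prefer_ceil, while this TensorRT version only accepts the floor nearest_mode.

An identical issue has been reported.

Fix: change the Lambda layer to model.add(K.layers.Lambda(lambda x:tf.image.resize(x,[224,224]))). tf.image.resize defaults to bilinear interpolation, so the exported Resize no longer uses nearest mode.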
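
To make the rounding-mode difference concrete, here is a minimal sketch of how a nearest-mode Resize picks a source pixel along one axis. It assumes the half_pixel coordinate transform and the nearest_mode names as defined in the ONNX Resize operator; the helper name source_index is hypothetical and exists only for illustration.

```python
import math


def source_index(out_idx, scale, in_size, nearest_mode):
    """Map an output pixel index to an input pixel index (1-D, nearest).

    Assumes the ONNX 'half_pixel' coordinate transform:
        x_in = (x_out + 0.5) / scale - 0.5
    """
    x_in = (out_idx + 0.5) / scale - 0.5
    if nearest_mode == "floor":
        idx = math.floor(x_in)
    elif nearest_mode == "round_prefer_ceil":
        # Round to the nearest integer; exact ties round up.
        idx = math.floor(x_in + 0.5)
    else:
        raise ValueError(f"unsupported nearest_mode: {nearest_mode}")
    # Clamp to the valid input range.
    return min(max(idx, 0), in_size - 1)


scale, in_size = 7, 32          # the factor-7 upscale used in the training code
out_size = scale * in_size      # 32 * 7 = 224
differing = [
    i for i in range(out_size)
    if source_index(i, scale, in_size, "floor")
    != source_index(i, scale, in_size, "round_prefer_ceil")
]
print(out_size, len(differing) > 0)
```

For example, output index 7 maps to input pixel 0 under floor but to input pixel 1 under round_prefer_ceil, so the two modes are not interchangeable and the parser rejects the one it cannot reproduce.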

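With that fixed: Keras's Sequential model makes it quick to assemble your own network and convenient to save it.

2. Freezing the model

The hdf5 model still holds trainable variables and can be trained further. Here we freeze it into a .pb file, with variables converted to constants, for forward inference only.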

import tensorflow as tf
import keras as K
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

def convert_h5to_pb():
    model = tf.keras.models.load_model("E:/cifar10.h5",compile=False)
    model.summary()
    full_model = tf.function(lambda Input: model(Input))
    full_model = full_model.get_concrete_function(tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

    # Get frozen ConcreteFunction
    frozen_func = convert_variables_to_constants_v2(full_model)
    frozen_func.graph.as_graph_def()

    layers = [op.name for op in frozen_func.graph.get_operations()]
    print("-" * 50)
    print("Frozen model layers: ")
    for layer in layers:
        print(layer)

    print("-" * 50)
    print("Frozen model inputs: ")
    print(frozen_func.inputs)
    print("Frozen model outputs: ")
    print(frozen_func.outputs)

    # Save frozen graph from frozen ConcreteFunction to hard drive
    tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                      logdir="E:/",
                      name="cifar10.pb",
                      as_text=False)
convert_h5to_pb()

#output
--------------------------------------------------
Frozen model inputs: 
[<tf.Tensor 'Input:0' shape=(None, 32, 32, 3) dtype=float32>]
Frozen model outputs: 
[<tf.Tensor 'Identity:0' shape=(None, 10) dtype=float32>]

3. Converting to ONNX

Use the tf2onnx.convert command to convert the .pb file to an .onnx file:

python -m tf2onnx.convert --input E:/cifar10.pb --inputs Input:0 --outputs Identity:0 --output E:/cifar10.onnx --opset 11

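--inputs: name of the model's input tensor. --outputs: name of the model's output tensor.
Both names are printed by the freezing code above.

The generated ONNX file can be visualized with Netron to inspect the network structure.
In Netron, the input tensor of the ONNX model shows as float32[unk__1220,224,224,3], which is TensorFlow's NHWC layout.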

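4. Building the optimized engine file

(References: trtexec usage; "TensorRT: parameter guide for the bundled trtexec tool"; the official documentation; a test blog post.)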

trtexec --onnx=cifar10.onnx --saveEngine=cifar10.trt --workspace=4096 --minShapes=Input:0:1x32x32x3 --optShapes=Input:0:1x32x32x3 --maxShapes=Input:0:50x32x32x3 --fp16

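onnx: the input ONNX model
saveEngine: where to save the built TensorRT engine
workspace: GPU memory available to the builder, in MB; increase it manually if the build runs short
minShapes: the minimum shape for dynamic dimensions, written as <input node name>:<dims>. The dims must follow the layout of the ONNX input, which is NHWC for this model (not NCHW)
optShapes: the shape the engine is tuned for; trtexec also runs an inference test, and this is the input shape it uses
maxShapes: the maximum shape for dynamic dimensions; here only the batch is dynamic and all other dimensions are fixed
fp16: enable float16 inference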
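
Because the shape arguments are easy to get wrong (node name, dimension order), it can help to generate them. The following is a minimal sketch that rebuilds the command above from its parts, assuming only the batch dimension is dynamic. The helper names shape_arg and trtexec_args are hypothetical; the flags are the ones used in the command above.

```python
def shape_arg(input_name, dims):
    """Format one shape spec as <name>:<d0>x<d1>x..., e.g. Input:0:1x32x32x3."""
    return input_name + ":" + "x".join(str(d) for d in dims)


def trtexec_args(onnx_path, engine_path, input_name, sample_dims,
                 min_batch, opt_batch, max_batch,
                 workspace_mb=4096, fp16=True):
    """Build the trtexec argument list for a model whose only dynamic
    dimension is the batch (dimension 0).

    sample_dims is the per-sample shape in the ONNX input's own layout,
    e.g. (32, 32, 3) for an NHWC input.
    """
    if not (1 <= min_batch <= opt_batch <= max_batch):
        raise ValueError("need 1 <= min_batch <= opt_batch <= max_batch")
    args = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--workspace={workspace_mb}",
    ]
    for flag, batch in (("minShapes", min_batch),
                        ("optShapes", opt_batch),
                        ("maxShapes", max_batch)):
        args.append(f"--{flag}=" + shape_arg(input_name, (batch, *sample_dims)))
    if fp16:
        args.append("--fp16")
    return args


command = " ".join(
    trtexec_args("cifar10.onnx", "cifar10.trt", "Input:0", (32, 32, 3), 1, 1, 50)
)
print(command)
```

Keeping the input name and the per-sample shape in one place means the three shape flags cannot drift apart.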

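5. Data preprocessing

Our end goal is to run forward inference on data with the engine. By the end of section 4 we have the final "model", that is, the serialized engine file. Next comes data preprocessing, which here means loading the data. (I directly used CIFAR10 code that another developer adapted from the official MNIST dataset-handling code; GitHub link.)

To feed the data in dynamic batches, we can use Libtorch's DataLoader. Defining a custom dataset only requires overriding the get and size methods of torch::data::Dataset.

This article is all you need to teach yourself custom data loading: Custom Data Loading using PyTorch C++ API

Assuming the CustomDataset class is already written, feeding data in batches looks roughly like this: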

// Build the dataset; Stack<> collates the samples of a batch into one tensor
auto test_dataset = CustomDataset(dataset_path, ".txt", class2label)
    .map(torch::data::transforms::Stack<>());
// Build the DataLoader
auto test_data_loader = torch::data::make_data_loader(
    std::move(test_dataset), INFERENCE_BATCH);
//const size_t test_dataset_size = test_dataset.size().value();
for (const auto& batch : *test_data_loader){
    torch::Tensor inputs_tensor = batch.data;
    torch::Tensor labels_tensor = batch.target;
    ...
}

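6. Loading the engine file

Steps:

Read the .trt file into a buffer.
Create a runtime object with nvinfer1::createInferRuntime.
Call the runtime's deserializeCudaEngine method to deserialize the .trt data into an engine object.
Call IExecutionContext* context = engine->createExecutionContext(); to obtain the execution context object.

Inference is then run through the context's enqueueV2 method. The first three steps can be wrapped into one function named readTRTfile that returns an engine object.

We return the engine instead of going straight to the context because we need the engine's methods to inspect the model's input and output dimensions.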
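
The post implements these steps in C++. As a compact illustration of the same sequence, here is a minimal sketch using the TensorRT Python bindings, assuming the tensorrt package is installed. The helper names read_engine_bytes and load_engine are hypothetical; tensorrt is imported lazily so that the file-reading helper also runs on a machine without TensorRT.

```python
from pathlib import Path


def read_engine_bytes(path):
    """Step 1: read the serialized engine (.trt) file into memory."""
    data = Path(path).read_bytes()
    if not data:
        raise ValueError(f"engine file is empty: {path}")
    return data


def load_engine(path):
    """Steps 2 and 3: create a runtime and deserialize the engine.

    Assumption: the NVIDIA `tensorrt` Python package is available.
    """
    import tensorrt as trt  # lazy import, see the note above

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(read_engine_bytes(path))
    if engine is None:
        raise RuntimeError(f"failed to deserialize engine: {path}")
    return engine
```

Step 4 would then be engine.create_execution_context(), called by the caller so that the engine object stays available for inspecting dimensions.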

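[Key point] The models produced earlier (whether .pb or .pt) all have a dynamic batch dimension, so the resulting ONNX model has a dynamic input. When converting to trt we specified a range of input shapes for later inference, and it is only a range. After the trt file is deserialized into an engine, concrete dimensions must be set before the engine is used. If they are missing or wrong, you get this error: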

[E] [TRT] Parameter check failed at: engine.cpp::nvinfer1::rt::ShapeMachineContext::resolveSlots::1318, condition: allInputDimensionsSpecified(routine)

Fix: first inspect the engine's binding dimensions.

// Print the dimensions of every input/output binding of the engine
for (int i = 0; i < engine->getNbBindings(); i++){
    nvinfer1::Dims dims = engine->getBindingDimensions(i);
    printf("index %d, dims: (", i);
    for (int d = 0; d < dims.nbDims; d++){
        if (d < dims.nbDims - 1) printf("%d,", dims.d[d]);
        else printf("%d", dims.d[d]);
    }
    printf(")\n");
}

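For a DenseNet121 trt file, for example, the program above prints:

index 0, dims: (-1,224,224,3)
index 1, dims: (-1,100)

So the dynamic input dimension has to be pinned to a concrete value. In Python, do this before running inference with the engine: context.set_binding_shape(0, (BATCH, 3, INPUT_H, INPUT_W)), with the dimensions in the same order as the engine's binding. In C++, call the setBindingDimensions(int bindingIndex, Dims dimensions) method on the IExecutionContext instance.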

// Set concrete values for the dynamic dimensions
nvinfer1::Dims dims4;
dims4.d[0] = 1;    // replace dynamic batch size with 1
dims4.d[1] = 224;
dims4.d[2] = 224;
dims4.d[3] = 3;
dims4.nbDims = 4;
context->setBindingDimensions(0, dims4);

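After that, inference can be run.

The overall approach: given an engine file whose dimensions are unknown, first read the file and deserialize it to obtain the engine.
Then call getBindingDimensions() to inspect the engine's input and output dimensions (skip this if you already know them).
Before calling context->executeV2(), replace every dynamic dimension (value -1) with a concrete value and apply it with context->setBindingDimensions(). Once the data has been copied into the input buffer, call context->executeV2() to run inference.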
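
The "replace -1 with a concrete value" step can be sketched as a small pure function. This is a minimal illustration, assuming only dimension 0 (the batch) is dynamic, as in the engines built in this post. The helper name resolve_binding_dims and the max_batch parameter are hypothetical.

```python
def resolve_binding_dims(binding_dims, batch_size, max_batch=None):
    """Return binding_dims with a dynamic batch dimension (-1) replaced
    by batch_size.

    Assumption: only dimension 0 may be dynamic. max_batch, if given,
    should be the batch used in --maxShapes when the engine was built.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if max_batch is not None and batch_size > max_batch:
        raise ValueError("batch_size exceeds the engine's maximum batch")
    if any(d == -1 for d in binding_dims[1:]):
        raise ValueError("only a dynamic batch dimension is supported here")
    return tuple(
        batch_size if i == 0 and d == -1 else d
        for i, d in enumerate(binding_dims)
    )


resolved = resolve_binding_dims((-1, 224, 224, 3), batch_size=1, max_batch=50)
print(resolved)
```

The resulting tuple is what would be passed to set_binding_shape in Python, or copied into an nvinfer1::Dims for setBindingDimensions in C++.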

Why V2, and how do the V1 and V2 methods differ:

execute/enqueue are for implicit batch networks, and executeV2/enqueueV2 are for explicit batch networks. The V2 versions don’t take a batch_size argument since it’s taken from the explicit batch dimension of the network / or from the optimization profile if used.

In TensorRT 7, the ONNX parser requires that you create an explicit batch network, so you’ll have to use V2 methods.

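At this point we have obtained the engine object from the readTRTfile function, created the context object from the engine, and fixed the dynamic input dimensions on the context.

7. Running inference

Write a doinference function that takes the input and output data arrays. Each batch produced by the DataLoader written earlier is a torch::Tensor.

Allocate GPU memory with cudaMalloc.
Copy the batch to the GPU with cudaMemcpyAsync.
Run inference with context.enqueueV2.
Copy the results back to the CPU with cudaMemcpyAsync.

Those are roughly the four steps.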
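
The sizes passed to cudaMalloc and cudaMemcpyAsync follow directly from the binding dimensions. Here is a minimal sketch of that arithmetic, assuming float32 bindings (4 bytes per element) and the binding shapes printed in the program output in this section. The helper name buffer_bytes is hypothetical.

```python
from math import prod


def buffer_bytes(binding_dims, batch_size, bytes_per_element=4):
    """Bytes needed for one binding once the dynamic batch (-1) is fixed.

    bytes_per_element=4 assumes float32 input/output bindings.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    concrete = [batch_size if d == -1 else d for d in binding_dims]
    return prod(concrete) * bytes_per_element


input_bytes = buffer_bytes((-1, 32, 32, 3), batch_size=50)   # image batch
output_bytes = buffer_bytes((-1, 10), batch_size=50)         # class scores
print(input_bytes, output_bytes)  # 614400 2000
```

Allocating for the largest batch in the optimization profile once, and reusing the buffers for smaller batches, avoids reallocating on every call.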

Program output:

(TrtInfer::testAllSample) test_dataset_size0
loading filename from:E:/cifar10fix.trt
length:47512416
load engine done
deserializing
[08/25/2022-20:37:10] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[08/25/2022-20:37:11] [W] [TRT] TensorRT was linked against cuDNN 8.1.0 but loaded cuDNN 8.0.5
[08/25/2022-20:37:11] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
deserialize done
The engine in TensorRT.cpp is not nullptr
tensorRT engine created successfully.
[08/25/2022-20:37:12] [W] [TRT] TensorRT was linked against cuDNN 8.1.0 but loaded cuDNN 8.0.5
[08/25/2022-20:37:12] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
index 4, dims: (-1,32,32,3)
index 2, dims: (-1,10)
num_running_corrects_NUMS=====2132
num_running_NUMS=====10000
 Eval Loss: 2.23657 Eval Acc: 0.2132
test_dataset_size:()
HAPYY ENDING!!!~~~~~ヾ(≧▽≦*)oヾ(≧▽≦*)oヾ(≧▽≦*)o

I will post the full code later. These notes were long overdue, and I will keep updating them.

Errors you may run into:

ONNX to trt

[W] Dynamic dimensions required for input: input_1:0, but no shapes were provided. Automatically overriding shape to: 1x224x224x3
#The input node name in the shapes arguments is wrong: it should be input_1:0, not input_1. Use exactly the node name shown in Netron.

[E] [TRT] input_1:0: for dimension number 1 in profile 0 does not match network definition (got min=3, opt=3, max=3), expected min=opt=max=224).
#Change the shapes arguments from 1x3x224x224 to 1x224x224x3

ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
#The resize operator in the model (nearest mode with ceil rounding) is not supported

[E] [TRT] C:\source\rtSafe\cuda\cudaConvolutionRunner.cpp (483) - Cudnn Error in nvinfer1::rt::cuda::CudnnConvolutionRunner::executeConv: 2 (CUDNN_STATUS_ALLOC_FAILED)
#The --workspace value is too large; reduce it

[Could not load library cudnn_cnn_infer64_8.dll. Error code 1455.Please make sure cudnn_cnn_infer64_8.dll is in your library path!]
or [context null]
Cause: insufficient memory. Restarting Visual Studio or the computer fixes it. (Or see this Q&A.)

Installing Chocolatey.

Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

https://zhuanlan.zhihu.com/p/108833705

https://techwizard.cloud/2019/04/13/powershell-tip-exception-calling-downloadstring-with-1-arguments/#:~:text=throwing%20below%20error%3A-,Exception%20calling%20%E2%80%9CDownloadString%E2%80%9D%20with%20%E2%80%9C1%E2%80%9D%20argument(s,script%20will%20resolve%20this%20issue.

Add the seuic source:

$  choco source add -n=seuic -s"http://choco.seuic.info/nuget/" 
$  choco source remove -n=chocolatey
$  choco source add -n=chocolatey -s"https://chocolatey.org/api/v2/"  --priority=3

Then choco install fcgiwrap reported a download failure: the source does not have this package. Frustrating.

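iPad GoodNotes + Google Drive + Windows Notion

Requirement: read and annotate papers on the iPad, and manage them centrally in Notion on the PC. Papers must sync in real time, and the PC must show the latest annotations.

I explored for a while and tried Foxit, Notability, GoodNotes, PDFViewer, iCloud, and even Baidu Cloud. Weighing price, annotation habits, and ecosystem, syncing papers with GoodNotes + Google Drive suits me best (and costs $0).

Syncing papers

Enable Google Drive backup in GoodNotes. Everything in Notes is backed up, with the same folder structure, into a GoodNotes folder that is created automatically in Google Drive. That gives us cloud backup of the files.
Naturally, the papers browsed in Google Drive from the PC then carry the latest annotations.
The web version of Google Drive is worse than the desktop version: loading is slow, and with network problems it may not open at all. Therefore:

Download Google Drive for desktop and set the sync mode to mirror. You can then choose a local directory that mirrors all files in Drive.
I avoid the space-saving Stream mode because it mounts files under xx/My Drive/, a path containing a space, which I cannot tolerate. Worse, unless your system language is English (and Windows Home edition cannot be switched to English), the installed Google Drive is the Chinese version and mounts files under xx/我的云盘/, a path containing Chinese characters. And this path name cannot be changed.

I chose the custom path E:/Google/Drive/, so the papers in GoodNotes are mirrored locally under E:/Google/Drive/GoodNotes/. After GoodNotes modifies a paper, it syncs automatically to Drive; once Drive on the PC has synced, opening E:/Google/Drive/GoodNotes/xxx.pdf shows the latest annotated paper.

However, changes made on the PC to PDFs in the GoodNotes folder are not visible in GoodNotes, and they are overwritten the next time GoodNotes modifies the file. This setup cannot solve that: the file GoodNotes actually works on is a GoodNotes file, not the PDF, so you cannot have it both ways.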

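Managing papers

To manage papers in Notion, I need to jump from Notion to the latest annotated paper on the local disk, such as E:/Google/Drive/GoodNotes/xxx.pdf above. Uploading the file is not a good option: free accounts are limited to files under 5 MB, membership is expensive, and an uploaded copy no longer stays in sync. So I embed a link to the local file in Notion and open the file through that link, served by Nginx.

Downloading and configuring Nginx

From the download page I chose the stable nginx/Windows-1.22.1. Download and unzip it.

Double-click nginx.exe in the folder, then open localhost in a browser; the Nginx welcome page should appear.

Next, edit conf/nginx.conf in the installation directory; the main task is configuring a location. Open the file and add a new location block below the existing location block of the original server block: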

location /goodnotes/ {
	alias   E:/Google/Drive/GoodNotes/;			# the trailing / is required, otherwise requests return 404
	autoindex on;							  # enable automatic directory listing
}

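alias and root behave differently, and I used alias. The main difference is how nginx interprets the URI after location: with root, the full request URI is appended to the root path; with alias, the part of the URI that matches the location is replaced by the alias path. The two therefore map requests onto server files in different ways. See "Configuring static resource access in nginx".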
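
The mapping difference can be sketched with two small string functions. This is a simplified model of prefix locations only, not nginx's actual implementation, and the helper names map_with_root and map_with_alias are hypothetical.

```python
def map_with_root(root, uri):
    """root: the full request URI is appended to the root path."""
    return root.rstrip("/") + uri


def map_with_alias(location, alias, uri):
    """alias: the location prefix of the URI is replaced by the alias path."""
    if not uri.startswith(location):
        raise ValueError("URI does not match this location")
    return alias + uri[len(location):]


uri = "/goodnotes/Interpretable/xxx.pdf"

via_alias = map_with_alias("/goodnotes/", "E:/Google/Drive/GoodNotes/", uri)
via_root = map_with_root("E:/Google/Drive", uri)
# Dropping the trailing slash from the alias glues two path segments together.
broken = map_with_alias("/goodnotes/", "E:/Google/Drive/GoodNotes", uri)

print(via_alias)
print(via_root)
print(broken)
```

The last case shows why the trailing slash on the alias matters: without it the resolved path has no separator between GoodNotes and Interpretable, the file is not found, and the request returns 404.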

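Then run nginx -s stop in a terminal to stop the service, followed by start nginx to start it again.

In principle nginx -s reload should reload the service after the conf file changes, but in my experience reload was unreliable: the old configuration stayed in effect. So I stop and then start instead. See "Fixing pages that remain accessible after nginx exits".

tasklist /fi "IMAGENAME eq nginx.exe" lists all running Nginx processes
taskkill /f /pid 16708 kills the process with PID 16708

Now entering localhost/goodnotes/Interpretable/xxx.pdf in the browser opens the local file E:/Google/Drive/GoodNotes/Interpretable/xxx.pdf.

To start the nginx service at boot: right-click nginx.exe and create a shortcut, then move the shortcut into the system Startup folder (press Win+R and enter shell:startup to open it).

So far this only supports opening PDF files. I also tried Nginx with shell-script CGI, without success; most posts online target Linux or macOS. I will try again later.

References:

Inserting local file and directory links in Notion
Nginx support for running bash/python system commands and scripts from a web page
Configuring static resource access in nginx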
