Qt + TensorRT Pipeline Notes

Qt

On data loading

The txt-format dataset is only supported for loading and preview; later inference does not support txt data.

For a C++ DataLoader, see https://krshrimali.github.io/posts/2019/07/custom-data-loading-using-pytorch-c-api/

Code related to reading .mat data files: Chart::readHRRPmat in chart.cpp, and getAllDataFromMat / getDataFromMat in trtinfer.cpp.

Currently it is assumed that the data to be used in every .mat file is stored in the variable "hrrp128" (I have not found a way to enumerate all variables in a .mat file). Methods that rely on this default variable name (a hedged .mat reading sketch follows the list):

trtInfer:getAllDataFromMat\getDataFromMat
sensepage::nextBatchChart
modelEvalPage::randSample
Chart::readRadiomat\readHrrpmat
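
For reference, a minimal sketch of pulling the default "hrrp128" variable out of a .mat file with the open-source matio library. This is not the code used in trtinfer.cpp; the variable name is simply the project default assumed above:

#include <matio.h>
#include <iostream>
#include <vector>

// Read the 2-D double matrix stored in variable "hrrp128" of a MATLAB .mat file.
std::vector<double> readHrrp128(const char* matPath, size_t& rows, size_t& cols)
{
    std::vector<double> values;
    mat_t* matfp = Mat_Open(matPath, MAT_ACC_RDONLY);
    if (!matfp) { std::cerr << "cannot open " << matPath << std::endl; return values; }

    matvar_t* var = Mat_VarRead(matfp, "hrrp128");   // the default variable name used in this project
    if (var && var->rank == 2 && var->class_type == MAT_C_DOUBLE) {
        rows = var->dims[0];
        cols = var->dims[1];
        const double* data = static_cast<const double*>(var->data);
        values.assign(data, data + rows * cols);     // column-major, as MATLAB stores it
    }
    if (var) Mat_VarFree(var);
    Mat_Close(matfp);
    return values;
}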

The dataSetClc in trtInfer is tied to the loaded model:
For single-sample inference (testOneSample), the model is loaded first to obtain its input size, so a fixed-size array can be allocated when reading data and filled with the sample's values.
For testAllSample, the model input size is likewise obtained first and passed to the dataSetClc constructor, so every loaded sample matches the input size the model expects (a sketch of querying that size from the engine follows).
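
A minimal sketch of that query against the TensorRT 7 API, assuming the input tensor is binding index 0 (not the actual trtInfer code):

#include "NvInfer.h"

// Number of elements one input sample must contain, treating any dynamic (-1) dimension as 1.
int64_t inputElementCount(const nvinfer1::ICudaEngine& engine)
{
    nvinfer1::Dims d = engine.getBindingDimensions(0); // binding 0 assumed to be the input
    int64_t count = 1;
    for (int i = 0; i < d.nbDims; ++i)
        count *= (d.d[i] > 0) ? d.d[i] : 1;
    return count;
}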

Files that use the label/classname dictionary: SocketClient::run, monitorpage, inferThread.

Every sample in a dataset must have the same length (related code: e.g. getDataSpecifically in customdataset.h).

On model inference

When setting the inference batch size, check whether it exceeds the batch size given in the --maxShapes argument used when converting to trt (a hedged check sketch follows).
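
A sketch of that check against the engine's optimization profile (TensorRT 7, assuming profile 0, input binding 0, and the batch in dimension 0):

#include "NvInfer.h"

// True if the requested batch does not exceed the kMAX batch baked in via --maxShapes.
bool batchWithinProfile(const nvinfer1::ICudaEngine& engine, int requestedBatch)
{
    nvinfer1::Dims maxDims = engine.getProfileDimensions(
        0 /*input binding*/, 0 /*profile*/, nvinfer1::OptProfileSelector::kMAX);
    return requestedBatch <= maxDims.d[0];
}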

The afs.trt model file loaded by the system must come from trainLogs (next to the trt file there must be a model folder containing an attention.txt file; the related code is in ModelEvalPage::testAllSample). A sketch of that directory check follows.
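
A small Qt sketch of the kind of check described above; it is not the actual ModelEvalPage code, only the "model/attention.txt next to the .trt file" layout taken from the note:

#include <QFileInfo>
#include <QDir>

// True if the .trt file comes with the expected trainLogs layout:
// a sibling "model" folder containing "attention.txt".
bool trtFromTrainLogs(const QString& trtPath)
{
    QFileInfo trtInfo(trtPath);
    if (!trtInfo.exists()) return false;
    QDir modelDir(trtInfo.absoluteDir().filePath("model"));
    return modelDir.exists() && modelDir.exists("attention.txt");
}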

On bugs

Be sure to kill the training process before exiting the program; otherwise the program crashes the next time it is launched (a hedged Qt sketch follows).
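
A hedged Qt sketch of the kind of guard that avoids this, assuming the training process is launched through a QProcess member (hypothetical name trainProcess; not the project's actual code):

#include <QProcess>

// Call this before the application quits so no orphaned training process is left behind.
void stopTraining(QProcess* trainProcess)
{
    if (trainProcess && trainProcess->state() != QProcess::NotRunning) {
        trainProcess->kill();                 // terminate() is gentler; kill() is forceful
        trainProcess->waitForFinished(3000);  // wait up to 3 s for it to exit
    }
}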

On calling DLLs generated from MATLAB functions

How to call Matlab functions from a C++ program (project)
Experience calling a MATLAB-generated dll from Qt
Mixed programming in C++ with a Matlab-generated DLL (win10 + VS2015 + Matlab2016b)
Calling a matlab dll from C++ (fixing the initialization-failure problem)

Current MATLAB code for converting radio101 to hrrp128:

function retn = ToHrrp(sPath,dPath)

retn=0;
data6 = load(sPath);
data6 = getfield(data6,'radio101');

X = data6;

mid =size(X);
down_list = 1:1:mid(1);
X = X(down_list,:);

[frqNum,dataNum] = size(X)
win = 'hamming';
N_fft = 2^nextpow2(frqNum);

point = N_fft/frqNum;

w = window(win, frqNum);
Rng0 = ifftshift(ifft(w,N_fft))*point;
maxRng0 = max(abs(Rng0));

x = zeros(N_fft, dataNum);
for n = 1:dataNum
    Xw = X(:,n).*w;                             % apply the window to the data
    x(:,n) = ifftshift(ifft(Xw,N_fft))*point;   % IFFT back to the time domain
end
x = x./maxRng0;                                 % remove the windowing effect on the amplitude
hrrp128 = log(abs(x));
% x_dB = log(abs(x))/log(20);
save(dPath,'hrrp128')
retn=1;
end

(1)mex -setup

From the two options shown, choose: mex -setup C++

(2) mbuild -setup

From the two options shown, choose: mex -setup C++ -client MBUILD

2. Write a .m function and generate the C++ files

Write a .m file as the project requires, then generate the corresponding C++ files as follows.

(1) Write a function named ZSLAdd.m that adds two numbers

(2) Compile to generate the C++ files

Change Matlab's current directory to the folder containing ZSLAdd.m and run the following in the Command Window:

mcc -W cpplib:ZAdd -T link:lib ZAdd.m -C

Replace the bolded name with your own .m function.

After a while, a series of files is generated in the current directory; the files with these 4 extensions matter most: .lib, .h, .dll, .ctf.

In Qt, include the .h file via the .pro, link the dynamic library with -lxxx in the .pri, and put the .ctf, .dll and .lib files into the build directory (the release folder); otherwise ToHrrpInitialize() fails to initialize (a hedged calling sketch follows).
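
A rough sketch of how the mcc-generated ToHrrp library is typically called from C++. The exact exported signature is whatever mcc writes into ToHrrp.h (the usual <lib>Initialize/<lib>Terminate pattern is assumed here), and the paths are hypothetical; treat this as a pattern, not the project's code:

#include "ToHrrp.h"        // header generated by mcc -W cpplib:ToHrrp
#include <iostream>

int main()
{
    // Start the MATLAB Compiler Runtime, then the generated library.
    if (!mclInitializeApplication(nullptr, 0)) return 1;
    if (!ToHrrpInitialize()) { std::cerr << "ToHrrpInitialize() failed" << std::endl; return 1; }
    {
        mwArray src("E:/data/radio101.mat");   // hypothetical source .mat
        mwArray dst("E:/data/hrrp128.mat");    // hypothetical destination .mat
        mwArray retn;
        ToHrrp(1, retn, src, dst);             // nargout = 1, mirroring: retn = ToHrrp(sPath, dPath)
        std::cout << "retn = " << retn << std::endl;
    }   // mwArray objects must be destroyed before terminating the library
    ToHrrpTerminate();
    mclTerminateApplication();
    return 0;
}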

TensorRT Example

Full workflow using pytorch as the runtime: introduce, introduce2; accelerated application to image segmentation: tutorial.

Step 1: get an ONNX model (optional)

First obtain the model in ONNX format; with pytorch this is straightforward:

torch.save(model.state_dict(), 'E:\\model_only_par.pt')
onnx_save_path = "E:\\model.onnx"
example_tensor = torch.randn(1, 1, 512).to(device)
torch.onnx.export(model, # model being run
example_tensor, # model input (or a tuple for multiple inputs)
onnx_save_path,
verbose=False,
training=False,
do_constant_folding=True,
input_names=['input'],
output_names=['output']
)

Here model must be an instantiated (initialized) object.

Run a quick inference with the ONNX model:

import onnx,torch
import numpy as np
import onnxruntime as rt

def getTensorFromTXT(filePath):
file_data=[]
t=0
with open(filePath, 'r') as f:
for line in f.readlines():
if(t< 2):
t+=1
continue
a=line[line.rfind(" ")+1:-1]
if(a!=""):
file_data.append(float(a))
outtensor=torch.tensor(file_data)
outtensor=(outtensor-min(outtensor)) / (max(outtensor) - min(outtensor))
outtensor = outtensor.reshape([1, 512])
return outtensor
input=getTensorFromTXT("E:\\207Project\\Data\\HRRP\\Ball_bottom_cone\\21.txt").numpy()
input=input.reshape([1,1,512])
sess = rt.InferenceSession('E:\\model.onnx')
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], {input_name:input.astype(np.float32)})[0]
print(pred_onx)
print(np.argmax(pred_onx))

Step 2: get the TensorRT engine:

ONNX models can be converted to serialized TensorRT engines using the onnx2trt executable:

onnx2trt my_model.onnx -o my_engine.trt

This can also be done with the trtexec command:

trtexec.exe --onnx=E:/model.onnx --saveEngine=E:/resnet_engine.trt --explicitBatch=1

Method 1: using ONNX

Take an existing ONNX model, parse it with TensorRT's parser and populate a network object. Steps:

  1. Create a logger. One is required, though it is not that important: static Logger gLogger;

  2. Create a builder

    IBuilder* builder = createInferBuilder(gLogger);
  3. Create a network; at this point the network is just an empty shell

    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
  4. Create a parser. Caffe, ONNX and TF models each have a corresponding parser; as the name suggests, it is used to parse the model file.

    auto parser = nvonnxparser::createParser(*network, gLogger); // this parser serves this network
    // parse the ONNX model
    std::string onnx_filename = "E:/model.onnx";
    parser->parseFromFile(onnx_filename.c_str(), 2);
    for (int i = 0; i < parser->getNbErrors(); ++i){
    std::cout << parser->getError(i)->desc() << std::endl;
    }
  5. Build the engine, which performs layer fusion and chooses the precision/calibration mode: fp32, fp16 or int8. Approach: Builder(Net + Config).

    // create the builder config and build the inference engine
    IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 20);
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
  6. Create a context, which is what actually runs inference: it connects to the engine above and to the inference data below.

    IExecutionContext* context = engine->createExecutionContext();
  7. Run inference (this involves allocating and transferring memory; best wrapped in a function of your own).

    void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
    {
    const ICudaEngine& engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void* buffers[2] = { NULL,NULL };

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    //const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    //const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);

    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[0], batchSize * INPUT_H * INPUT_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[1], batchSize * OUTPUT_SIZE * sizeof(float)));

    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    // start inference
    std::cout << "start to infer image..." << std::endl;
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[0]));
    CHECK(cudaFree(buffers[1]));
    std::cout << "Inference Done." << std::endl;
    }
    // Run inference
    float data[512] = { 0 };
    float prob[5] = { 0 };
    getTensorFromTXT("E:\\207Project\\Data\\HRRP\\DT\\22.txt", data);

    LARGE_INTEGER t1, t2, tc;
    QueryPerformanceFrequency(&tc);
    QueryPerformanceCounter(&t1);
    doInference(*context, data, prob, 1);
    QueryPerformanceCounter(&t2);
    double time = (double)(t2.QuadPart - t1.QuadPart) / (double)tc.QuadPart;
    std::cout << "time = " << time << std::endl; //输出时间(单位:s)

Method 2: deserialize a trt engine file

With Method 1, parsing the ONNX model for every inference run is slow. If, after building the engine once, it is serialized to a trt file, later runs can create the engine directly from that file, which is faster.

Saving the engine as a .trt file

//Save .trt
nvinfer1::IHostMemory* datas = engine->serialize();
std::ofstream file;
file.open("E:/model.trt", std::ios::binary | std::ios::out);
std::cout << "writing engine file..." << std::endl;
file.write((const char*)datas->data(), datas->size());
std::cout << "save engine file done" << std::endl;
file.close();

Creating the engine from the .trt file

bool read_TRT_File(const std::string& engineFile, IHostMemory*& trtModelStream, ICudaEngine*& engine)
{
std::fstream file;
std::cout << "loading filename from:" << engineFile << std::endl;
nvinfer1::IRuntime* trtRuntime;
//nvonnxparser::IPluginFactory* onnxPlugin = createPluginFactory(gLogger.getTRTLogger());
file.open(engineFile, std::ios::binary | std::ios::in);
file.seekg(0, std::ios::end);
int length = file.tellg();
std::cout << "length:" << length << std::endl;
file.seekg(0, std::ios::beg);
std::unique_ptr<char[]> data(new char[length]);
file.read(data.get(), length);
file.close();
std::cout << "load engine done" << std::endl;
std::cout << "deserializing" << std::endl;
trtRuntime = createInferRuntime(gLogger.getTRTLogger());
//ICudaEngine* engine = trtRuntime->deserializeCudaEngine(data.get(), length, onnxPlugin);
engine = trtRuntime->deserializeCudaEngine(data.get(), length, nullptr);
std::cout << "deserialize done" << std::endl;
assert(engine != nullptr);
std::cout << "The engine in TensorRT.cpp is not nullptr" << std::endl;
trtModelStream = engine->serialize();
return true;
}

IHostMemory* modelStream{ nullptr };
ICudaEngine* engine{ nullptr };
if (read_TRT_File("E:/model.trt",modelStream, engine)) std::cout << "tensorRT engine created successfully." << std::endl;
else std::cout << "tensorRT engine created failed." << std::endl;
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);

Method 3: from a network weights (.wts) file

See the Github project and Zhihu blog posts: create a runtime and deserialize the engine.

Installing TensorRT

Download TensorRT and unzip it, and add lib to PATH. Reportedly CUDA 11.1 matches TensorRT 7.2.3, so version 7.2.3 is used here.

Copy the unzipped bin, include and lib\ directories into the CUDA installation path:

Install the required packages graphsurgeon and onnx_graphsurgeon; both wheels ship inside the TensorRT package, so just pip install xx.whl.

pycuda

The pycuda package needs to be downloaded from: https://www.lfd.uci.edu/~gohlke/pythonlibs/
The local CUDA version is 11.1, so from the site above I downloaded pycuda-2021.1+cuda114-cp38-cp38-win_amd64.whl.
In the download directory, run pip install pycuda-2021.1+cuda114-cp38-cp38-win_amd64.whl to install it (turn the proxy off).

Test

  1. Open the TensorRT-7.2.3\samples\sampleMNIST\sample_mnist.sln project from the unzipped package in VS, then choose Rebuild.
  2. Run download_pgms.py under TensorRT-7.2.3\data\mnist with python.
  3. Go to TensorRT-7.2.3\bin and run from cmd: sample_mnist.exe --datadir=your\path\to\TensorRT-7.2.3\data\mnist\

The error showed that cuDNN was missing cublasLt64_10.dll and Zlib (Zlib is a data compression software library that is needed by cuDNN).
Downloaded and installed Zlib; the CUDA bin folder has a cublasLt64_11.dll, so I copied it and renamed the copy to cublasLt64_10.dll, which silenced that error.
Add the directory path of zlibwapi.dll to the environment variable PATH. It still errored, so I put the dll straight into CUDA's bin.
Still failing, I began to suspect the cuDNN version, switched cuDNN from 8.3.3 to 8.4.1, and it ran successfully!


Windows + python38 environment: import tensorrt

I simply could not import tensorrt in the python environment: the tutorial says to pip install nvidia-pyindex first, but that package would not download for me; only switching python versions (3.8.8 -> 3.9.13) solved the pyindex download problem. (Took an entire afternoon.)

Even now pip cannot install tensorrt; nearly every method failed, and some reports say the python TensorRT bindings are not supported on Windows, so I gave up on import tensorrt and use the C++ runtime directly.

Model training, testing and deployment

pytorch code for training an HRRP recognition model

In getTensorFromTXT, should the list be converted to a tensor or to numpy? Practice shows the tensor path has lower numeric precision than numpy.

import torch,os
from torch.utils.data import Dataset
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

def load_data_from_folder(datasetPath):
ims, labels, class_list = [], [], []
g = os.walk(r"E:\207Project\Data\HRRP")
for path, dir_list, file_list in g:
for dir_name in dir_list:
class_list.append(dir_name)
class_index = dict(zip(class_list, range(len(class_list))))
print("类别对应序号:")
print(class_index)
g = os.walk(r"E:\207Project\Data\HRRP")
for path, dir_list, file_list in g:
for file_name in file_list:
if (file_name[file_name.rfind('.') + 1:] != "txt"):
continue
ims.append(os.path.join(path, file_name))
im_class = path[path.rfind('\\') + 1:]
labels.append(int(class_index[im_class]))
return ims, labels
def getTensorFromTXT(filePath):
file_data=[]
t=0
with open(filePath, 'r') as f:
for line in f.readlines():
if(t< 2):
t+=1
continue
a=line[line.rfind(" ")+1:-1]
if(a!=""):
file_data.append(float(a))
outtensor=torch.tensor(file_data)
outtensor=(outtensor-min(outtensor)) / (max(outtensor) - min(outtensor))
outtensor = outtensor.reshape([1, 512])
return outtensor

# define a Dataset
class mDataset(Dataset):
def __init__(self, datasetPath, trainOrtest):
self.ims, self.labels = load_data_from_folder(datasetPath)
def __getitem__(self, index):
im = getTensorFromTXT(self.ims[index])
label = self.labels[index]
return im, label
def __len__(self):
return len(self.ims)

train_dataset= mDataset(r"E:\207Project\Data\HRRP",0)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size = 1, shuffle = True, num_workers = 0)


# define the network
class CNN(nn.Module):
def __init__(self):
super(CNN, self).__init__()
self.conv1 = nn.Conv1d(1, 16, 5, 1)
self.conv2 = nn.Conv1d(16, 32, 5, 1)
self.fc1 = nn.Linear(4000, 512)
self.fc2 = nn.Linear(512, 5)

def forward(self,x):
x=F.relu(self.conv1(x))
x=F.max_pool1d(x,2)
x=F.relu(self.conv2(x))
x=F.max_pool1d(x,2)
x=x.view(-1,4000)
x=F.relu(self.fc1(x))
x=self.fc2(x)
return F.log_softmax(x,dim=1)

device=torch.device("cuda" if torch.cuda.is_available() else "cpu" )
#device=torch.device("cpu")
model=CNN().to(device)
optimizer=optim.Adam(model.parameters(),lr=1e-3)

def train(model,device,train_loader,optimizer,epoch,losses):
model.train()
for idx,(t_data,t_target) in enumerate(train_loader):
input = Variable(t_data).cuda()
target = Variable(t_target).cuda().long()
pred=model(input)
loss=F.nll_loss(pred,target)
#Adam
optimizer.zero_grad()
loss.backward()
optimizer.step()
if idx%10==0:
print("epoch:{},iteration:{},loss:{}".format(epoch,idx,loss.item()))
losses.append(loss.item())


def test(model,device,test_loader):
model.eval()
correct=0  # number of correct predictions
with torch.no_grad():
for idx,(t_data,t_target) in enumerate(test_loader):
t_data,t_target=t_data.to(device),t_target.to(device)
pred=model(t_data)#batch_size*2
pred_class=pred.argmax(dim=1)#batch_size*2->batch_size*1
correct+=pred_class.eq(t_target.view_as(pred_class)).sum().item()
acc=correct/len(test_loader.dataset)
print("accuracy:{}".format(acc))


num_epochs=5
losses=[]

for epoch in range(num_epochs):
train(model,device,train_loader,optimizer,epoch,losses)

torch.save(model.state_dict(), 'E:\\model_only_par.pt')
onnx_save_path = "E:\\model.onnx"
example_tensor = torch.randn(1, 1, 512).to(device)
torch.onnx.export(model, # model being run
example_tensor, # model input (or a tuple for multiple inputs)
onnx_save_path,
verbose=False,
training=False,
do_constant_folding=True,
input_names=['input'],
output_names=['output']
)

Loading the pytorch model and running recognition inference

import torch,os
from torch.utils.data import Dataset
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable


class CNN(nn.Module):
def __init__(self):
super(CNN, self).__init__()
self.conv1 = nn.Conv1d(1, 16, 5, 1)
self.conv2 = nn.Conv1d(16, 32, 5, 1)
self.fc1 = nn.Linear(4000, 512)
self.fc2 = nn.Linear(512, 5)  # output layer sized to the 5 HRRP classes

def forward(self,x):
x=F.relu(self.conv1(x))
x=F.max_pool1d(x,2)
x=F.relu(self.conv2(x))
x=F.max_pool1d(x,2)
x=x.view(-1,4000)
x=F.relu(self.fc1(x))
x=self.fc2(x)
return F.log_softmax(x,dim=1)

device=torch.device("cuda" if torch.cuda.is_available() else "cpu" )
#device=torch.device("cpu")
model=CNN().to(device)
model.load_state_dict(torch.load('E:\\model_only_par.pt'))

model.eval()

def getTensorFromTXT(filePath):
file_data=[]
t=0
with open(filePath, 'r') as f:
for line in f.readlines():
if(t< 2):
t+=1
continue
a=line[line.rfind(" ")+1:-1]
if(a!=""):
file_data.append(float(a))
outtensor=torch.tensor(file_data)
outtensor=(outtensor-min(outtensor)) / (max(outtensor) - min(outtensor))
outtensor = outtensor.reshape([1, 512])
return outtensor
input=getTensorFromTXT("E:\\207Project\\Data\\HRRP\\DT\\21.txt")
input=input.reshape([1,1,512]).to(device)
output=model(input)
print(output)

Running inference with the ONNX model

import onnx,torch
import numpy as np
import onnxruntime as rt

def getTensorFromTXT(filePath):
file_data=[]
t=0
with open(filePath, 'r') as f:
for line in f.readlines():
if(t< 2):
t+=1
continue
a=line[line.rfind(" ")+1:-1]
if(a!=""):
file_data.append(float(a))
#outtensor=torch.tensor(file_data)
outtensor = np.array(file_data, np.float32)
outtensor=(outtensor-min(outtensor)) / (max(outtensor) - min(outtensor))
outtensor = outtensor.reshape([1, 512])
return outtensor
input=getTensorFromTXT("E:\\207Project\\Data\\HRRP\\Ball_bottom_cone\\21.txt")  # already a numpy array now
input=input.reshape([1,1,512])
sess = rt.InferenceSession('E:\\model.onnx')
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], {input_name:input.astype(np.float32)})[0]
print(pred_onx)
print(np.argmax(pred_onx))

First accelerated inference

----------------------------------------------------------------
Input filename: E:/model.onnx
ONNX IR version: 0.0.7
Opset version: 9
Producer name: pytorch
Producer version: 1.10
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[07/09/2022-16:20:16] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
tensorRT load onnx mnist model...
[07/09/2022-16:20:17] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[07/09/2022-16:20:57] [W] [TRT] Try increasing the workspace size to 4194304 bytes to get better performance.
[07/09/2022-16:21:07] [W] [TRT] Try increasing the workspace size to 4194304 bytes to get better performance.
[07/09/2022-16:21:11] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[07/09/2022-16:21:11] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
input_blob_name : input
output_blob_name : output
inputH : 1, inputW: 512
start to infer image...
Inference Done.

Output:

-385.375, -181.5, -289.125, -621.375, 0,

D:\code\CPP\tensorrtProj\build\Debug\tensorrtProj.exe (process 8908) exited with code 0.

V2: switch the tensor to numpy

Also swapped the torch build for the GPU version.

----------------------------------------------------------------
Input filename: E:/model.onnx
ONNX IR version: 0.0.7
Opset version: 13
Producer name: pytorch
Producer version: 1.12.0
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[07/10/2022-18:26:03] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
tensorRT load onnx mnist model...
[07/10/2022-18:26:05] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[07/10/2022-18:27:00] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[07/10/2022-18:27:00] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
input_blob_name : input
output_blob_name : output
inputH : 1, inputW: 512
start to infer image...
Inference Done.

Output:

-143.479, -205.623, -4.53847, -619.006, -0.0107473,

V3: using the serialized .trt engine

loading filename from:E:/model.trt
length:4129238
load engine done
deserializing
[07/11/2022-14:23:37] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
deserialize done
The engine in TensorRT.cpp is not nullptr
tensorRT engine created successfully.
[07/11/2022-14:23:37] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
start to infer
Inference Done.
time = 0.0018287
Output:
-9.46875, -17.5312, -10.1875, -0.000114679, -21.8438,

C++: ONNX-to-trt conversion and inference

#include "NvInfer.h"
#include "nvonnxparser.h"
#include "cuda_runtime_api.h"
#include "logging.h"
#include <fstream>
#include <map>
#include <chrono>
#include <algorithm>
#include <Windows.h>

#define CHECK(status) \
do\
{\
auto ret = (status);\
if (ret != 0)\
{\
std::cerr << "Cuda failure: " << ret << std::endl;\
abort();\
}\
} while (0)

// stuff we know about the network and the input/output blobs
static const int INPUT_H = 1;
static const int INPUT_W = 512;
static const int OUTPUT_SIZE = 5;

const char* INPUT_BLOB_NAME = "data";
const char* OUTPUT_BLOB_NAME = "prob";

using namespace nvinfer1;

static Logger gLogger;

// Load weights from files shared with TensorRT samples.
// TensorRT weight files have a simple space delimited format:
// [type] [size] <data x size in hex>
void getTensorFromTXT(std::string data_path,float* y) {
int r, n = 0; double d; FILE* f;
float temp[1024];
f = fopen(data_path.c_str(), "r");
for (int i = 0; i < 2; i++) fscanf(f, "%*[^\n]%*c"); // skip the first two header lines
for (int i = 0; i < 1024; i++) {
r = fscanf(f, "%lf", &d);
if (1 == r) temp[n++] = d;
else if (0 == r) fscanf(f, "%*c");
else break;
}
fclose(f);
for (int i = 0; i < 512; i++) {
y[i] = temp[i*2 + 1];
}

std::vector<float> features; // temporary feature vector
for (int d = 0; d < 512; ++d)
features.push_back(y[d]);
// normalize the features
float dMaxValue = *std::max_element(features.begin(), features.end()); // maximum
float dMinValue = *std::min_element(features.begin(), features.end()); // minimum
for (int f = 0; f < features.size(); ++f) {
y[f] = (y[f] - dMinValue) / (dMaxValue - dMinValue + 1e-8);
}
features.clear(); // release the temporary vector
}


void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
//const ICudaEngine& engine = context.getEngine();

//// Pointers to input and output device buffers to pass to engine.
//// Engine requires exactly IEngine::getNbBindings() number of buffers.
//assert(engine.getNbBindings() == 2);
void* buffers[2] = { NULL,NULL };

// In order to bind the buffers, we need to know the names of the input and output tensors.
// Note that indices are guaranteed to be less than IEngine::getNbBindings()
//const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
//const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);

// Create GPU buffers on device
CHECK(cudaMalloc(&buffers[0], batchSize * INPUT_H * INPUT_W * sizeof(float)));
CHECK(cudaMalloc(&buffers[1], batchSize * OUTPUT_SIZE * sizeof(float)));


/*for (int i = 0; i < batchSize * INPUT_H * INPUT_W; i++) {
std::cout << input[i] << " ";
}std::cout << std::endl << "input vector dump done" << std::endl;*/

// Create stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
// start inference
std::cout << "start to infer ..." << std::endl;
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);

// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[0]));
CHECK(cudaFree(buffers[1]));
std::cout << "Inference Done." << std::endl;
}
bool read_TRT_File(const std::string& engineFile, IHostMemory*& trtModelStream, ICudaEngine*& engine)
{
std::fstream file;
std::cout << "loading filename from:" << engineFile << std::endl;
nvinfer1::IRuntime* trtRuntime;
//nvonnxparser::IPluginFactory* onnxPlugin = createPluginFactory(gLogger.getTRTLogger());
file.open(engineFile, std::ios::binary | std::ios::in);
file.seekg(0, std::ios::end);
int length = file.tellg();
std::cout << "length:" << length << std::endl;
file.seekg(0, std::ios::beg);
std::unique_ptr<char[]> data(new char[length]);
file.read(data.get(), length);
file.close();
std::cout << "load engine done" << std::endl;
std::cout << "deserializing" << std::endl;
trtRuntime = createInferRuntime(gLogger.getTRTLogger());
//ICudaEngine* engine = trtRuntime->deserializeCudaEngine(data.get(), length, onnxPlugin);
engine = trtRuntime->deserializeCudaEngine(data.get(), length, nullptr);
std::cout << "deserialize done" << std::endl;
assert(engine != nullptr);
std::cout << "The engine in TensorRT.cpp is not nullptr" << std::endl;
trtModelStream = engine->serialize();
return true;
}
int main(int argc, char** argv)
{
//IBuilder* builder = createInferBuilder(gLogger);
//nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
//auto parser = nvonnxparser::createParser(*network, gLogger);
//// parse the ONNX model
//std::string onnx_filename = "E:/model.onnx";
//parser->parseFromFile(onnx_filename.c_str(), 2);
//for (int i = 0; i < parser->getNbErrors(); ++i)
//{
// std::cout << parser->getError(i)->desc() << std::endl;
//}
//printf("tensorRT load onnx model...\n");
//// create the inference engine
//IBuilderConfig* config = builder->createBuilderConfig();
//assert(config != nullptr);
//config->setMaxWorkspaceSize(1 << 22);//4194304
//config->setFlag(nvinfer1::BuilderFlag::kFP16);
//ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
//assert(engine != nullptr);
//IExecutionContext* context = engine->createExecutionContext();
//assert(context != nullptr);
IHostMemory* modelStream{ nullptr };
ICudaEngine* engine{ nullptr };
if (read_TRT_File("E:/model.trt",modelStream, engine)) std::cout << "tensorRT engine created successfully." << std::endl;
else std::cout << "tensorRT engine created failed." << std::endl;
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);


//// get the input/output names and shapes
//const char* input_blob_name = network->getInput(0)->getName();
//const char* output_blob_name = network->getOutput(0)->getName();
//printf("input_blob_name : %s \n", input_blob_name);
//printf("output_blob_name : %s \n", output_blob_name);
//const int inputH = network->getInput(0)->getDimensions().d[1];
//const int inputW = network->getInput(0)->getDimensions().d[2];
//printf("inputH : %d, inputW: %d \n", inputH, inputW);


// Run inference
float data[512] = { 0 };
float prob[5] = { 0 };
getTensorFromTXT("E:\\207Project\\Data\\HRRP\\Ball_bottom_cone\\21.txt", data);

LARGE_INTEGER t1, t2, tc;
QueryPerformanceFrequency(&tc);
QueryPerformanceCounter(&t1);
doInference(*context, data, prob, 1);
QueryPerformanceCounter(&t2);
double time = (double)(t2.QuadPart - t1.QuadPart) / (double)tc.QuadPart;
std::cout << "time = " << time << std::endl; //输出时间(单位:s)

// Print histogram of the output distribution
std::cout << "Output:\n";
for (unsigned int i = 0; i < 5; i++)
{
std::cout << prob[i] << ", ";
}
std::cout << std::endl;

//Save .trt
/*nvinfer1::IHostMemory* datas = engine->serialize();
std::ofstream file;
file.open("E:/model.trt", std::ios::binary | std::ios::out);
std::cout << "writing engine file..." << std::endl;
file.write((const char*)datas->data(), datas->size());
std::cout << "save engine file done" << std::endl;
file.close();*/

// Destroy the engine
context->destroy();
engine->destroy();
return 0;
}

ONNX to trt

#include "NvInfer.h"
#include "nvonnxparser.h"
#include "cuda_runtime_api.h"
#include "logging.h"
#include <fstream>
#include <map>
#include <chrono>
#include <algorithm>
#include <Windows.h>

#define CHECK(status) \
do\
{\
auto ret = (status);\
if (ret != 0)\
{\
std::cerr << "Cuda failure: " << ret << std::endl;\
abort();\
}\
} while (0)

using namespace nvinfer1;

static Logger gLogger;


int main(int argc, char** argv)
{
IBuilder* builder = createInferBuilder(gLogger);
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
auto parser = nvonnxparser::createParser(*network, gLogger);
// parse the ONNX model
std::string onnx_filename = "E:/model.onnx";
parser->parseFromFile(onnx_filename.c_str(), 2);
for (int i = 0; i < parser->getNbErrors(); ++i)
{
std::cout << parser->getError(i)->desc() << std::endl;
}
printf("tensorRT load onnx model...\n");
// create the inference engine
IBuilderConfig* config = builder->createBuilderConfig();
assert(config != nullptr);
config->setMaxWorkspaceSize(1 << 22);//4194304
config->setFlag(nvinfer1::BuilderFlag::kFP16);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
assert(engine != nullptr);
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);

//Save .trt
nvinfer1::IHostMemory* datas = engine->serialize();
std::ofstream file;
file.open("E:/model.trt", std::ios::binary | std::ios::out);
std::cout << "writing engine file..." << std::endl;
file.write((const char*)datas->data(), datas->size());
std::cout << "save engine file done" << std::endl;
file.close();

// Destroy the engine
context->destroy();
engine->destroy();
return 0;
}

Converting the pytorch ONNX model to trt: (screenshot of the conversion output omitted)

Reading the trt file and running inference

#include "NvInfer.h"
#include "nvonnxparser.h"
#include "cuda_runtime_api.h"
#include "logging.h"
#include <fstream>
#include <map>
#include <chrono>
#include <algorithm>
#include <Windows.h>

#define CHECK(status) \
do\
{\
auto ret = (status);\
if (ret != 0)\
{\
std::cerr << "Cuda failure: " << ret << std::endl;\
abort();\
}\
} while (0)

// stuff we know about the network and the input/output blobs
static const int INPUT_H = 1;
static const int INPUT_W = 512;
static const int OUTPUT_SIZE = 5;

const char* INPUT_BLOB_NAME = "data";
const char* OUTPUT_BLOB_NAME = "prob";

using namespace nvinfer1;

static Logger gLogger;


void getTensorFromTXT(std::string data_path, float* y) {
int r, n = 0; double d; FILE* f;
float temp[1024];
f = fopen(data_path.c_str(), "r");
for (int i = 0; i < 2; i++) fscanf(f, "%*[^\n]%*c"); // skip the first two header lines
for (int i = 0; i < 1024; i++) {
r = fscanf(f, "%lf", &d);
if (1 == r) temp[n++] = d;
else if (0 == r) fscanf(f, "%*c");
else break;
}
fclose(f);
for (int i = 0; i < 512; i++) {
y[i] = temp[i * 2 + 1];
}

std::vector<float> features; // temporary feature vector
for (int d = 0; d < 512; ++d)
features.push_back(y[d]);
// normalize the features
float dMaxValue = *std::max_element(features.begin(), features.end()); // maximum
float dMinValue = *std::min_element(features.begin(), features.end()); // minimum
for (int f = 0; f < features.size(); ++f) {
y[f] = (y[f] - dMinValue) / (dMaxValue - dMinValue + 1e-8);
}
features.clear(); // release the temporary vector
}


void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
//const ICudaEngine& engine = context.getEngine();

//// Pointers to input and output device buffers to pass to engine.
//// Engine requires exactly IEngine::getNbBindings() number of buffers.
//assert(engine.getNbBindings() == 2);
void* buffers[2] = { NULL,NULL };

// In order to bind the buffers, we need to know the names of the input and output tensors.
// Note that indices are guaranteed to be less than IEngine::getNbBindings()
CHECK(cudaMalloc(&buffers[0], batchSize * INPUT_H * INPUT_W * sizeof(float)));
CHECK(cudaMalloc(&buffers[1], batchSize * OUTPUT_SIZE * sizeof(float)));
/*for (int i = 0; i < batchSize * INPUT_H * INPUT_W; i++) {
std::cout << input[i] << " ";
}std::cout << std::endl << "input vector dump done" << std::endl;*/
// Create stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
// start inference
std::cout << "start to infer ..." << std::endl;
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);

// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[0]));
CHECK(cudaFree(buffers[1]));
std::cout << "Inference Done." << std::endl;
}
bool read_TRT_File(const std::string& engineFile, IHostMemory*& trtModelStream, ICudaEngine*& engine)
{
std::fstream file;
std::cout << "loading filename from:" << engineFile << std::endl;
nvinfer1::IRuntime* trtRuntime;
//nvonnxparser::IPluginFactory* onnxPlugin = createPluginFactory(gLogger.getTRTLogger());
file.open(engineFile, std::ios::binary | std::ios::in);
file.seekg(0, std::ios::end);
int length = file.tellg();
std::cout << "length:" << length << std::endl;
file.seekg(0, std::ios::beg);
std::unique_ptr<char[]> data(new char[length]);
file.read(data.get(), length);
file.close();
std::cout << "load engine done" << std::endl;
std::cout << "deserializing" << std::endl;
trtRuntime = createInferRuntime(gLogger.getTRTLogger());
//ICudaEngine* engine = trtRuntime->deserializeCudaEngine(data.get(), length, onnxPlugin);
engine = trtRuntime->deserializeCudaEngine(data.get(), length, nullptr);
std::cout << "deserialize done" << std::endl;
assert(engine != nullptr);
std::cout << "The engine in TensorRT.cpp is not nullptr" << std::endl;
trtModelStream = engine->serialize();
return true;
}
int main(int argc, char** argv)
{
IHostMemory* modelStream{ nullptr };
ICudaEngine* engine{ nullptr };
if (read_TRT_File("E:/model.trt", modelStream, engine)) std::cout << "tensorRT engine created successfully." << std::endl;
else std::cout << "tensorRT engine created failed." << std::endl;
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);
// Run inference
float data[512] = { 0 };
float prob[5] = { 0 };
getTensorFromTXT("E:\\207Project\\Data\\HRRP\\Ball_bottom_cone\\21.txt", data);

LARGE_INTEGER t1, t2, tc;
QueryPerformanceFrequency(&tc);
QueryPerformanceCounter(&t1);
doInference(*context, data, prob, 1);
QueryPerformanceCounter(&t2);
double time = (double)(t2.QuadPart - t1.QuadPart) / (double)tc.QuadPart;
std::cout << "time = " << time << std::endl; //输出时间(单位:s)

// Print histogram of the output distribution
std::cout << "Output:\n";
for (unsigned int i = 0; i < 5; i++)
{
std::cout << prob[i] << ", ";
}
std::cout << std::endl;

// Destroy the engine
context->destroy();
engine->destroy();
return 0;
}

Setting up dynamic input

#include "NvInfer.h"
#include "nvonnxparser.h"
#include "cuda_runtime_api.h"
#include "logging.h"
#include <fstream>
#include <map>
#include <chrono>
#include <algorithm>
#include <Windows.h>

#define CHECK(status) \
do\
{\
auto ret = (status);\
if (ret != 0)\
{\
std::cerr << "Cuda failure: " << ret << std::endl;\
abort();\
}\
} while (0)

using namespace nvinfer1;

static Logger gLogger;

int main(int argc, char** argv)
{
nvinfer1::Dims mPredictionInputDims; //!< The dimensions of the input of the model.
nvinfer1::Dims mPredictionOutputDims; //!< The dimensions of the output of the model.

IBuilder* builder = createInferBuilder(gLogger);
//Creating the preprocessing network
auto preprocessorNetwork = builder->createNetworkV2(1U << static_cast<int32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
auto input = preprocessorNetwork->addInput("input", nvinfer1::DataType::kFLOAT, Dims4{ -1, 1, -1, -1 });
auto resizeLayer = preprocessorNetwork->addResize(*input);
resizeLayer->setOutputDimensions(mPredictionInputDims);
preprocessorNetwork->markOutput(*resizeLayer->getOutput(0));
//create an empty full-dims network, and parser
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
auto parser = nvonnxparser::createParser(*network, gLogger);
//parse the model file to populate the network
std::string onnx_filename = "E:/tfmodel_speciInput.onnx";
parser->parseFromFile(onnx_filename.c_str(), 2);
for (int i = 0; i < parser->getNbErrors(); ++i)
{
std::cout << parser->getError(i)->desc() << std::endl;
}
printf("tensorRT load onnx model...\n");

//configure optimization profile & preprocess engine
auto preprocessorConfig = builder->createBuilderConfig();
auto profile = builder->createOptimizationProfile();
profile->setDimensions(input->getName(), OptProfileSelector::kMIN, Dims4{ 1, 1, 1, 1 });
profile->setDimensions(input->getName(), OptProfileSelector::kOPT, Dims4{ 1, 1, 28, 28 });
profile->setDimensions(input->getName(), OptProfileSelector::kMAX, Dims4{ 1, 1, 56, 56 });
preprocessorConfig->addOptimizationProfile(profile);
//Create an optimization profile for calibration
auto profileCalib = builder->createOptimizationProfile();
const int calibBatchSize{ 256 };
profileCalib->setDimensions(input->getName(), OptProfileSelector::kMIN, Dims4{ calibBatchSize, 1, 28, 28 });
profileCalib->setDimensions(input->getName(), OptProfileSelector::kOPT, Dims4{ calibBatchSize, 1, 28, 28 });
profileCalib->setDimensions(input->getName(), OptProfileSelector::kMAX, Dims4{ calibBatchSize, 1, 28, 28 });
preprocessorConfig->setCalibrationProfile(profileCalib);
//Run engine build with config
auto proprocessEngine=builder->buildEngineWithConfig(*preprocessorNetwork, *preprocessorConfig);


// create the inference engine
IBuilderConfig* config = builder->createBuilderConfig();
assert(config != nullptr);
config->setMaxWorkspaceSize(1 << 22);//4194304
config->setFlag(nvinfer1::BuilderFlag::kFP16);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
assert(engine != nullptr);
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);

//Save .trt
nvinfer1::IHostMemory* datas = engine->serialize();
std::ofstream file;
file.open("E:/tfmodel_speciInput.trt", std::ios::binary | std::ios::out);
std::cout << "writing engine file..." << std::endl;
file.write((const char*)datas->data(), datas->size());
std::cout << "save engine file done" << std::endl;
file.close();

// Destroy the engine
context->destroy();
engine->destroy();
return 0;
}

Saving and loading TensorFlow models

TF1 uses Session/graph mode: the graph is defined first and then executed; it is popular in industry. TF2 uses eager mode like Pytorch, executing code as it is written; in TF2, training/inference with Keras plus tf.function performs poorly.

Saving as a ckpt model

import tensorflow as tf

w1 = tf.Variable(tf.constant(2.0, shape=[1]), name="w1-name")
w2 = tf.Variable(tf.constant(3.0, shape=[1]), name="w2-name")

a = tf.placeholder(dtype=tf.float32, name="a-name")
b = tf.placeholder(dtype=tf.float32, name="b-name")

y = a * w1 + b * w2

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
sess.run(init)
print(a) # Tensor("a-name:0", dtype=float32)
print(b) # Tensor("b-name:0", dtype=float32)
print(y) # Tensor("add:0", dtype=float32)
print(sess.run(y, feed_dict={a: 10, b: 10}))
saver.save(sess, "./model/model.ckpt")

The TensorFlow model is saved into files with the .ckpt suffix. After saving, three files actually appear in the save folder, because TensorFlow stores the computation-graph structure and the parameter values separately.

  • model.ckpt.meta stores the structure of the TensorFlow computation graph, i.e. the network architecture
  • model.ckpt stores the value of every variable in the TensorFlow program
  • checkpoint stores the list of all model files under a directory

Loading the ckpt model

import tensorflow as tf

saver = tf.train.import_meta_graph("./model/model.ckpt.meta")
graph = tf.get_default_graph()

# 通过 Tensor 名获取变量
a = graph.get_tensor_by_name("a-name:0")
b = graph.get_tensor_by_name("b-name:0")
y = graph.get_tensor_by_name("add:0")

with tf.Session() as sess:
saver.restore(sess, "./model/model.ckpt")
print(sess.run(y, feed_dict={a: 10, b: 10}))

Notes on TensorFlow model deployment

TensorRT and Tensorflow use different data layouts: Tensorflow is NHWC (channels last) while TensorRT is NCHW (channels first). An RGB image, for example, is (224, 224, 3) in Tensorflow but (3, 224, 224) in TensorRT. So when using TensorRT, always confirm the image layout (a small HWC-to-CHW conversion sketch follows).
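
Since this layout mismatch comes up repeatedly below, here is a generic sketch (not from the project) of repacking an OpenCV HWC image into a CHW float buffer; it is only needed when the engine actually expects NCHW (the TF-exported engines below stay NHWC):

#include <opencv2/opencv.hpp>
#include <vector>

// Repack an H x W x C cv::Mat (NHWC-style memory) into CHW order for NCHW engines.
std::vector<float> hwcToChw(const cv::Mat& img)
{
    cv::Mat f;
    img.convertTo(f, CV_32F, 1.0 / 255.0);          // scale to [0,1] like the python side
    const int h = f.rows, w = f.cols, c = f.channels();
    std::vector<float> chw(static_cast<size_t>(h) * w * c);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            for (int k = 0; k < c; ++k)
                chw[(static_cast<size_t>(k) * h + y) * w + x] = f.ptr<float>(y)[x * c + k];
    return chw;
}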

TensorFlow models can be saved in three formats: saved_model, checkpoint and graphdef. In the .pb file produced by graphdef, the network weights are all frozen constants.

saved_model–>onnx–>trt

#save in saved_model format; this produces a folder my_model
model.save('saved_model/my_model')
#saved_model-->onnx
python -m tf2onnx.convert --saved-model saved_model/my_model --output saved_model/tfmodel.onnx

In the end, building the trt engine from this onnx in C++ failed because the input is dynamic. saved_model stores the entire training graph and its parameters are not frozen; serving only needs the inference graph, and with unfrozen parameters aggressive optimizations such as converting to TensorRT are not possible. Of course you can also go saved_model -> frozen pb -> saved_model to combine the advantages of both.

Addendum: later I found that running trtexec --onnx=E:/tfmodel.onnx --saveEngine=E:/tfmodel.trt directly on the command line also produced a conversion; I just never tested whether it can actually run inference.

Two possible fixes:

1. Make it dynamic (tried without success; see the TensorRT sample)

2. Make it static, with a hard-coded NCHW shape; this can be done by saving the model as .pb, then converting pb to onnx and onnx to trt

Before that, I tried editing the dimensions of the ONNX input layer, changing the original unk to a fixed 1, but it was still recognized as dynamic.

import onnx
import onnx.checker
import onnx.utils
from onnx.tools import update_model_dims

model = onnx.load('E:/tfmodel.onnx')
# this gives a 'reference' to the dimension, through which the dimension can be modified
dim_proto0 = model.graph.input[0].type.tensor_type.shape.dim[0]
# assign the dimension as a string; it is no longer the value bound to dummy_input
dim_proto0.dim_param = '1'
dim_proto_0 = model.graph.output[0].type.tensor_type.shape.dim[0]
dim_proto_0.dim_param = '1'
onnx.save(model, 'E:/tfmodel_hardInput0.onnx')

But that did not solve the problem; it still errors.

#TensorFlow: save the model as .pb under the target folder
tf.keras.models.save_model(model,"E:/tfmodels/")
#.pb-->onnx
python -m tf2onnx.convert --graphdef tensorflow-model-graphdef-file --output model.onnx --inputs input0:0,input1:0 --outputs output0:0
#--graphdef: the pb model to convert
#--output: name of the converted onnx model
#--inputs: input node names of the pb model
#--outputs: output node names of the pb model

The problem is that converting pb to onnx requires the names of the input and output nodes. So TensorFlow's own tool summarize_graph is needed: it can inspect the network nodes and is very useful when you only have a frozen weight file and do not know the exact network structure. Download and usage notes below:

First install bazel, which works much like make. Then download the TensorFlow source code (git clone recommended). In the source directory, run:

bazel build tensorflow/tools/graph_transforms:summarize_graph

It failed with: ERROR: An error occurred during the fetch of repository 'local_execution_config_python':
According to the suggested fix, this is a TensorFlow build problem; everything mentioned on that page has to be done, including installing the missing MSYS2 (which I lacked). (When setting up MSYS2, remember to use administrator rights.) Also, when running configure.py, ROCm and CUDA must not both be selected, otherwise it errors.

PS D:\code\python\tensorflow\tensorflow> python ./configure.py
You have bazel 6.0.0-pre.20220630.1 installed.
Please specify the location of python. [Default is D:\evn\Python39\python.exe]:


Found possible Python library paths:
D:\evn\Python39\lib\site-packages
Please input the desired Python library path to use. Default is [D:\evn\Python39\lib\site-packages]

Do you wish to build TensorFlow with ROCm support? [y/N]: N
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]: y
TensorRT support will be enabled for TensorFlow.

WARNING: TensorRT support on Windows is experimental

Found CUDA 11.1 in:
C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/lib/x64
C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/include
Found cuDNN 8 in:
C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/lib/x64
C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/include
Found TensorRT 7 in:
C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/lib
C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/include


Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus. Each capability can be specified as "x.y" or "compute_xy" to include both virtual and binary GPU code, or as "sm_xy" to only include the binary code.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]:


Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is /arch:AVX]:


Would you like to override eigen strong inline for some C++ compilation to reduce the compilation time? [Y/n]: Y
Eigen strong inline overridden.

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: N
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=mkl_aarch64 # Build with oneDNN and Compute Library for the Arm Architecture (ACL).
--config=monolithic # Config for mostly static monolithic build.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v1 # Build with TensorFlow 1 API instead of TF 2 API.
Preconfigured Bazel build configs to DISABLE default on features:
--config=nogcp # Disable GCP support.
--config=nonccl # Disable NVIDIA NCCL support.
PS D:\code\python\tensorflow\tensorflow>

Then ran it again, and it still fails:

ERROR: An error occurred during the fetch of repository 'local_config_python':
Traceback (most recent call last):
File "D:/code/python/tensorflow/tensorflow/third_party/py/python_configure.bzl", line 271, column 40, in _python_autoconf_impl
_create_local_python_repository(repository_ctx)
File "D:/code/python/tensorflow/tensorflow/third_party/py/python_configure.bzl", line 213, column 33, in _create_local_python_repository
python_lib = _get_python_lib(repository_ctx, python_bin)
File "D:/code/python/tensorflow/tensorflow/third_party/py/python_configure.bzl", line 130, column 21, in _get_python_lib
result = execute(repository_ctx, [python_bin, "-c", cmd])
File "D:/code/python/tensorflow/tensorflow/third_party/remote_config/common.bzl", line 230, column 13, in execute
fail(
Error in fail: Repository command failed

Saving as pb:
freeze graph (requires output_node_names)

The output_node_names obtained from code are rejected by TensorFlow: Freeze graph: node is not in graph.

I looked at many proposed fixes and specified various output_node_names, but all were reported as not existing in the graph, so the pb file itself has to be inspected to know for sure.

So I inspected the saved_model.pb file under the saved_model folder:

import tensorflow as tf

# model = 'saved_model/dense121/saved_model/saved_model.pb'  # change this model.pb path to your own
# graph = tf.compat.v1.get_default_graph()
# graph_def = graph.as_graph_def()
# graph_def.ParseFromString(tf.compat.v1.gfile.FastGFile(model, 'rb').read())
# tf.graph_util.import_graph_def(graph_def, name='graph')
# summaryWriter = tf.summary.FileWriter('log/', graph)

with tf.compat.v1.Session() as sess:
with open('saved_model/dense121/saved_model/saved_model.pb', 'rb') as f:
graph_def = tf.compat.v1.GraphDef()
graph_def.ParseFromString(f.read())
print (graph_def)

Error:

graph_def.ParseFromString(f.read())
google.protobuf.message.DecodeError: Error parsing message with type 'tensorflow.GraphDef'

A decoding error; some say this pb may be incomplete, so I just... no idea.

ckpt to pb: https://zhuanlan.zhihu.com/p/64099452

Following a blog post from NVIDIA, I successfully exported both the pb and the onnx.

There are many ways to convert a TensorFlow model to an ONNX file; one of them is the approach explained in the ResNet50 section. Keras also has its own Keras-to-ONNX converter. Sometimes TensorFlow-to-ONNX does not support certain layers while the Keras-to-ONNX converter does; depending on the Keras framework and the layer types used, you may have to choose between converters.

model2pb.py

import argparse
import tensorflow as tf

def keras_to_pb(model, output_filename, output_node_names):
"""
This is the function to convert the keras model to pb.
Args:
model: The keras model.
output_filename: The output .pb file name.
output_node_names: The output nodes of the network (if None,
the function gets the last layer name as the output node).
"""
sess = tf.compat.v1.keras.backend.get_session()
graph = sess.graph

with graph.as_default():
# Get names of input and output nodes.
in_name = model.layers[0].get_output_at(0).name.split(':')[0]

if output_node_names is None:
output_node_names = [model.layers[-1].get_output_at(0).name.split(':')[0]]

graph_def = graph.as_graph_def()
frozen_graph_def = tf.compat.v1.graph_util.convert_variables_to_constants(
sess,
graph_def,
output_node_names)

sess.close()
wkdir = ''
tf.compat.v1.train.write_graph(frozen_graph_def, wkdir, output_filename, as_text=False)

return in_name, output_node_names


def main(args):
# Disable eager execution in tensorflow 2 is required.
tf.compat.v1.disable_eager_execution()
# Set learning phase to Test.
tf.compat.v1.keras.backend.set_learning_phase(0)

# load ResNet50 model pre-trained on imagenet
model = tf.keras.applications.ResNet50(
include_top=True, weights='imagenet', input_tensor=None,
input_shape=None, pooling=None, classes=1000
)

# Convert keras ResNet50 model to .pb file
in_tensor_name, out_tensor_names = keras_to_pb(model, args.output_pb_file, None)
print(in_tensor_name)
print(out_tensor_names)
# # You can also use keras2onnx
# onnx_model = keras2onnx.convert_keras(model, model.name, target_opset=11)
# keras2onnx.save_model(onnx_model, "resnet.onnx")


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--output_pb_file', type=str, default='saved_model/dense121/pb_model/resnet50.pb')
args = parser.parse_args()
main(args)
"""
output:
input_1
['predictions/Softmax']
"""

#after getting the pb file, run one of the following commands to get the onnx
#python -m tf2onnx.convert --input saved_model/dense121/pb_model/resnet50.pb --inputs input_1:0 --outputs predictions/Softmax:0 --output saved_model/dense121/onnx_model/resnet50.onnx --opset 11
#python -m tf2onnx.convert --input saved_model/dense121/pb_model/lxb.pb --inputs Placeholder --outputs save/restore_all --output saved_model/dense121/onnx_model/lxb.onnx --opset 11
#python -m tf2onnx.convert --input E:/cifar10.pb --inputs Input:0 --outputs Identity:0 --output E:/cifar10.onnx --opset 11

Getting the input tensor dimensions from an onnx file:

import engine as eng
from onnx import ModelProto
import tensorrt as trt


engine_name = 'semantic.plan'
onnx_path = "semantic.onnx"
batch_size = 1

model = ModelProto()
with open(onnx_path, "rb") as f:
model.ParseFromString(f.read())

d0 = model.graph.input[0].type.tensor_type.shape.dim[1].dim_value
d1 = model.graph.input[0].type.tensor_type.shape.dim[2].dim_value
d2 = model.graph.input[0].type.tensor_type.shape.dim[3].dim_value
shape = [batch_size , d0, d1 ,d2]
engine = eng.build_engine(onnx_path, shape= shape)
eng.save_engine(engine, engine_name)

ONNX to trt

At this point netron shows the onnx model's input tensor as float32[unk__1220,224,224,3], i.e. TF's NHWC layout.

For trtexec usage see: TensorRT - parameter guide for the bundled trtexec tool, the official introduction docs, and test blog posts.

./trtexec --onnx=xxx.onnx --saveEngine=xxx.trt --workspace=1024 --minShapes=inputx:1x3x480x640 --optShapes=inputx:16x3x480x640 --maxShapes=inputx:32x3x480x640 --fp16
trtexec --onnx=dense121_output100.onnx --saveEngine=dense121_output100.trt --workspace=4096 --minShapes=input_1:0:1x224x224x3 --optShapes=input_1:0:1x224x224x3 --maxShapes=input_1:0:32x224x224x3 --fp16
trtexec --onnx=cifar10.onnx --saveEngine=cifar10.trt --workspace=4096 --minShapes=Input:0:1x32x32x3 --optShapes=Input:0:1x32x32x3 --maxShapes=Input:0:50x32x32x3 --fp16

On errors:

[W] Dynamic dimensions required for input: input_1:0, but no shapes were provided. Automatically overriding shape to: 1x224x224x3
#the input node name in the Shapes arguments is wrong: it should be input_1:0, not input_1; keep it identical to the node name shown in netron
[E] [TRT] input_1:0: for dimension number 1 in profile 0 does not match network definition (got min=3, opt=3, max=3), expected min=opt=max=224).
#change the Shapes argument from 1x3x224x224 to 1x224x224x3
ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
#the resize (nearest-ceil mode) operator in the model is not supported
[E] [TRT] C:\source\rtSafe\cuda\cudaConvolutionRunner.cpp (483) - Cudnn Error in nvinfer1::rt::cuda::CudnnConvolutionRunner::executeConv: 2 (CUDNN_STATUS_ALLOC_FAILED)
#the --workspace value is too large; reduce it
  • onnx: the input onnx model
  • saveEngine: where to save the converted tensorrt engine
  • workspace: GPU memory to use, in MB; sometimes it is not enough and has to be increased manually
  • minShapes: the minimum shape for dynamic sizes, given as NCHW together with the input node name
  • optShapes: the shape trtexec uses for the inference test it runs
  • maxShapes: the maximum shape for dynamic sizes; here only the batch is dynamic, all other dims are fixed
  • fp16: run in float16

[Key point] A dynamic-input onnx must have its input shape range specified here; note that it is only a range. After the resulting trt file is deserialized into an engine, the concrete dimensions must still be specified when the engine is invoked, otherwise it reports:

[E] [TRT] Parameter check failed at: engine.cpp::nvinfer1::rt::ShapeMachineContext::resolveSlots::1318, condition: allInputDimensionsSpecified(routine)

Fix:

//inspect the engine's input/output dimensions
for (int i = 0; i < engine->getNbBindings(); i++)
{
nvinfer1::Dims dims = engine->getBindingDimensions(i);
printf("index %d, dims: (",i);
for (int d = 0; d < dims.nbDims; d++)
{
if (d < dims.nbDims - 1)
printf("%d,", dims.d[d]);
else
printf("%d", dims.d[d]);
}
printf(")\n");
}

Taking the DenseNet121 trt file as an example, the code above prints:

index 0, dims: (-1,224,224,3)
index 1, dims: (-1,100)

So the dynamic input dimensions have to be pinned to concrete values. In python, set context.set_binding_shape(0, (BATCH, 3, INPUT_H, INPUT_W)) before invoking the engine; in C++, call the setBindingDimensions(int bindingIndex, Dims dimensions) method on the IExecutionContext instance.

//pin down the dynamic dimensions
nvinfer1::Dims dims4;
dims4.d[0] = 1; // replace dynamic batch size with 1
dims4.d[1] = 224;
dims4.d[2] = 224;
dims4.d[3] = 3;
dims4.nbDims = 4;
context->setBindingDimensions(0, dims4);

Then inference can be run as usual.

The overall approach: given an engine file of unknown dimensions, first read the file and deserialize it to obtain the engine.
Then call getBindingDimensions() to inspect the engine's input/output dimensions (not needed if you already know them).
Before running inference with context->executeV2(), replace every dynamic dimension of -1 with a concrete value via context->setBindingDimensions(); then, once the data is filled into the input buffer, call context->executeV2() to run inference (a hedged end-to-end sketch follows the quoted explanation below):

Why V2, and what is the difference between V1 and V2:

execute/enqueue are for implicit batch networks, and executeV2/enqueueV2 are for explicit batch networks. The V2 versions don’t take a batch_size argument since it’s taken from the explicit batch dimension of the network / or from the optimization profile if used.

In TensorRT 7, the ONNX parser requires that you create an explicit batch network, so you’ll have to use V2 methods.
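
Putting the pieces above together, a hedged end-to-end sketch of explicit-batch inference with executeV2 (TensorRT 7 style). It assumes binding 0 is the NHWC input of the DenseNet engine and binding 1 the 100-class output, with error handling trimmed; not the project's final code:

#include "NvInfer.h"
#include "cuda_runtime_api.h"

// Assumes the engine was deserialized as in read_TRT_File and `input` holds
// 1*224*224*3 floats in NHWC order; `output` receives 100 floats.
void inferExplicitBatch(nvinfer1::ICudaEngine* engine, const float* input, float* output)
{
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // Pin the dynamic batch dimension to 1 before running (binding 0 = input).
    nvinfer1::Dims4 inDims{1, 224, 224, 3};
    context->setBindingDimensions(0, inDims);

    void* buffers[2]{};
    const size_t inBytes  = 1 * 224 * 224 * 3 * sizeof(float);
    const size_t outBytes = 1 * 100 * sizeof(float);
    cudaMalloc(&buffers[0], inBytes);
    cudaMalloc(&buffers[1], outBytes);

    cudaMemcpy(buffers[0], input, inBytes, cudaMemcpyHostToDevice);
    context->executeV2(buffers);                       // explicit batch: no batchSize argument
    cudaMemcpy(output, buffers[1], outBytes, cudaMemcpyDeviceToHost);

    cudaFree(buffers[0]);
    cudaFree(buffers[1]);
    context->destroy();
}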

Taking CIFAR10 as an example: train a model and deploy/test it (C++)

References:

tensorflow2 cifar10 model training demo
Speeding up deep learning inference with TensorFlow, ONNX and NVIDIA TensorRT
[Deep learning] freeze_graph for models in TensorFlow
Methods for saving and loading TensorFlow models
Implementing DenseNet in TF and classifying the CIFAR10 dataset
TensorFlow network model porting & training guide 01
tensorflow model persistence (ckpt to pb)
Converting a Keras-trained h5 file to pb and loading it with Tensorflow
Converting TensorFlow 2.0 model formats to .pb
Inference demo | eight steps to master the tensorRT C++ SDK
[tensorrt] trtexec dynamic batch support and batch inference latency tests
Save, Load and Inference From TensorFlow 2.x Frozen Graph

A successful single-sample test

Saved a pb model with randomly initialized weights, dense121_output100.pb, and converted it to the dense121_output100.trt engine file following the flow above. Python loads the pb and runs inference on a plane image downloaded from the web; code below (reference):

import tensorflow.compat.v1 as tf
from tensorflow.python.platform import gfile
import numpy as np
import cv2
import time

config = tf.ConfigProto()
sess = tf.Session(config=config)
with gfile.FastGFile(r'D:\code\python\pycharmProject\PytorchProj\saved_model\dense121\pb_model\dense121_output100.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')
opname = [tensor.name for tensor in tf.get_default_graph().as_graph_def().node]
print(opname)  # list the node names in the pb
# get the input tensor; if the input name is unknown, look it up by node name --
# it is usually tf.get_default_graph().as_graph_def().node[0].name plus a trailing ":0"
x = tf.get_default_graph().get_tensor_by_name("input_1:0")
print("input:", x)
# get the prediction tensor; usually node[-1].name, but it is not guaranteed to be the last node
pred = tf.get_default_graph().get_tensor_by_name("predictions/Softmax:0")
print(pred)

tx = cv2.imread("E:/plane.jpg")
pre = sess.run(pred, feed_dict={x: tx.reshape(1, 224, 224, 3) / 255})  # run the prediction with the input fed in

print("Prediction: " + str(pre))
print(pre.sum())

#output

['input_1', 'zero_padding2d/Pad/paddings', 'zero_padding2d/Pad', 'conv1/conv/kernel',
。。。
'predictions/MatMul/ReadVariableOp', 'predictions/MatMul', 'predictions/BiasAdd/ReadVariableOp', 'predictions/BiasAdd', 'predictions/Softmax']
input: Tensor("input_1:0", shape=(None, 224, 224, 3), dtype=float32)
Tensor("predictions/Softmax:0", shape=(None, 100), dtype=float32)

Prediction: [[0.00983849 0.01052127 0.00983031 0.01043278 0.00971659 0.01052411
0.00976097 0.01017774 0.01003768 0.01022743 0.01027896 0.00987673
0.01036694 0.00980142 0.01010968 0.01021501 0.00979544 0.00993549
0.00994751 0.01062134 0.0098254 0.01007287 0.0099517 0.01028203
0.00993329 0.01002692 0.01005279 0.01040414 0.00987132 0.00988404
0.01029295 0.01014602 0.00990441 0.00971152 0.00996019 0.00965257
0.01010645 0.00970931 0.00982063 0.00973994 0.01010571 0.00984999
0.00968821 0.01060284 0.00984734 0.01027847 0.00975892 0.00997673
0.00992283 0.00980057 0.01023249 0.00982915 0.01070345 0.00975009
0.00978433 0.01057807 0.0097995 0.00960496 0.01003811 0.0094706
0.00983578 0.00977461 0.01003506 0.00966216 0.01028053 0.01002804
0.01030125 0.01011671 0.00976537 0.0093752 0.00992731 0.00997646
0.01008964 0.00983203 0.00982056 0.01011153 0.01021339 0.01072151
0.00976963 0.01050529 0.01019201 0.01032242 0.01020801 0.00998539
0.00993438 0.00952398 0.00938275 0.00991478 0.01002662 0.01032722
0.01019795 0.00952248 0.00968466 0.0100937 0.00989739 0.009971
0.01018309 0.00970648 0.01000668 0.00979022]]
1.0

About inspecting the node names in a .pb file

Besides the opname = [tensor.name for tensor in tf.get_default_graph().as_graph_def().node] approach above,

you can also dump the graph with tf.train.write_graph(sess.graph_def, './pb_model', 'lxbmodel.pb') (a text file by default) and read the node names from it.

Or:

print(tf.get_default_graph().as_graph_def())
# prints the detailed definition of every node in the graph

In the Python code above the only preprocessing is cv2.imread followed by tx.reshape(1, 224, 224, 3) / 255. The C++ equivalent (cv::Mat to float array, adapted from a reference blog) is:

cv::Mat image = cv::imread("E:/plane.jpg");
cv::Mat img2;
image.convertTo(img2, CV_32F);
img2 = img2 / 255;
std::vector<float> vecHeight;
vecHeight.assign((float*)img2.data, (float*)img2.data + img2.total() * img2.channels());
float* input = new float[vecHeight.size()];
if (!vecHeight.empty()){
memcpy(input, &vecHeight[0], vecHeight.size() * sizeof(float));
}
//input即要传给context的float[]

Printing this array shows it is identical to the Python side, so the input data now matches. With that, running the .pb in Python and the .trt in C++ on plane.jpg produces the same probability vector. The C++ code used is below (watch out for easy-to-miss mistakes, e.g. the INPUT_H/INPUT_W/OUTPUT_SIZE globals must match the model):

#include "NvInfer.h"
#include "nvonnxparser.h"
#include "cuda_runtime_api.h"
#include "logging.h"
#include <fstream>
#include <iostream>   // std::cout / std::cerr
#include <cassert>    // assert
#include <map>
#include <chrono>
#include <algorithm>
#include <Windows.h>
#include <opencv2/opencv.hpp>

#define CHECK(status) \
do\
{\
auto ret = (status);\
if (ret != 0)\
{\
std::cerr << "Cuda failure: " << ret << std::endl;\
abort();\
}\
} while (0)

// stuff we know about the network and the input/output blobs
const int INPUT_C = 3;
static const int INPUT_H = 224;
static const int INPUT_W = 224;
static const int OUTPUT_SIZE = 100;

const char* INPUT_BLOB_NAME = "data";
const char* OUTPUT_BLOB_NAME = "prob";

using namespace nvinfer1;

static Logger gLogger;

void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
void* buffers[2] = { NULL,NULL };

CHECK(cudaMalloc(&buffers[0], batchSize * INPUT_H * INPUT_W * INPUT_C* sizeof(float)));
CHECK(cudaMalloc(&buffers[1], batchSize * OUTPUT_SIZE * sizeof(float)));

cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * INPUT_H * INPUT_W * INPUT_C* sizeof(float), cudaMemcpyHostToDevice, stream));

std::cout << "start to infer ..." << std::endl;
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);

// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[0]));
CHECK(cudaFree(buffers[1]));
std::cout << "Inference Done." << std::endl;
}

bool read_TRT_File(const std::string& engineFile, IHostMemory*& trtModelStream, ICudaEngine*& engine)
{
std::fstream file;
std::cout << "loading filename from:" << engineFile << std::endl;
nvinfer1::IRuntime* trtRuntime;
//nvonnxparser::IPluginFactory* onnxPlugin = createPluginFactory(gLogger.getTRTLogger());
file.open(engineFile, std::ios::binary | std::ios::in);
file.seekg(0, std::ios::end);
int length = file.tellg();
std::cout << "length:" << length << std::endl;
file.seekg(0, std::ios::beg);
std::unique_ptr<char[]> data(new char[length]);
file.read(data.get(), length);
file.close();
std::cout << "load engine done" << std::endl;
std::cout << "deserializing" << std::endl;
trtRuntime = createInferRuntime(gLogger.getTRTLogger());
//ICudaEngine* engine = trtRuntime->deserializeCudaEngine(data.get(), length, onnxPlugin);
engine = trtRuntime->deserializeCudaEngine(data.get(), length, nullptr);
std::cout << "deserialize done" << std::endl;
assert(engine != nullptr);
std::cout << "The engine in TensorRT.cpp is not nullptr" << std::endl;
trtModelStream = engine->serialize();
return true;
}
int main(int argc, char** argv){
IHostMemory* modelStream{ nullptr };
ICudaEngine* engine{ nullptr };
if (read_TRT_File("D:/code/python/pycharmProject/PytorchProj/saved_model/dense121/onnx_model/dense121_output100.trt", modelStream, engine)) std::cout << "tensorRT engine created successfully." << std::endl;
//if (read_TRT_File("E:/SampleONNX-master/mobilenetv2.trt", modelStream, engine)) std::cout << "tensorRT engine created successfully." << std::endl;
else std::cout << "tensorRT engine created failed." << std::endl;
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);
//查看engine的输入输出维度
for (int i = 0; i < engine->getNbBindings(); i++){
nvinfer1::Dims dims = engine->getBindingDimensions(i);
printf("index %d, dims: (");
for (int d = 0; d < dims.nbDims; d++)
{
if (d < dims.nbDims - 1)
printf("%d,", dims.d[d]);
else
printf("%d", dims.d[d]);
}
printf(")\n");
}
//确定动态维度
nvinfer1::Dims dims4;
dims4.d[0] = 1; // replace dynamic batch size with 1
dims4.d[1] = 224;
dims4.d[2] = 224;
dims4.d[3] = 3;
dims4.nbDims = 4;
context->setBindingDimensions(0, dims4);

//get data
float data[512] = { 0 };
float prob[100] = { 0 };
cv::Mat image = cv::imread("E:/plane.jpg");
cv::Mat img2;
image.convertTo(img2, CV_32F);
img2 = img2 / 255;
std::vector<float> vecHeight;
//这里多维的mat文件转一维的float是在图像数据连续的情况下,等价于三层逐层压进去,具体可以看上方参考博客
vecHeight.assign((float*)img2.data, (float*)img2.data + img2.total() * img2.channels());
float* input = new float[vecHeight.size()];
if (!vecHeight.empty())
{
memcpy(input, &vecHeight[0], vecHeight.size() * sizeof(float));
}
for (int i = 0; i < 500; i++) {
std::cout << input[i] << " ";
}

// Run inference

doInference(*context, input, prob, 1);

// Print histogram of the output distribution
std::cout << "Output:\n";
for (unsigned int i = 0; i < 100; i++){
std::cout << prob[i] << ", ";
}
std::cout << std::endl;

// Destroy the engine
context->destroy();
engine->destroy();
return 0;
}

加载cifar10数据集的dataloader,查看它的数据维度:

for (const auto& batch : test_data_loader) {
    torch::Tensor inputs_tensor = batch.data;
    torch::Tensor labels_tensor = batch.target;
    torch::Tensor outputs_tensor;
    float outputs[10];
    auto a_size = inputs_tensor.sizes();
    int num_ = inputs_tensor.numel();
    std::cout << a_size << std::endl << num_ << std::endl << std::endl;
    auto a_size2 = labels_tensor.sizes();
    int num_2 = labels_tensor.numel();
    std::cout << a_size2 << std::endl << num_2 << std::endl << std::endl;
}
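
To feed one of these batches into the doInference() from earlier, the tensor has to be flattened into a contiguous float array first. A minimal sketch, inside the loop above, assuming the tensor layout already matches the engine's input binding:

torch::Tensor contig = inputs_tensor.to(torch::kFloat32).contiguous();
std::vector<float> input(contig.data_ptr<float>(),
                         contig.data_ptr<float>() + contig.numel());
// doInference(*context, input.data(), outputs, inputs_tensor.size(0));   // call sketch only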

模型训练

Following a reference blog, train a model for CIFAR-10 based on the DenseNet network from keras.applications and save it in .h5 format. The code:

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import keras as K
from keras import datasets, layers, models

def preprocess_data(X, Y):
    """pre-processes the data"""
    X_p = K.applications.densenet.preprocess_input(X)
    """one hot encode target values"""
    Y_p = K.utils.to_categorical(Y, 10)
    return X_p, Y_p

"""load dataset"""
(trainX, trainy), (testX, testy) = K.datasets.cifar10.load_data()
x_train, y_train = preprocess_data(trainX, trainy)
x_test, y_test = preprocess_data(testX, testy)

"""USE DenseNet121"""
OldModel = K.applications.DenseNet121(include_top=False, input_tensor=None, weights='imagenet')
for layer in OldModel.layers[:149]:
    layer.trainable = False
for layer in OldModel.layers[149:]:
    layer.trainable = True

model = K.models.Sequential()

"""a lambda layer that scales up the data to the correct size"""
model.add(K.layers.Lambda(lambda x: K.backend.resize_images(x, height_factor=7, width_factor=7, data_format='channels_last')))

model.add(OldModel)
model.add(K.layers.Flatten())
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(256, activation='relu'))
model.add(K.layers.Dropout(0.7))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(128, activation='relu'))
model.add(K.layers.Dropout(0.5))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(64, activation='relu'))
model.add(K.layers.Dropout(0.3))
model.add(K.layers.Dense(10, activation='softmax'))

"""callbacks"""
# cbacks = K.callbacks.CallbackList()
# cbacks.append(K.callbacks.ModelCheckpoint(filepath='cifar10.h5', monitor='val_accuracy', save_best_only=True))
# cbacks.append(K.callbacks.EarlyStopping(monitor='val_accuracy', patience=2))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
"""train"""
model.fit(x=x_train, y=y_train, batch_size=128, epochs=5, validation_data=(x_test, y_test))
model.summary()

model.save('cifar10.h5')

h5模型转pb

import tensorflow as tf
import keras as K
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

def convert_h5to_pb():
    model = tf.keras.models.load_model("E:/cifar10.h5", compile=False)
    model.summary()
    full_model = tf.function(lambda Input: model(Input))
    full_model = full_model.get_concrete_function(tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

    # Get frozen ConcreteFunction
    frozen_func = convert_variables_to_constants_v2(full_model)
    frozen_func.graph.as_graph_def()

    layers = [op.name for op in frozen_func.graph.get_operations()]
    print("-" * 50)
    print("Frozen model layers: ")
    for layer in layers:
        print(layer)

    print("-" * 50)
    print("Frozen model inputs: ")
    print(frozen_func.inputs)
    print("Frozen model outputs: ")
    print(frozen_func.outputs)

    # Save frozen graph from frozen ConcreteFunction to hard drive
    tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                      logdir="E:/",
                      name="cifar10.pb",
                      as_text=False)

convert_h5to_pb()

#output
--------------------------------------------------
Frozen model inputs:
[<tf.Tensor 'Input:0' shape=(None, 32, 32, 3) dtype=float32>]
Frozen model outputs:
[<tf.Tensor 'Identity:0' shape=(None, 10) dtype=float32>]

pb转onnx

python -m tf2onnx.convert  --input E:/cifar10.pb --inputs Input:0 --outputs Identity:0 --output E:/cifar10.onnx --opset 11

python -m tf2onnx.convert --input E:/cifar102.pb --inputs Input:0 --outputs Identity:0 --output E:/cifar102.onnx --opset 11

python -m tf2onnx.convert --input cifar10fix.pb --inputs Input:0 --outputs Identity:0 --output cifar10fix.onnx --opset 11

onnx转trt

trtexec --onnx=cifar10.onnx --saveEngine=cifar10.trt --workspace=4096 --minShapes=Input:0:1x32x32x3 --optShapes=Input:0:1x32x32x3 --maxShapes=Input:0:50x32x32x3 --fp16

trtexec --onnx=afs.onnx --saveEngine=afs.trt --workspace=4096 --minShapes=Input:0:1x5 --optShapes=Input:0:1x5 --maxShapes=Input:0:50x5 --fp16

trtexec --onnx=dense121_6class.onnx --saveEngine=dense121_6class500.trt --workspace=3072 --minShapes=Input:0:1x128x64x1 --optShapes=Input:0:20x128x64x1 --maxShapes=Input:0:400x128x64x1 --fp16

The cifar10.onnx conversion failed; the shell reported:

----------------------------------------------------------------
Input filename: cifar10.onnx
ONNX IR version: 0.0.6
Opset version: 11
Producer name: tf2onnx
Producer version: 1.11.1 1915fb
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[07/28/2022-12:54:39] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
[07/28/2022-12:54:39] [E] Parsing model failed
[07/28/2022-12:54:39] [E] Engine creation failed
[07/28/2022-12:54:39] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec

This is a current TensorRT limitation (onnx-tensorrt issue #974 (comment)): this resize mode is not supported. (NonZero is another op that is not supported in TRT yet.)


The keras.backend.resize_images call used in the code maps to nearest mode + half_pixel + round_prefer_ceil, which is exactly the combination the parser rejects.

An identical issue was reported upstream.

Fix: change the Lambda layer to model.add(K.layers.Lambda(lambda x: tf.image.resize(x, [224, 224]))); tf.image.resize defaults to bilinear interpolation, which the parser accepts.

[07/29/2022-17:25:34] [I] Host Latency
[07/29/2022-17:25:34] [I] min: 1.82153 ms (end to end 2.79663 ms)
[07/29/2022-17:25:34] [I] max: 7.05655 ms (end to end 13.8956 ms)
[07/29/2022-17:25:34] [I] mean: 1.93649 ms (end to end 3.66704 ms)
[07/29/2022-17:25:34] [I] median: 1.90527 ms (end to end 3.60721 ms)
[07/29/2022-17:25:34] [I] percentile: 2.2793 ms at 99% (end to end 4.26883 ms at 99%)
[07/29/2022-17:25:34] [I] throughput: 0 qps
[07/29/2022-17:25:34] [I] walltime: 3.00986 s
[07/29/2022-17:25:34] [I] Enqueue Time
[07/29/2022-17:25:34] [I] min: 0.943115 ms
[07/29/2022-17:25:34] [I] max: 1.9104 ms
[07/29/2022-17:25:34] [I] median: 0.970215 ms
[07/29/2022-17:25:34] [I] GPU Compute
[07/29/2022-17:25:34] [I] min: 1.79199 ms
[07/29/2022-17:25:34] [I] max: 7.01645 ms
[07/29/2022-17:25:34] [I] mean: 1.89984 ms
[07/29/2022-17:25:34] [I] median: 1.86963 ms
[07/29/2022-17:25:34] [I] percentile: 2.24359 ms at 99%
[07/29/2022-17:25:34] [I] total compute time: 2.96756 s
&&&& PASSED TensorRT.trtexec # C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin\trtexec.exe --onnx=cifar10.onnx --saveEngine=cifar10.trt --workspace=4096 --minShapes=Input:0:1x32x32x3 --optShapes=Input:0:1x32x32x3 --maxShapes=Input:0:50x32x32x3 --fp16

现在就得到了trt,可以开始跑测试集了~

可能遇到的问题

Please make sure cudnn_ops_infer64_8.dll is in your library path

Could not load library cudnn_cnn_infer64_8.dll. Error code 1455
Please make sure cudnn_cnn_infer64_8.dll is in your library path!
(or the created context comes back null)
Cause: not enough memory (error code 1455 means the paging file / commit limit was exhausted); restarting VS or the machine frees enough and it works again.

The accuracy obtained was very low, and it was not even stable across runs.
Reading the code showed the problem was in the float[] / vector / tensor conversions, and also in the GPU memory copies.

Right now the output[] to output_tensor conversion is definitely wrong: the values come out different.
Besides that, output[] itself differs from run to run and sometimes contains nan. For example, something like

std::vector<float> outputs_vector(outputs, outputs + sizeof(outputs) / sizeof(float));

was meant to turn outputs[] into a vector, but after the conversion only the first few values matched (likely because sizeof(outputs)/sizeof(float) no longer gives the element count once outputs has decayed to a pointer). So instead of going through a vector, I converted outputs[] directly to a tensor with torch::from_blob.

The nan values came from passing input[] to the context after allocating it but before filling it with data.


happy个锤锤

Now repeated tests give a constant loss and accuracy, but the accuracy is still low. I suspect either a precision issue or a problem with how the arrays are loaded; the next step is to run the same few samples through both the C++ and the Python paths and compare the output vectors. If they differ, it is most likely a precision issue.

AFS模型保存转换记录

python -m tf2onnx.convert  --input lxbtest.pb --inputs Placeholder:0 --outputs save/restore_all:0 --output lxbtest.onnx --opset 11

Converting the pb model Xuebo saved to onnx failed with: ValueError: Input 0 of node save/AssignVariableOp was passed int32 from Variable:0 incompatible with expected resource. Reading his ckpt, converting it to pb, and converting that to onnx failed the same way. It then turned out that even a simple operation such as reading the graph node names from that pb raises the same error, while other, normal .pb files do not.

# Print the names of all nodes in a pb file. Usually opname[0] is the input node and opname[-1] the output node.
import tensorflow.compat.v1 as tf
from tensorflow.python.platform import gfile

config = tf.ConfigProto()
sess = tf.Session(config=config)
with gfile.FastGFile(r'saved_model/dense121/pb_model/dense121_output100.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')
opname = [tensor.name for tensor in tf.get_default_graph().as_graph_def().node]
print(opname)

Cause: not every graph can be frozen. An inference graph can be frozen, but a training graph cannot, because besides reading variables a training graph also assigns to them. The error is saying that a node (likely a variable assignment node) was given a float (the frozen value of the variable) but was expecting a resource (the mutable variable).

HRRP模型转换记录

----------------------------------------------------------------
Input filename: dense121_6class.onnx
ONNX IR version: 0.0.6
Opset version: 11
Producer name: tf2onnx
Producer version: 1.11.1 1915fb
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[08/03/2022-17:53:29] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/03/2022-17:53:30] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[08/03/2022-18:05:17] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[08/03/2022-18:05:17] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[08/03/2022-18:05:17] [I] Engine built in 715.435 sec.

[08/03/2022-18:05:21] [I] Average on 10 runs - GPU latency: 1.61235 ms - Host latency: 1.65127 ms (end to end 3.09309 ms, enqueue 0.889355 ms)
[08/03/2022-18:05:21] [I] Average on 10 runs - GPU latency: 1.72634 ms - Host latency: 1.76223 ms (end to end 3.32732 ms, enqueue 0.8948 ms)
[08/03/2022-18:05:21] [I] Average on 10 runs - GPU latency: 1.61448 ms - Host latency: 1.65164 ms (end to end 3.10913 ms, enqueue 0.896533 ms)
[08/03/2022-18:05:21] [I] Average on 10 runs - GPU latency: 1.71921 ms - Host latency: 1.75725 ms (end to end 3.31399 ms, enqueue 0.932813 ms)
[08/03/2022-18:05:21] [I] Average on 10 runs - GPU latency: 1.61421 ms - Host latency: 1.64998 ms (end to end 3.10967 ms, enqueue 0.88999 ms)
[08/03/2022-18:05:21] [I] Host Latency
[08/03/2022-18:05:21] [I] min: 1.56018 ms (end to end 1.66077 ms)
[08/03/2022-18:05:21] [I] max: 2.76453 ms (end to end 4.55258 ms)
[08/03/2022-18:05:21] [I] mean: 1.72398 ms (end to end 3.229 ms)
[08/03/2022-18:05:21] [I] median: 1.65479 ms (end to end 3.11627 ms)
[08/03/2022-18:05:21] [I] percentile: 2.35864 ms at 99% (end to end 4.04886 ms at 99%)
[08/03/2022-18:05:21] [I] throughput: 0 qps
[08/03/2022-18:05:21] [I] walltime: 3.00615 s
[08/03/2022-18:05:21] [I] Enqueue Time
[08/03/2022-18:05:21] [I] min: 0.859131 ms
[08/03/2022-18:05:21] [I] max: 2.17993 ms
[08/03/2022-18:05:21] [I] median: 0.897461 ms
[08/03/2022-18:05:21] [I] GPU Compute
[08/03/2022-18:05:21] [I] min: 1.52576 ms
[08/03/2022-18:05:21] [I] max: 2.72894 ms
[08/03/2022-18:05:21] [I] mean: 1.68527 ms
[08/03/2022-18:05:21] [I] median: 1.61768 ms
[08/03/2022-18:05:21] [I] percentile: 2.32031 ms at 99%
[08/03/2022-18:05:21] [I] total compute time: 2.96102 s
&&&& PASSED TensorRT.trtexec # C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin\trtexec.exe --onnx=dense121_6class.onnx --saveEngine=dense121_6class.trt --workspace=4096 --minShapes=Input:0:1x128x64x1 --optShapes=Input:0:1x128x64x1 --maxShapes=Input:0:100x128x64x1 --fp16

拿到的hdf5模型,按之前的步骤转到trt很顺利,就是onnx to trt时间比较长。

Pytorch模型的转换部署

128reduce模型是AlexNet_128,256reduce模型是IncrementalModel(256)
128模型是AlexNet_128,256模型是AlexNet_256

首先是对pt文件到onnx的转换。

import torch

torch_model = torch.load("save.pt")      # load the pytorch model (the model class must be importable)
batch_size = 1                           # batch size
input_shape = (3, 244, 244)              # input shape; adjust to the model's real input

# set the model to inference mode
torch_model.eval()

x = torch.randn(batch_size, *input_shape)      # dummy input tensor
export_onnx_file = "test.onnx"                 # target ONNX file name
torch.onnx.export(torch_model,
                  x,
                  export_onnx_file,
                  opset_version=10,
                  do_constant_folding=True,    # fold constants
                  input_names=["input"],       # input name
                  output_names=["output"],     # output name
                  dynamic_axes={"input": {0: "batch_size"},   # dynamic batch dimension
                                "output": {0: "batch_size"}})

onnx到trt:

Note: the input name given when exporting the ONNX above (input_names=["input"]) must match the name used in the shape flags of the trtexec command below.

trtexec.exe --explicitBatch --workspace=3072 --minShapes=input:1x1x128x1 --optShapes=input:20x1x128x1 --maxShapes=input:512x1x128x1 --onnx=increment_6_128_save_reduce.onnx --saveEngine=temp.trt --fp16

QT中配置并运行

For the dynamic libraries it is not enough to add the CUDA and TensorRT lib directories after LIBS += \; the required libraries themselves must be linked explicitly as well, otherwise TensorRT inference fails with all kinds of LNK unresolved external symbol errors.


The errors above were resolved by manually adding the -lcudart and -lnvinfer libraries.

关于Python

Embedding Python

起步代码博客

环境上 只是在编译命令添加了链接库:g++ test2.cpp -o test2 -ID:/evn/Python39/include -LD:/evn/Python39/libs -lpython39

注意一些函数在不同的python版本中也不同,比如PyObject_CallObject的用法之类的。

有参数传递的调用:

// test2.cpp
#include<Python.h>
#include <iostream>
using namespace std;

int main()
{
Py_Initialize(); //1、初始化python接口
//初始化使用的变量
PyObject* pModule = NULL;
PyObject* pFunc = NULL;
PyObject* pName = NULL;
//2、初始化python系统文件路径,保证可以访问到 .py文件
PyRun_SimpleString("import sys");
PyRun_SimpleString("sys.path.append('./')");
//3、调用python文件名。当前的测试python文件名是 myadd.py
// 在使用这个函数的时候,只需要写文件的名称就可以了。不用写后缀。
pModule = PyImport_ImportModule("myadd");
//4、调用函数
pFunc = PyObject_GetAttrString(pModule, "AdditionFc");
//5、给python传参数
// 函数调用的参数传递均是以元组的形式打包的,2表示参数个数
// 如果AdditionFc中只有一个参数时,写1就可以了
PyObject* pArgs = PyTuple_New(2);
// 0:第一个参数,传入 int 类型的值 2
PyTuple_SetItem(pArgs, 0, Py_BuildValue("i", 2));
// 1:第二个参数,传入 int 类型的值 4
PyTuple_SetItem(pArgs, 1, Py_BuildValue("i", 4));
// 6、使用C++的python接口调用该函数
PyObject* pReturn = PyEval_CallObject(pFunc, pArgs);
// 7、接收python计算好的返回值
int nResult;
// i表示转换成int型变量。
// 在这里,最需要注意的是:PyArg_Parse的最后一个参数,必须加上“&”符号
PyArg_Parse(pReturn, "i", &nResult);
cout << "return result is " << nResult << endl;
//8、结束python接口初始化
Py_Finalize();
}
# myadd.py
def AdditionFc(a, b):
    print("Now is in python module")
    print("{} + {} = {}".format(a, b, a + b))
    return a + b

Even though PyRun_SimpleString("import sys"); PyRun_SimpleString("sys.path.append('./')"); is used, a .py file with the same name in the working directory is found first, so make sure the names do not clash.
Another cause of import failures found since: the .py file imports a package that is missing from the interpreter being embedded; it may be installed in a virtual environment but not in the global one.
With that sorted out, the next step is calling, from C++, a .py file that runs inference through onnxruntime. The code below works as expected.

// test.cpp
#include<Python.h>
#include <iostream>
using namespace std;

int main(){
Py_Initialize();

PyObject* pModule = NULL;
PyObject* pFunc = NULL;
PyObject* pName = NULL;

PyRun_SimpleString("import sys");
PyRun_SimpleString("sys.path.append('D:/code/python/pycharmProject/PytorchProj/')");
//PyRun_SimpleString("sys.path.append('')");

pModule = PyImport_ImportModule("test3");
if( pModule == NULL ){
cout <<"pModule not found" << endl;
return 1;
}
pFunc = PyObject_GetAttrString(pModule, "doinfer");
PyObject* pArgs = PyTuple_New(2);
char datapath[]="E:/207Project/Data/HRRP/Ball_bottom_cone/00.txt";
char modelpath[]="E:/tfmodels/model.onnx";
PyTuple_SetItem(pArgs, 0, Py_BuildValue("s", datapath));
PyTuple_SetItem(pArgs, 1, Py_BuildValue("s", modelpath));
PyObject* pReturn = PyObject_CallObject(pFunc, pArgs);

int nResult;
// i表示转换成int型变量。
// PyArg_Parse的最后一个参数,必须加上“&”符号
PyArg_Parse(pReturn, "i", &nResult);
cout << "return result is " << nResult << endl;

Py_Finalize();
}
# test3.py
import numpy as np
import onnxruntime as rt

def getTensorFromTXT(filePath):   # only for the old data format: values separated by spaces
    file_data = []
    t = 0
    with open(filePath, 'r') as f:
        for line in f.readlines():
            if t < 2:
                t += 1
                continue
            a = line[line.rfind(" ") + 1:-1]
            if a != "":
                file_data.append(float(a))
    # outtensor = torch.tensor(file_data)
    outtensor = np.array(file_data, np.float32)
    outtensor = (outtensor - min(outtensor)) / (max(outtensor) - min(outtensor))
    outtensor = outtensor.reshape([1, 512])
    return outtensor

def doinfer(a, b):
    inputfile_path = str(a)
    modelfile_path = str(b)
    input = getTensorFromTXT(inputfile_path)
    input = input.reshape([1, 1, 512])
    sess = rt.InferenceSession(modelfile_path)
    input_name = sess.get_inputs()[0].name
    label_name = sess.get_outputs()[0].name
    pred_onx = sess.run([label_name], {input_name: input.astype(np.float32)})[0]
    print(pred_onx)
    print(np.argmax(pred_onx))
    return np.argmax(pred_onx)

if __name__ == "__main__":   # guard the test call so it does not also run when imported from C++
    doinfer("E:/207Project/Data/HRRP/Cone/00.txt", "E:/tfmodels/model.onnx")

模型转换的脚本编写

参考:

python学习——python中执行shell命令
Python 命令行参数的3种传入方式

常用API

tensor转vector

std::vector<float> output(output_tensor.data_ptr<float>(),output_tensor.data_ptr<float>()+output_tensor.numel());

vector与数组相互转

https://blog.csdn.net/Sagittarius_Warrior/article/details/54089242

vector转数组

float *buffer = new float[vecHeight.size()];
if (!vecHeight.empty()){
memcpy(buffer, &vecHeight[0], vecHeight.size()*sizeof(float));
}

1. As a dynamic array, std::vector pre-allocates a block of memory; when that runs out it allocates a larger block and automatically copies the existing elements over.

   So, for efficiency, if the number of elements is known in advance, call resize()/reserve() up front to allocate enough memory and avoid the later copies.

2. If the elements are characters, prefer std::string over std::vector<char>.

CV.Mat转Vector

https://stackoverflow.com/questions/26681713/convert-mat-to-array-vector-in-opencv
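
The assign() pattern used earlier for plane.jpg works as a general recipe; a minimal sketch for a continuous CV_32F Mat (img stands for any cv::Mat already loaded):

cv::Mat m32;
img.convertTo(m32, CV_32F);                 // convertTo produces a continuous matrix
std::vector<float> v;
v.assign((float*)m32.data, (float*)m32.data + m32.total() * m32.channels());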

找到tensor转float数组的方法了!!!

LibTorch使用 accessor 快速访问 tensor

/**
* x 的类型为 CPUFloatType { 100, 100 }
* x_data.size(0) = 100
* x_data.size(1) = 100
**/
auto x_data = x.accessor<float, 2>();
/* 访问单个元素 */
float v = x_data[50][50];
/* x_data.data() 是数据首地址 */
float array[100][100];
memcpy(array, x_data.data(), 100*100*sizeof(float));

float数组转tensor

Use torch::from_blob (it does not copy or take ownership of the data, so clone() the result if the buffer may go away); a sketch:
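
For example, wrapping the TensorRT output array from earlier (prob is a plain float buffer; the shape {1, 100} is just for illustration):

float prob[100];                       // e.g. filled by doInference above
torch::Tensor output_tensor =
    torch::from_blob(prob, {1, 100}, torch::kFloat32).clone();   // clone() so the tensor owns a copy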

一些Cmake文件

经典文件(3070)

cmake_minimum_required(VERSION 3.1 FATAL_ERROR)
project(cmakeProj)
set(CMAKE_CXX_STANDARD 11)

#option(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
#set(CMAKE_CXX_FLAGS "-fsanitize=undefined -fsanitize=address")
set(CUDA_NVRTC_SHORTHASH "XXXXXXXX") #resolve Failed to compute shorthash for libnvrtc.so
#find_package(CUDA REQUIRED)
# cuda
include_directories("C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/include")
link_directories("C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/lib")

# TensorRT
include_directories("D:/evn/TensorRT-7.2.3.4.Windows10.x86_64.cuda-11.1.cudnn8.1/TensorRT-7.2.3.4/include")
link_directories("D:/evn/TensorRT-7.2.3.4.Windows10.x86_64.cuda-11.1.cudnn8.1/TensorRT-7.2.3.4/lib")

#OpenCV
set(CMAKE_PREFIX_PATH "D:/evn/OpenCV/opencv-3.4.13-exe/opencv/build")
#set(OpenCV_DIR /home/User/opencv/build/)
find_package(OpenCV REQUIRED)
include_directories(${OPENCV_INCLUDE_DIRS} )

#libtorch
set(Torch_DIR "D:/evn/libtorch-1.8.2+cu111/libtorch")
find_package(Torch REQUIRED)

#matlab
include_directories("D:/softs/MATLAB/R2022a/extern/include")
link_directories("D:/softs/MATLAB/R2022a/extern/lib/win64/microsoft")

link_directories("E:/207Project/GUI207_V2.0/lib/TRANSFER")

#Python
include_directories("D:/evn/Python39/include")
link_directories("D:/evn/Python39/libs")

add_executable(cmakeProj main.cpp ToHRRP.h)

target_compile_features(cmakeProj PUBLIC cxx_range_for)
target_link_libraries(cmakeProj ${OpenCV_LIBS} ${TORCH_LIBRARIES} nvinfer libmat libmx libmex libeng mclmcr mclmcrrt ToHRRP python39)

实时监测分类

关于Socket:

代码参考:socket编程TCP/IP通信(windows下,C++实现)

环境配置参考:vs C++实现Socket通信、添加ws2_32.lib 静态链接库 / 用VScode 在Windows下写简单的socket通讯

理论参考:c++ 实时通信系统(基础知识TCP/IP篇) / 计算机网络——网络字节序(大端字节序(Big Endian)\小端字节序(Little Endian))

关于多线程:

C++Thread

QThread

//下面两种链接方式都可以捏
connect(inferThread, &InferThread::sigInferResult,this,&MonitorPage::showInferResult);
connect(inferThread, SIGNAL(sigInferResult(QString)),this,SLOT(showInferResult(QString)));
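
For context, a minimal sketch of what such a worker thread can look like (the class and signal names follow the connect() calls above; the inference body itself is omitted):

#include <QThread>

class InferThread : public QThread {
    Q_OBJECT
signals:
    void sigInferResult(QString result);          // received by MonitorPage::showInferResult
protected:
    void run() override {
        while (!isInterruptionRequested()) {
            QString result = "";                  // placeholder: run one inference and format the result
            emit sigInferResult(result);          // cross-thread signal, delivered via a queued connection
        }
    }
};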

多线程之间传信号

【Qt】 Qt中实时更新UI程序示例

BUG:

Printing to the terminal repeatedly from the worker thread leads to a failure with no error message (the thread seems to block?).

Calling matOpen repeatedly in the thread fails on the 508th call, possibly because the MATFile handles are never released with matClose and the process runs out of handles.
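
If unreleased handles are indeed the cause, pairing every matOpen with matClose should avoid it; a hedged sketch (the path and the variable name are placeholders):

#include "mat.h"

MATFile* pmat = matOpen(path, "r");          // path: the .mat file to read
if (pmat != nullptr) {
    mxArray* arr = matGetVariable(pmat, "varName");   // variable name is a placeholder
    // ... use arr ...
    if (arr) mxDestroyArray(arr);
    matClose(pmat);                          // release the handle each iteration
}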

关于C++调用python

调用python函数,C传数据矩阵给python,Python绘制。参考博客:c++调用python脚本,指针快速传递


模型训练

不能调用GPU训练的问题历程:

At first, in a CUDA 11.1 environment (cudnn version forgotten) with conda TensorFlow 2.3.0, importing TensorFlow and calling tf.config.list_physical_devices('GPU') returned an empty device list. A 2.2.0 environment was then created; it complained about various missing cuXX.dll files, which were copied into the release directory from downloads, but in the end one dll was still reported missing even though it was there.

So the root problem was addressed instead: the version compatibility between CUDA, cuDNN and TensorFlow. See the reference blog on quickly setting up a tensorflow-gpu environment (installing CUDA through conda).

It turns out the CUDA environment can be isolated inside conda; python 3.8 + cuda 11.0 + tensorflow 2.4.0 then worked (conda install cudnn does not work, but cuDNN does not need to be installed separately anyway).