Convert an ONNX BERT model to TensorRT

Hemanth Sharma
4 min read · Jul 20, 2020

Prerequisites:

This tutorial assumes the following are already installed:
1. A CUDA version supported by TensorRT (CUDA 10.2 or CUDA 11)
2. A supported cuDNN version
3. NVIDIA drivers

1. Installation:

Once CUDA is installed successfully, install TensorRT by following the official guide:
https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html

To give a gist of the installation: TensorRT can be installed in a few ways, and I found the tar file installation to be the easiest. Installation through the Debian packages works too, but it's a bit of a hassle.

It's not necessary to install the TensorRT OSS components (https://github.com/NVIDIA/TensorRT).

There are a few libraries that convert models directly to TensorRT:
https://github.com/NVIDIA-AI-IOT/torch2trt
https://github.com/onnx/onnx-tensorrt

But neither of them worked for me, which is why I wrote this tutorial.

This tutorial covers converting an ONNX model to TensorRT. If you want to know how to convert a PyTorch model to ONNX, you can follow my short tutorial on that.
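For reference, the export boils down to a call to torch.onnx.export. A minimal sketch, assuming a fine-tuned BERT classifier named model (not defined here), a fixed sequence length of 30, and the input names used later in this post:

import torch

seq_len = 30
# dummy inputs matching BERT's forward signature: (input_ids, attention_mask, token_type_ids)
dummy = (
    torch.zeros(1, seq_len, dtype=torch.long),
    torch.zeros(1, seq_len, dtype=torch.long),
    torch.zeros(1, seq_len, dtype=torch.long),
)
torch.onnx.export(
    model,                          # assumed: your fine-tuned BERT classifier in eval mode
    dummy,
    "checkpoints/model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["logits"],
    opset_version=11,
)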

2. Convert the ONNX model to a simplified ONNX model

Check out the steps to simplify the ONNX model here: https://github.com/daquexian/onnx-simplifier
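As a quick sketch (assuming the onnxsim Python API and the file names used in this post), the simplification step looks like this:

import onnx
from onnxsim import simplify

# load the exported ONNX model, simplify it, and save the result
model = onnx.load("checkpoints/model.onnx")
model_simp, check = simplify(model)
assert check, "simplified ONNX model could not be validated"
onnx.save(model_simp, "checkpoints/simplified_model.onnx")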

3. Convert the ONNX model to a TensorRT engine

import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda

# module-level logger, reused later when deserializing the engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(model_file, max_ws=512 * 1024 * 1024, fp16=False):
    print("building engine")
    builder = trt.Builder(TRT_LOGGER)
    config = builder.create_builder_config()
    config.max_workspace_size = max_ws
    if fp16:
        config.flags |= 1 << int(trt.BuilderFlag.FP16)

    # the ONNX parser requires an explicit-batch network
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(explicit_batch)
    with trt.OnnxParser(network, TRT_LOGGER) as parser:
        with open(model_file, 'rb') as model:
            parsed = parser.parse(model.read())
            if not parsed:
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
            print("network.num_layers", network.num_layers)
            # last_layer = network.get_layer(network.num_layers - 1)
            # network.mark_output(last_layer.get_output(0))
            engine = builder.build_engine(network, config=config)
    return engine

engine = build_engine("checkpoints/simplified_model.onnx")

Save the engine after building it, because building the engine takes time:

with open('engine.trt', 'wb') as f:
    f.write(bytearray(engine.serialize()))

4. Run Inference:

Load the engine

runtime = trt.Runtime(TRT_LOGGER)
with open('./engine.trt', 'rb') as f:
    engine_bytes = f.read()
engine = runtime.deserialize_cuda_engine(engine_bytes)
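Optionally, sanity-check the engine's bindings; their order and dtypes are what the input arrays and the bindings list below have to match:

for i in range(engine.num_bindings):
    print(
        i,
        engine.get_binding_name(i),
        engine.get_binding_shape(i),
        engine.get_binding_dtype(i),
        "input" if engine.binding_is_input(i) else "output",
    )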

Create an execution context as shown below:

bert_context = engine.create_execution_context()

Define the inputs as numpy arrays (input_ids, token_type_ids and attention_mask) for BERT.

Define an output variable, also a numpy array, with shape batch X num_of_classes. If your model is not BERT, define a zeros array with the same shape as your model's output.

import numpy as np

'''
inputs
'''
# shape: (batch, seq_len), e.g. (1, 30); the dtype must match the engine's
# input bindings (int32 for this BERT engine). In practice these come from
# your tokenizer; the zeros/ones here are just placeholders.
input_ids = np.zeros((1, 30), dtype=np.int32)
token_type_ids = np.zeros((1, 30), dtype=np.int32)
attention_mask = np.ones((1, 30), dtype=np.int32)
'''
outputs
'''
# shape: (batch, num_of_classes)
bert_output = np.zeros((1, num_of_classes), dtype=np.float32)
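In practice, the three input arrays come from a tokenizer rather than placeholders. A minimal sketch, assuming the Hugging Face transformers BertTokenizer and the same max sequence length of 30 (the model name here is just an example):

import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
enc = tokenizer(
    "a sample sentence to classify",
    padding="max_length",
    max_length=30,
    truncation=True,
    return_tensors="np",
)
# cast to the dtype expected by the engine's input bindings
input_ids = enc["input_ids"].astype(np.int32)
token_type_ids = enc["token_type_ids"].astype(np.int32)
attention_mask = enc["attention_mask"].astype(np.int32)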

Allocate memory for the inputs and outputs on the GPU:

'''
memory allocation for inputs
'''
# the arrays already include the batch dimension, so .nbytes is the full size
d_input_ids = cuda.mem_alloc(input_ids.nbytes)
d_token_type_ids = cuda.mem_alloc(token_type_ids.nbytes)
d_attention_mask = cuda.mem_alloc(attention_mask.nbytes)
'''
memory allocation for outputs
'''
d_output = cuda.mem_alloc(bert_output.nbytes)

Create the bindings array. The order of the entries must match the engine's binding order (which the sanity check above prints).

bindings = [int(d_input_ids), int(d_token_type_ids), int(d_attention_mask), int(d_output)]

Create a stream and transfer the inputs to the GPU (this can be sync or async); async is shown here.

stream = cuda.Stream()

# Transfer input data from host (Python) buffers to the device (GPU)
cuda.memcpy_htod_async(d_input_ids, input_ids, stream)
cuda.memcpy_htod_async(d_token_type_ids, token_type_ids, stream)
cuda.memcpy_htod_async(d_attention_mask, attention_mask, stream)

Execute using the engine. Since the network was built with the explicit-batch flag, execute_async_v2 is the matching call (the batch size is already encoded in the input shapes):

bert_context.execute_async_v2(bindings, stream.handle)

Transfer the output back from the GPU to the Python buffer variable:

cuda.memcpy_dtoh_async(bert_output, d_output, stream)
stream.synchronize()

Now the bert_output variable, which we initialized with zeros, holds the model's raw output (logits).

Run softmax and get the most probable class:

import torch
import torch.nn as nn

pred = torch.tensor(bert_output)
pred_output_softmax = nn.Softmax(dim=1)(pred)
_, predicted = torch.max(pred_output_softmax, 1)
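Pulling the steps above together, running inference on a new batch just repeats the copy → execute → copy cycle on the same engine and pre-allocated buffers. A sketch of a reusable helper, built only from the calls shown above:

def infer(input_ids, token_type_ids, attention_mask):
    """Run one batch through the TensorRT engine and return the raw logits."""
    cuda.memcpy_htod_async(d_input_ids, input_ids, stream)
    cuda.memcpy_htod_async(d_token_type_ids, token_type_ids, stream)
    cuda.memcpy_htod_async(d_attention_mask, attention_mask, stream)
    bert_context.execute_async_v2(bindings, stream.handle)
    cuda.memcpy_dtoh_async(bert_output, d_output, stream)
    stream.synchronize()
    return bert_output.copy()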
