When people think of neural networks, the first thing that comes to mind is probably a deep learning framework like TensorFlow or PyTorch. The creation of deep learning frameworks was crucial to the adoption of deep learning in the products we use every day. Instead of having to write your own CUDA kernels to leverage the parallelization power of a GPU, you can easily structure components together into a graph and manage the training of that graph all in Python, letting the frameworks handle all the hard parts. But this is only half of the solution to making deep learning a viable tool to do useful work.
Once the weights have been finalized and the training completed, you are left with something that, while it may be good at its particular task, is far less efficient than it could be. Many people deploying neural network models take the extra step of optimizing the model first, so as to get the maximum throughput through the network. There are a couple of projects that look to be the TensorFlow or PyTorch of this deployment phase (also known as the inference phase). Those include Facebook's Glow compiler, DLVM, ONNC, nGraph, TVM and XLA. NVIDIA creates one specifically for optimizing networks on their GPUs for inference, called TensorRT. If you want to know more about the optimizations that TensorRT does, take a look at these blog posts: https://devblogs.nvidia.com/tag/tensorrt/.
In January of 2017, when I joined NVIDIA for an 8 month co-op, TensorRT 2.1 had just come out. At the time, a user who wanted to optimize a network had a couple of options, all of which required setting up a large C++ infrastructure around the model ingest. The only supported “parser” was for Caffe models, and everyone else had to manually extract weights and read them into a network definition API. Here was a minimal example from around then:
The user would start by creating an ingest system that would take a Caffe model, parse it, then create an engine.
TensorRT Engine Builder
Then, using the engine, they would set up an inference pipeline that would manage transferring data to the GPU and results back.
TensorRT Engine Executor
For people who had the resources to develop this sort of infrastructure it was entirely worth it to get the performance benefits, but to say the least it was inaccessible for prototyping and light applications.
Enter the TensorRT Python API
For actual deployments C++ is fine, if not preferable to Python, especially in the embedded settings I was working in. However, there is still quite a bit of development work to be done between having a trained model and putting it out in the world. One example is quantization.
A Quick Primer on Quantization
Typically (at least in 2017), neural networks are trained at FP32 precision. This is mostly due to the hardware available at that time specializing in FP32 math. But this precision has a lot more granularity than is necessary, and ultimately you can get significant performance improvements by lowering the precision (to FP16 or INT8, for example). For INT8 in particular, you need to go through a process called quantization that maps the range of weights onto an 8-bit space.
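To make the idea of mapping a range onto an 8-bit space concrete, here is a minimal sketch of symmetric, max-based quantization in plain Python. It is only an illustration of the arithmetic, not the calibration scheme TensorRT uses; the function names and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to signed 8-bit integers."""
    # The largest magnitude sets the scale so that it maps to +/-127.
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 values back to approximate floats."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -1.27, 0.64, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round trip is lossy: each value is off by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Every value ends up as one of 255 integer steps, which is where the speedup comes from and also where precision is lost. A single large outlier stretches the scale and costs resolution everywhere else, which is why smarter range selection matters in practice.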
In TensorRT there are APIs that help do this quantization for you in a way that hopefully minimizes the precision lost by using this less granular representation. This and other advanced features are usually what need to be experimented with before deploying a model, but having to set up quantization infrastructure in C++ just for experiments is a lot of work.
Wrapping a C++ Library
So instead of having to rewrite a library in Python, there are APIs and tools you can use to wrap an existing library and expose a Python interface. This is in fact the approach that libraries like PyTorch and TensorFlow use: a C++ core with a Python frontend. There are a couple of tools that people use to automate the process of wrapping a library. One is SWIG, which is able to auto-generate an interface based on a header file and an interface file; another is PyBind11, a newer library that takes more work to define an interface but is lighter weight. For ease of prototyping, and given the sheer amount of code to wrap, I chose to use SWIG in my initial versions of the Python API for TensorRT, but in later versions this was ported to PyBind11.
TensorRT also requires directly interfacing with the CUDA Device API to transfer data to a GPU and manage that memory through inference. There are a few Python libraries that provide this capability. The one used officially with the TensorRT API is PyCUDA, but effort was put in to make sure other libraries such as CuPy (or even PyTorch) also work.
If you are going through the trouble to make a nice Python API, you should also try to abstract out a lot of the boilerplate that comes with a library targeted at C++, which I did in the utils sub package, so as to maintain as much similarity between the C++ and Python APIs in the main package but allow for higher level features elsewhere.
At this point I was able to do a lot of the basic work you’d want to do with TensorRT in Python:
TensorRT Engine Builder in Python
TensorRT Engine Executor in Python
Leveraging the Python Ecosystem
There are some great things about Python that make life a lot easier. Data manipulation with NumPy, for one, is a massive benefit, and so is being able to directly interface with other deep learning frameworks and tools. So one of the first things I added as soon as the library was wrapped was NumPy compatibility. This means that you can use NumPy arrays not only for your data, but also to transfer your weights around. This allows people using libraries like PyTorch (note: this was before ONNX came out) to extract their weights into NumPy arrays and then load them into TensorRT, all in Python.
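As a rough sketch of the NumPy side of that hand-off, the snippet below normalizes a set of named weight arrays into float32, C-contiguous buffers. The layer names, shapes and the `prepare_weights` helper are hypothetical, and the assumption that the consuming library wants float32 contiguous memory is mine; no TensorRT calls are shown.

```python
import numpy as np

# Hypothetical stand-in for weights exported from a training framework.
# With PyTorch, each array would typically come from calling
# tensor.detach().cpu().numpy() on an entry of model.state_dict().
exported = {
    "conv1.weight": np.random.rand(4, 1, 3, 3),  # float64 by default
    "conv1.bias": np.zeros(4, dtype=np.float32),
}


def prepare_weights(named_arrays):
    """Return float32, C-contiguous arrays keyed by layer name."""
    return {
        name: np.ascontiguousarray(array, dtype=np.float32)
        for name, array in named_arrays.items()
    }


weights = prepare_weights(exported)
for name, array in weights.items():
    print(name, array.shape, array.dtype)
```

Keeping everything in a plain dictionary of arrays means the same weights can be inspected, saved with NumPy, or handed to whichever network definition API consumes them.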
Importing a PyTorch Model Manually
This was a new capability introduced by the Python API because of Python and NumPy.
We can also use NumPy and other tools like SciPy to do some of the data preprocessing required for inference and the quantization pipeline.
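A minimal sketch of that kind of preprocessing, assuming a model that expects channel-first float32 input with a per-channel mean subtracted (a common convention for Caffe-style models, but check what your network was trained with). The `preprocess` helper, the image and the mean values are all hypothetical.

```python
import numpy as np


def preprocess(image_hwc, mean, scale=1.0):
    """Convert an HWC uint8 image to a CHW float32 array."""
    # Work in float32 so the subtraction cannot wrap around as uint8 would.
    image = image_hwc.astype(np.float32)
    # Subtract the per-channel mean (broadcast over height and width).
    image = (image - np.asarray(mean, dtype=np.float32)) * scale
    # Reorder height, width, channel -> channel, height, width.
    return np.ascontiguousarray(image.transpose(2, 0, 1))


# Hypothetical 2x2 three-channel image and mean values, for illustration only.
image = np.full((2, 2, 3), 128, dtype=np.uint8)
batch = preprocess(image, mean=[104.0, 117.0, 123.0])[np.newaxis, ...]
print(batch.shape, batch.dtype)
```

The same function can be reused to prepare calibration batches for quantization, so experiments and inference see identically processed data.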
Quantization with TensorRT Python
This blog post describes using the Python API to do the majority of the work for INT8 quantization and deploying on an embedded platform:
Higher level APIs
Using Python allows for a lot of abstracted work. One of the things introduced in the Python API is the Lite API, essentially a one-liner to create an engine and run it.
And that defines a full pipeline with pre/post processing, allowing for easy integration with other apps. The Lite API also supports a variety of batching formats automatically, so it's pretty easy to just throw data at it and get results out.
Take a look at the examples in the TensorRT distribution for demonstrations of this.
Enabling other applications
Since it’s now easy to integrate TensorRT, it’s pretty straightforward to include optimized deep learning models in your projects. This enables a lot of cool new applications in spaces such as smart cities, robotics and web applications.