Optimizing the T5 Model for Fast Inference
Model deployment is one of the most important aspects of building a machine learning product. Every product starts with an idea whose feasibility we first check; once we are confident it is feasible, we gather data and start developing a model. During development we keep an eye not only on the model's accuracy but also on its size and inference time, because all of these factors matter for deployment. Even so, we sometimes end up with a model that is accurate but slow at inference, which hurts both the cost and the user experience of the product. Such models need to be optimized so that inference time is as low as possible. In this blog, we will explore how to optimize a T5 model for fast inference on a GPU.

Motivation Behind T5 Optimization

We were using the T5 model in our product PrepAI, a question generation platform where users can upload different kinds of documents and videos, or copy and paste text, and the platform automatically generates different kinds of questions from the provided content. Hence we decided to work on optimizing the T5 model. In this tutorial, we will assume the T5 model is trained for translation and will compare performance on translation tasks only.

Optimization Using TensorRT

Whenever a developer thinks of optimizing a model for an Nvidia GPU, TensorRT is the obvious first choice, because it transforms the model's graph into a form that benefits from the architecture of the GPU. It achieves this by replacing certain operations in the graph with equivalent operations that have less computational overhead. It also fuses certain operations together into one, further reducing overhead; this process is known as graph fusion. We decided to try TensorRT on our T5 model. For this we used the code in the official Nvidia repo, as we simply wanted to see how much performance that implementation could deliver. We found that the model was around 3-4x faster for shorter sequences (fewer than about 100 tokens), but beyond that it started to slow down, and at 512-token sequences it became slower than the original PyTorch model. According to some developers, TensorRT had a bug at the time; it was fixed later, by which point we had already moved on to the next phase of optimization involving ONNX. We also found that the implementation excluded past key-value caching, which is essential for speedups on longer sequences. This may change in the future, and TensorRT might eventually beat other implementations, but at the time of writing, that caching had not been implemented there yet.

Optimization Using ONNX Runtime

After trying out TensorRT, we decided to optimize the model with ONNX Runtime. Converting any model to ONNX and applying a little optimization automatically speeds the model up by a small amount. While optimizing the model, ONNX Runtime performs basic passes such as removing unused nodes and converting variables to constants. For models like BERT, BART, GPT-2, and RoBERTa, ONNX Runtime also implements graph fusion, which fuses the graphs of these models in a way similar to TensorRT. Unfortunately, such fusion is not yet available for the T5 model.
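As a rough sketch of what this conversion step can look like, the snippet below exports the T5 encoder to ONNX and loads it with ONNX Runtime's basic graph optimizations enabled. The file name t5_encoder.onnx, the wrapper class, and the chosen opset are illustrative assumptions, not details from our actual pipeline.

```python
import torch
import onnxruntime as ort
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()
tokenizer = T5Tokenizer.from_pretrained("t5-base")


class EncoderWrapper(torch.nn.Module):
    """Thin wrapper so the exported graph returns a plain tensor."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state


inputs = tokenizer(
    "translate English to German: Hello, world!", return_tensors="pt"
)

# Export the encoder with dynamic batch and sequence dimensions.
torch.onnx.export(
    EncoderWrapper(model.encoder),
    (inputs["input_ids"], inputs["attention_mask"]),
    "t5_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["hidden_states"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "hidden_states": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)

# Basic graph optimizations cover the passes mentioned above, such as
# constant folding and removal of unused nodes.
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
session = ort.InferenceSession(
    "t5_encoder.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```

The decoder can be exported in the same way; only the encoder is shown here to keep the sketch short.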
Since graph fusion was off the table, the obvious next step was converting the model to float16.

Converting the Encoder to float16

The T5 model is an encoder-decoder model, so we optimized the encoder first and the decoder next. For this we used the ONNX Runtime transformer optimization package. We first converted all the nodes of the ONNX encoder graph to float16 and evaluated the speed and accuracy of the model. We observed that converting every node destabilizes the encoder, so it produces only NaN values. The reason is that the encoder contains several operations that do not behave well in float16; T5 was never designed to be fully compatible with float16, only with bfloat16 and float32. So we identified the unstable nodes and kept them in float32 (see the conversion sketch further below). Doing that kept the model stable while still improving its speed.

Converting the Decoder to float16

The next part is optimizing the decoder. Optimizing the decoder matters more than optimizing the encoder, because during generation the encoder runs only once while the decoder runs n times, where n is the length of the target sequence. As with the encoder, we started by converting all the nodes in the decoder to float16 and then evaluated the impact on accuracy and speed. We noticed that the model was not much faster for shorter sequences and actually became slower for longer ones. Digging into ONNX Runtime and some open-source libraries, we found the cause of this slowdown: data movement between GPU memory and RAM. The decoder accepts the decoder_input_ids and the encoder_outputs to generate the next token. Each time a prediction is made, the inputs are transferred from RAM to GPU memory, and after the computation the output, which is in ORT format, has to be converted to NumPy, which can only be done after moving the data back to the CPU. These tensors can be large, and moving them back and forth consumes a lot of time, so the model slows down as the length of the target output increases.
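Here is a minimal sketch of the mixed-precision conversion described above for the encoder and decoder graphs. The post uses the ONNX Runtime transformer optimization package; this sketch instead uses the float16 helper from onnxconverter-common, a closely related utility, so treat it as an approximation rather than our exact code. The entries in op_block_list are placeholders: the actual unstable ops/nodes have to be found empirically by checking where NaNs first appear.

```python
import onnx
from onnxconverter_common import float16

# Load the exported encoder graph (file name assumed from the earlier sketch).
encoder_model = onnx.load("t5_encoder.onnx")

# Convert weights and activations to float16 while keeping the graph
# inputs/outputs in float32 and leaving numerically sensitive ops untouched.
# The op_block_list below is only an example, not the definitive list for T5.
encoder_fp16 = float16.convert_float_to_float16(
    encoder_model,
    keep_io_types=True,
    op_block_list=["Pow", "ReduceMean", "Sqrt", "Softmax"],
)

onnx.save(encoder_fp16, "t5_encoder_fp16.onnx")
```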
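To make the data-movement issue concrete, the loop below runs bare-bones greedy decoding against an ONNX Runtime decoder session. The decoder file name and its input/output names ("input_ids", "encoder_hidden_states", "logits") are assumptions for illustration; the point is that every call to session.run() copies the inputs from RAM to GPU memory and copies the logits back to the CPU as NumPy arrays, which is exactly the per-token overhead described above.

```python
import numpy as np
import onnxruntime as ort

# Assumed: a decoder exported without past key-value caching, with inputs
# "input_ids" / "encoder_hidden_states" and output "logits".
decoder_session = ort.InferenceSession(
    "t5_decoder.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)


def greedy_decode(encoder_hidden_states, start_token_id, eos_token_id, max_length=128):
    decoder_input_ids = np.array([[start_token_id]], dtype=np.int64)
    for _ in range(max_length):
        # Inputs are copied from RAM to GPU memory on every step, and the
        # logits are copied back to the CPU so they can be read as NumPy.
        logits = decoder_session.run(
            ["logits"],
            {
                "input_ids": decoder_input_ids,
                "encoder_hidden_states": encoder_hidden_states,
            },
        )[0]
        next_token = int(logits[0, -1].argmax())
        decoder_input_ids = np.concatenate(
            [decoder_input_ids, np.array([[next_token]], dtype=np.int64)], axis=1
        )
        if next_token == eos_token_id:
            break
    return decoder_input_ids
```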