Date: Tuesday, September 22, 2020
Start Time: 10:00 am
End Time: 10:30 am
The use of low-precision arithmetic (8-bit and smaller data types) is key to deploying deep neural network inference with high performance, low cost, and low power consumption. Shifting to low-precision arithmetic requires a model quantization step that can be performed at model training time (quantization-aware training) or after training (post-training quantization). Post-training quantization is an easy way to quantize already-trained models and provides a good accuracy/performance trade-off. In this talk, we review recent advances in post-training quantization methods and algorithms that help reduce quantization error. We also show the performance speed-up that can be achieved for various models when using 8-bit quantization.
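To make the idea concrete, below is a minimal sketch of the affine (scale and zero-point) quantization scheme commonly used in post-training 8-bit quantization. It is an illustrative example, not the specific method presented in the talk; the function names and the choice of per-tensor min/max calibration are assumptions for the sketch.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine post-training quantization of a float tensor to int8.

    Returns the quantized tensor plus the (scale, zero_point) needed to
    dequantize: x ~= scale * (q - zero_point).

    Note: this uses simple per-tensor min/max calibration, one of several
    possible calibration strategies (an assumption for this sketch).
    """
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    # Ensure the representable range includes 0 so real zero maps exactly
    # to an integer value (important for zero-padding, ReLU, etc.).
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 values back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

# Example: measure the quantization error introduced on random activations.
x = np.random.randn(1000).astype(np.float32)
q, scale, zp = quantize_int8(x)
x_hat = dequantize(q, scale, zp)
print("max abs quantization error:", np.abs(x - x_hat).max())
```

The gap between `x` and `x_hat` is the quantization error that the methods discussed in the talk aim to reduce, for example through better calibration of the min/max range.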