Date: Thursday, September 17, 2020
Start Time: 1:00 pm
End Time: 1:30 pm
In this talk, we present techniques for obtaining the best inference performance when deploying machine learning applications in the cloud. With the increasing use of AI in applications ranging from image classification and object detection to natural language processing, it is vital to deploy AI applications in ways that are scalable and efficient. Much work has focused on distributing DNN training for parallel execution using machine learning frameworks (TensorFlow, MXNet, PyTorch, and others). Less work has addressed scaling and deploying trained models on multi-processor systems. We present a case study of scaling an image classification application in the cloud using multiple Kubernetes pods. We explore the factors and bottlenecks that affect performance and examine techniques for building a scalable application pipeline.