In this presentation, we will provide an overview of workflows for deploying compressed deep learning models, starting from PyTorch and ending with native C++ application code running in real time on embedded hardware platforms. We illustrate these workflows with real-world examples on smartphones, targeting ARM-based CPUs, GPUs, and NPUs, as well as on embedded chips and modules such as the NXP i.MX8+ and the NVIDIA Jetson Nano. We examine TorchScript, architecture-side optimizations, quantization, and common pitfalls. Additionally, we show how the PyTorch deployment workflow can be extended by converting models to ONNX and quantizing the resulting ONNX models with ONNX Runtime. On the application side, we demonstrate how deployed models can be integrated efficiently into a C++ library that runs natively on mobile and embedded devices, and we highlight known limitations.
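To make the TorchScript step concrete, the following is a minimal sketch of exporting an eager-mode model to a TorchScript archive that a C++ application can later load. The model choice, input shape, and file name are placeholders for illustration, not taken from the presentation:

```python
import torch
import torchvision

# Any eager-mode model works here; a torchvision classifier is a stand-in.
model = torchvision.models.mobilenet_v2(weights=None)
model.eval()

# Tracing records the operations executed on an example input and produces a
# self-contained TorchScript module; torch.jit.script is the alternative when
# the model contains data-dependent control flow that tracing would miss.
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# The saved archive bundles code and weights, so it can be loaded from C++
# (via torch::jit::load) without any Python dependency on the device.
traced.save("mobilenet_v2_traced.pt")
```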
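For the quantization step, a sketch of post-training dynamic quantization in PyTorch is shown below, using a small placeholder network; this is one of several quantization modes (static quantization with a calibration pass is the usual next step when activation cost dominates):

```python
import torch
import torch.nn as nn

# A small stand-in network; in practice this would be the model to deploy.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Post-training dynamic quantization: weights of the listed module types are
# converted to int8, while activations are quantized on the fly at inference
# time. Recent PyTorch versions also expose this under torch.ao.quantization.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# A common pitfall is skipping validation: always compare outputs of the
# float and quantized models on representative inputs.
x = torch.randn(1, 128)
print((model(x) - quantized(x)).abs().max())
```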
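The ONNX branch of the workflow can be sketched in a similar way: export the PyTorch model with torch.onnx.export, then quantize the resulting file with ONNX Runtime's quantization tooling. File names, tensor names, and the opset version below are illustrative choices, and the consuming runtime must agree on them:

```python
import torch
import torchvision
from onnxruntime.quantization import quantize_dynamic, QuantType

model = torchvision.models.mobilenet_v2(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

# Export the eager model to ONNX by tracing it on a dummy input.
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
    opset_version=13,
)

# Dynamic (weight-only) quantization of the ONNX file via ONNX Runtime;
# static quantization with a calibration data reader is the alternative
# when activations should be quantized as well.
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)
```

The quantized model can then be executed through an ONNX Runtime inference session from the native C++ application.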