Successful deployment of AI on edge platforms requires system-level strategies that account for latency bounds, memory behavior and the orchestration of heterogeneous hardware cores. In this talk, we present techniques for running deep learning pipelines on edge computing platforms, emphasizing the architectural decisions that determine real-time behavior and scalability in production systems. We compare monolithic models with decomposed, multi-stage pipelines, analyzing how partitioning computation across CPUs, GPUs, NPUs and DSPs affects synchronization, memory traffic and worst-case latency. We also explore how SoC topology and memory hierarchy constrain deployment options, and how these constraints influence model structure, compiler behavior and runtime scheduling. Finally, we share concrete deployment patterns for hardware-aware optimization, multicore scheduling and cross-platform execution, prioritizing deterministic behavior over peak throughput.
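
For concreteness, the sketch below illustrates the decomposed, multi-stage pattern referred to above, assuming a simple preprocess → inference → postprocess split. The stage bodies and queue depths are illustrative placeholders standing in for work that a real deployment would dispatch to a DSP, NPU or CPU core; they are not the production scheduling mechanism discussed in the talk.

```python
# Illustrative sketch only: a three-stage pipeline where each stage runs on its
# own worker and passes results downstream through a bounded queue. The bounded
# queues model the synchronization and back-pressure that decomposition
# introduces, which is what makes worst-case latency easier to reason about.
import queue
import threading
import time

def preprocess(frame):
    # Placeholder for e.g. resize/normalize work offloaded to a DSP.
    return frame

def infer(tensor):
    # Placeholder for model execution on an NPU or GPU.
    time.sleep(0.005)  # stand-in for inference latency
    return tensor

def postprocess(result):
    # Placeholder for CPU-side decoding of model outputs.
    return result

def stage(worker, inbox, outbox):
    while True:
        item = inbox.get()
        if item is None:           # sentinel: propagate shutdown downstream
            if outbox is not None:
                outbox.put(None)
            break
        out = worker(item)
        if outbox is not None:
            outbox.put(out)        # blocks when the next stage falls behind

# Bounded queues keep buffering (and therefore memory traffic) fixed and expose
# stalls as back-pressure instead of unbounded growth.
q_pre, q_inf, q_post = (queue.Queue(maxsize=2) for _ in range(3))

workers = [
    threading.Thread(target=stage, args=(preprocess, q_pre, q_inf)),
    threading.Thread(target=stage, args=(infer, q_inf, q_post)),
    threading.Thread(target=stage, args=(postprocess, q_post, None)),
]
for w in workers:
    w.start()

for frame_id in range(10):         # feed a few frames, then shut down
    q_pre.put(frame_id)
q_pre.put(None)
for w in workers:
    w.join()
```

The design choice the sketch highlights is deliberate: small, fixed queue capacities trade peak throughput for bounded buffering and predictable stall behavior, which mirrors the talk's emphasis on deterministic behavior over raw throughput.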

