World-scale vision-language-action (VLA) models are the new frontier in AI for autonomous driving and robotics, enabling systems to perceive, reason, and act in complex real-world environments. Using the open-source Pi-0.5 model as an exemplar, we will review the structure of VLA models and see how they combine advanced techniques from both language and vision transformers into a unified architecture. We will then examine the unique challenges VLAs pose for embedded deployment, including quantization complexity and a mix of compute-dominated and bandwidth-bottlenecked workloads that must be handled efficiently. Finally, we will explore practical strategies for mapping VLA workloads onto embedded compute engines and present results from porting and optimizing Pi-0.5 on Quadric’s Chimera GPNPU, showing how Chimera’s combination of software control and determinism makes it uniquely well suited to optimized VLA inference at the edge.
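To make the compute-bound versus bandwidth-bound distinction concrete, the sketch below estimates arithmetic intensity (FLOPs per byte of memory traffic) for two representative workload shapes. The specific dimensions are illustrative assumptions, not Pi-0.5 measurements: a batched matmul stands in for a vision-encoder layer, and a matrix-vector product stands in for one step of autoregressive decoding.

```python
# Illustrative sketch (assumed shapes, not Pi-0.5 measurements): why VLA
# inference mixes compute-bound and bandwidth-bound phases. Arithmetic
# intensity (FLOPs per byte moved) indicates which resource limits throughput.

def arithmetic_intensity(flops: int, bytes_moved: int) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# Vision-encoder-style layer: a large batched matmul over image tokens.
# C[M,N] = A[M,K] @ B[K,N], fp16 weights/activations (2 bytes per element).
M, K, N = 1024, 4096, 4096
flops_mm = 2 * M * K * N                      # multiply-accumulate = 2 FLOPs
traffic_mm = 2 * (M * K + K * N + M * N)      # read A and B, write C
vision_ai = arithmetic_intensity(flops_mm, traffic_mm)

# Autoregressive decode step: matrix-vector product per generated token.
# y[N] = W[N,K] @ x[K] -- the entire weight matrix is read to produce one vector.
flops_mv = 2 * K * N
traffic_mv = 2 * (K * N + K + N)
decode_ai = arithmetic_intensity(flops_mv, traffic_mv)

print(f"vision-encoder matmul: ~{vision_ai:.0f} FLOPs/byte (compute-dominated)")
print(f"decode matvec:         ~{decode_ai:.2f} FLOPs/byte (bandwidth-bottlenecked)")
```

With these shapes the matmul performs hundreds of FLOPs per byte moved, while the decode matvec performs roughly one, which is why the two phases stress an accelerator's MAC array and its memory system very differently.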

