Physical AI is caught between two computational titans: autoregressive (AR) transformers, which predict discrete action tokens, and diffusion transformers (DiTs), which iteratively refine continuous motion trajectories. Both architectures run into a “memory wall,” but for different reasons: AR inference is bottlenecked by a KV cache that grows linearly with the action-token history, while DiT inference must re-run the full network on every denoising step. This talk analyzes how next-generation neural processing units (NPUs), including Google Coral and VeriSilicon Vivante IP, are being re-engineered to handle both. We explore specialized caching schemes, such as selective KV reuse, and hardware-aware FlashAttention variants that minimize memory-bandwidth pressure. By optimizing silicon for both discrete “thinking” and continuous “acting,” these NPU strategies let battery-powered robots sustain real-time response across the entire physical AI life cycle.
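
The two bandwidth regimes contrasted above can be made concrete with back-of-the-envelope arithmetic: an AR policy's KV cache grows linearly with the token history, while a DiT policy's footprint stays flat but pays full-network traffic on every denoising step. The sketch below illustrates this; all model dimensions (layer count, head count, step count, bytes per pass) are illustrative assumptions, not figures from the talk.

```python
# Illustrative comparison of the AR and DiT "memory wall" regimes.
# All dimensions below are assumed for the sake of the example.

def kv_cache_bytes(seq_len, n_layers=24, n_heads=16, head_dim=64,
                   bytes_per_elem=2):
    """AR regime: the KV cache holds 2 tensors (K and V) per layer,
    each of shape (seq_len, n_heads, head_dim), so memory grows
    linearly with the token history."""
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_elem

def dit_activation_traffic(n_steps, bytes_per_pass):
    """DiT regime: the full network is re-evaluated once per denoising
    step, so total memory traffic scales with the step count even
    though the resident footprint is constant."""
    return n_steps * bytes_per_pass

if __name__ == "__main__":
    # AR: cache grows with the action-token history.
    for t in (256, 1024, 4096):
        print(f"AR KV cache @ {t:4d} tokens: "
              f"{kv_cache_bytes(t) / 2**20:6.1f} MiB")
    # DiT: flat footprint, but bandwidth is paid on every step.
    print(f"DiT traffic, 50 steps @ 64 MiB/pass: "
          f"{dit_activation_traffic(50, 64 * 2**20) / 2**30:.1f} GiB")
```

This is why the optimizations differ: selective KV reuse attacks the AR cache's linear growth, while denoising-loop optimizations attack the per-step traffic term.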

