Vision-focused edge AI, where CNNs ruled, was compute-bound. Today’s vision-language models (VLMs) have flipped the script: with as many as 70B parameters and 200K+ token contexts, they are heavily memory-bound, shifting the bottleneck from compute to memory. KV caches overwhelm on-chip SRAM, data movement chokes memory bandwidth, and traditional NPUs fall short. In this talk, we dissect the architectural differences between CNNs and VLMs and the memory bottlenecks that hinder edge deployment. We then address solutions such as advanced cache hierarchies (rolling state, tiered SRAM→DRAM) and disaggregated memory (NPU-to-system-RAM offload) that make production VLMs possible. Finally, we show how Expedera’s scalable silicon IP neural engine applies these techniques to deliver efficient VLM execution at the edge.
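To make the "KV caches overwhelm SRAM" claim concrete, here is a minimal back-of-the-envelope sketch. The model geometry (80 layers, grouped-query attention with 8 KV heads, head dimension 128, FP16) and the 32 MiB SRAM budget are illustrative assumptions for a 70B-class model, not figures from the talk or from Expedera.

```python
# Back-of-the-envelope KV-cache sizing (illustrative assumptions, not Expedera data).
# Assumed 70B-class geometry: 80 layers, grouped-query attention with 8 KV heads,
# head dimension 128, FP16 activations (2 bytes per element).

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each layer stores a K and a V tensor of shape [n_kv_heads, head_dim] per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

ctx = 200_000                 # the 200K-token context cited in the abstract
cache = kv_cache_bytes(ctx)
sram = 32 * 2**20             # assumed 32 MiB of on-chip SRAM for an edge NPU

print(f"KV cache at {ctx:,} tokens: {cache / 2**30:.1f} GiB")  # ~61 GiB
print(f"Fits in on-chip SRAM? {cache <= sram}")                # False, off by ~2000x
```

Even with grouped-query attention shrinking the cache by 8x, the working set is measured in tens of gigabytes, which is why tiered SRAM→DRAM hierarchies and NPU-to-system-RAM offload become necessary rather than optional.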

