Vision-focused edge AI, where CNNs ruled, was compute-bound. Today’s vision-language models (VLMs) have flipped the script: with as many as 70B parameters and 200K+ token contexts, they are heavily memory-bound, shifting the bottleneck from compute to memory. KV caches overwhelm on-chip SRAM, data movement chokes memory bandwidth, and traditional NPUs fall short. In this talk, we dissect the architectural differences between CNNs and VLMs and the memory bottlenecks that hinder edge deployment. We then address solutions such as advanced cache hierarchies (rolling state, tiered SRAM→DRAM) and disaggregated memory (NPU-to-system-RAM offload) that make production VLMs possible. Finally, we show how Expedera’s scalable silicon IP neural engine applies these techniques to deliver efficient VLM execution at the edge.
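To make the "KV caches overwhelm SRAM" claim concrete, here is a minimal back-of-the-envelope sketch. The model geometry (80 layers, grouped-query attention with 8 KV heads, head dimension 128, FP16) and the 32 MiB SRAM budget are illustrative assumptions for a 70B-class model, not figures from the talk or from Expedera.

```python
# Back-of-the-envelope KV-cache sizing (illustrative assumptions, not Expedera data).
# Assumed 70B-class geometry: 80 layers, grouped-query attention with 8 KV heads,
# head dimension 128, FP16 activations (2 bytes per element).

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each layer stores a K and a V tensor of shape [n_kv_heads, head_dim] per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

ctx = 200_000                 # the 200K-token context cited in the abstract
cache = kv_cache_bytes(ctx)
sram = 32 * 2**20             # assumed 32 MiB of on-chip SRAM for an edge NPU

print(f"KV cache at {ctx:,} tokens: {cache / 2**30:.1f} GiB")  # ~61 GiB
print(f"Fits in on-chip SRAM? {cache <= sram}")                # False, off by ~2000x
```

Even with grouped-query attention shrinking the cache by 8x, the working set is measured in tens of gigabytes, which is why tiered SRAM→DRAM hierarchies and NPU-to-system-RAM offload become necessary rather than optional.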

