The next generation of AI agents is moving beyond cloud-based, text-only models toward real-time interaction with the physical, multimodal world. In the vision domain, these agents rely on vision-language models (VLMs) as their backbones. However, deploying VLMs with billions of parameters on embedded devices remains a significant engineering hurdle. Drawing on our recent ICML and CVPR papers, we will explore advances in VLM optimization, specifically how distillation and pruning transform “heavyweight” models into lean, edge-ready engines. In particular, we will examine recent feature-alignment and mixture-of-experts methods for distillation, as well as training-free token pruning, offering practical insights for building computationally efficient VLMs for embedded systems.
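To give a flavor of the training-free side, here is a minimal sketch of attention-based visual token pruning: visual tokens that receive little attention from the [CLS] token are dropped at inference time, with no retraining. This is an illustrative toy (the function name, score source, and keep ratio are assumptions, not any specific paper's method); published approaches differ in how importance scores are computed and whether dropped tokens are discarded or merged.

```python
import numpy as np

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the visual tokens with the highest [CLS]-attention scores.

    tokens   : (N, D) array of visual token embeddings.
    cls_attn : (N,)   attention weights from the [CLS] token to each token.
    Returns the kept tokens and their (sorted) original indices.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(cls_attn)[-n_keep:]  # indices of top-scoring tokens
    keep_idx.sort()                            # preserve original token order
    return tokens[keep_idx], keep_idx

# Toy example: 8 visual tokens of dimension 4, random attention scores.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
cls_attn = rng.random(8)
pruned, idx = prune_tokens(tokens, cls_attn, keep_ratio=0.5)
```

Because the scoring reuses attention weights the model already computes, the pruning step adds only a top-k selection per layer, which is why such methods can shrink the token sequence on embedded hardware without any fine-tuning.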
