Unimodal vision systems are powerful, but they fall short when users need flexible queries, richer semantics, or reasoning that combines images or video with language. Vision-language models (VLMs) address this gap by jointly training visual and text encoders to align their representations, enabling tasks such as retrieval, captioning, document understanding, and robotics planning. In this talk, we will introduce the core building blocks of VLMs (vision encoders, language encoders, and embedding alignment) through an end-to-end architecture view and practical examples. We will survey VLM applications, then take a close look at CLIP as a case study, analyzing each stage of the model and showing how design choices affect overall performance. Attendees will leave with a clear mental model of how VLMs work, plus actionable guidance for building efficient VLM pipelines and evaluating trade-offs between compute cost and accuracy.
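
As a concrete illustration of the embedding-alignment idea at the heart of CLIP, the minimal PyTorch sketch below uses random features and illustrative dimensions as stand-ins for real encoder outputs: it projects image and text features into a shared space, L2-normalizes them, and computes the symmetric contrastive loss over temperature-scaled cosine similarities. All names and sizes here are assumptions for illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for real encoder outputs; in CLIP these come from a
# ViT/ResNet image tower and a Transformer text tower.
batch, img_dim, txt_dim, embed_dim = 8, 512, 384, 256
image_features = torch.randn(batch, img_dim)   # vision-encoder output
text_features = torch.randn(batch, txt_dim)    # language-encoder output

# Linear projections map both modalities into one shared embedding space.
img_proj = torch.nn.Linear(img_dim, embed_dim, bias=False)
txt_proj = torch.nn.Linear(txt_dim, embed_dim, bias=False)

# Project, then L2-normalize so dot products become cosine similarities.
img_emb = F.normalize(img_proj(image_features), dim=-1)
txt_emb = F.normalize(txt_proj(text_features), dim=-1)

# Temperature-scaled similarity matrix: entry (i, j) compares image i with text j.
logit_scale = torch.tensor(1.0 / 0.07)  # CLIP learns this scalar; 1/0.07 is its init
logits = logit_scale * img_emb @ txt_emb.t()

# Symmetric cross-entropy: matching image/text pairs sit on the diagonal.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```

At inference time the same similarity matrix drives zero-shot classification and retrieval: encode the query once and rank candidates by cosine similarity in the shared space.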

