Unimodal vision systems are powerful, but they fall short when users need flexible queries, richer semantics, or reasoning that combines images or video with language. Vision-language models (VLMs) address this gap by jointly training visual and text encoders to align their representations, enabling tasks such as retrieval, captioning, document understanding, and robotics planning. In this talk, we will introduce the core building blocks of VLMs (vision encoders, language encoders, and embedding alignment) through an end-to-end architecture view and practical examples. We will survey VLM applications, then take a close look at CLIP as a case study, analyzing each stage of the model and showing how design choices affect overall performance. Attendees will leave with a clear mental model of how VLMs work, plus actionable guidance for building efficient VLM pipelines and evaluating trade-offs between compute cost and accuracy.
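
As a concrete illustration of the embedding-alignment idea at the heart of CLIP, the minimal PyTorch sketch below uses random features and illustrative dimensions as stand-ins for real encoder outputs: it projects image and text features into a shared space, L2-normalizes them, and computes the symmetric contrastive loss over temperature-scaled cosine similarities. All names and sizes here are assumptions for illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for real encoder outputs; in CLIP these come from a
# ViT/ResNet image tower and a Transformer text tower.
batch, img_dim, txt_dim, embed_dim = 8, 512, 384, 256
image_features = torch.randn(batch, img_dim)   # vision-encoder output
text_features = torch.randn(batch, txt_dim)    # language-encoder output

# Linear projections map both modalities into one shared embedding space.
img_proj = torch.nn.Linear(img_dim, embed_dim, bias=False)
txt_proj = torch.nn.Linear(txt_dim, embed_dim, bias=False)

# Project, then L2-normalize so dot products become cosine similarities.
img_emb = F.normalize(img_proj(image_features), dim=-1)
txt_emb = F.normalize(txt_proj(text_features), dim=-1)

# Temperature-scaled similarity matrix: entry (i, j) compares image i with text j.
logit_scale = torch.tensor(1.0 / 0.07)  # CLIP learns this scalar; 1/0.07 is its init
logits = logit_scale * img_emb @ txt_emb.t()

# Symmetric cross-entropy: matching image/text pairs sit on the diagonal.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```

At inference time the same similarity matrix drives zero-shot classification and retrieval: encode the query once and rank candidates by cosine similarity in the shared space.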

