Date: Wednesday, May 21
Start Time: 2:40 pm
End Time: 3:10 pm
In this presentation, we provide an overview of vision-language models (VLMs) and their deployment on edge devices, using Hugging Face’s recently released SmolVLM as an example. We will examine the training process of VLMs, including the data preparation, alignment techniques, and optimization methods needed to embed visual understanding capabilities within resource-constrained environments. We will also cover practical evaluation approaches, emphasizing how to benchmark these models beyond accuracy metrics to ensure real-world viability. To illustrate how these concepts play out in practice, we will share data from recent work deploying SmolVLM on an edge device.
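As a point of reference for the deployment discussion, the sketch below shows a minimal SmolVLM inference loop using the standard Hugging Face transformers API and the public HuggingFaceTB/SmolVLM-Instruct checkpoint; the image path and prompt are placeholders, and the actual edge setup described in the talk may differ (e.g., quantized weights or an on-device runtime).

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"

# Load the processor (handles image preprocessing and chat templating)
# and the model with reduced-precision weights to shrink the memory
# footprint, which matters on resource-constrained devices.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Placeholder input image; replace with a real file on disk.
image = Image.open("example.jpg")

# Build a single-turn chat prompt that interleaves one image with text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize text and preprocess the image, then generate a response.
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```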