Date: Thursday, May 23
Start Time: 2:05 pm
End Time: 2:35 pm
In this talk, we will explore the use of large multimodal models (LMMs) in real-world edge applications. We will begin by explaining how LMMs work and highlighting their key components, with special attention to how they fuse understanding across the vision and language domains. Next, we'll discuss the process of training LMMs and the types of data needed to tune them for specific tasks. Finally, we'll outline some of the key challenges of deploying LMMs on resource-constrained edge devices and share techniques for overcoming them.