Date: Wednesday, May 22
Start Time: 9:40 am
End Time: 10:40 am
The field of computer vision is undergoing another profound change. Recently, “generalist” models have emerged that can solve a variety of visual perception tasks. Also known as foundation models, they are trained on huge internet-scale unlabeled or weakly labeled data and can adapt to new tasks with no additional supervision or with just a small number of manually labeled samples. Moreover, some are multimodal: they understand both language and images and can support other perceptual modes as well. In our 2024 Keynote, Professor Yong Jae Lee from the University of Wisconsin-Madison will present recent groundbreaking research on creating intelligent systems that learn to understand our multimodal world with minimal human supervision. He will focus on systems that understand images and text, and will also touch upon those that use video, audio, and LiDAR. Since training foundation models from scratch can be prohibitively expensive, Yong Jae will discuss how to efficiently repurpose existing foundation models for application-specific tasks. He will also explain how these models can be used for image generation and, in turn, for detecting AI-generated images. He will conclude by highlighting key remaining challenges and promising research directions. Join us to learn how emerging techniques will address today’s neural network training bottlenecks, facilitate new types of multimodal machine perception, and enable countless new applications.