Date: Tuesday, May 23
Start Time: 9:30 am
End Time: 10:40 am
First-person or “egocentric” perception requires understanding the video and multimodal data that stream from wearable cameras and other sensors. The egocentric view offers a special window into the camera wearer’s attention, goals, and interactions with people and objects in the environment, making it an exciting avenue for both augmented reality and robot learning. The multimodal nature is particularly compelling, with opportunities to bring together audio, language, and vision. To begin, I’ll introduce Ego4D, a massive new open-source multimodal egocentric dataset that captures the daily-life activity of people around the world. The result of a multi-year, multi-institution effort, Ego4D pushes the frontiers of first-person multimodal perception with a suite of research challenges ranging from activity anticipation to audio-visual conversation. Building on this resource, I’ll present our ideas for searching egocentric videos with natural language queries (“Where did I last see X? Did I leave the garage door open?”), injecting semantics from text and speech into powerful video representations, and learning audio-visual models to understand a camera wearer’s physical environment or augment their hearing in busy places. I’ll also touch on interesting performance-oriented challenges raised by very long video sequences (hours!) and ideas for scaling retrieval and encoders.