In this presentation we’ll look at recent advances in depth estimation from images. We’ll first focus on estimating metric depth from monocular images across different domains and camera parameters. Next, we’ll look at extensions to the multi-view setting and cover an efficient diffusion-based architecture capable of encoding hundreds of images and rendering depth and RGB images from novel viewpoints. Throughout the presentation, we’ll focus on the interplay between architectural inductive bias, training data, and optimization objective, and on their combined effect in building geometric foundation models that estimate 3D structure from images.