Date: Wednesday, May 18 (Main Conference Day 2)
Start Time: 10:50 am
End Time: 11:20 am
Determining bird's eye view (BEV) object positions and tracks, and modeling the interactions among objects, is vital for many applications, from understanding human interactions in security settings to modeling road-object interactions in automotive scenarios. With traditional methods, this is extremely challenging and expensive because of the amount of supervision required during training. We introduce a weakly supervised end-to-end computer vision pipeline for modeling object interactions in 3D. Our architecture trains a unified network in a weakly supervised manner to estimate 3D object positions by jointly learning to regress 2D object detections and the scene's depth in a single feed-forward CNN pass, and to subsequently model object tracks. The method learns to represent each object as a BEV point, without requiring 3D or BEV annotations for training and without supplemental sensor data (e.g., LiDAR). We achieve results comparable to the state of the art while significantly reducing development costs and computation requirements.
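The sketch below is not the presenters' implementation; it is a minimal illustration of the idea the abstract describes: a single feed-forward CNN backbone with two heads, one regressing a 2D object-center heatmap and one regressing dense depth, whose outputs are combined to lift each detection to a BEV point. All class names, layer sizes, thresholds, and the pinhole-camera parameters are illustrative assumptions.

```python
# Hypothetical sketch of a joint 2D-detection + depth network whose outputs
# are lifted to bird's-eye-view (BEV) points. Not the authors' code.
import torch
import torch.nn as nn


class JointDetectionDepthNet(nn.Module):
    """Shared backbone with a 2D-detection head and a depth head (assumed design)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Center-point heatmap for 2D detection (single class for simplicity).
        self.det_head = nn.Conv2d(channels, 1, 1)
        # Dense monocular depth regression head.
        self.depth_head = nn.Conv2d(channels, 1, 1)

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)                    # single feed-forward pass
        heatmap = torch.sigmoid(self.det_head(feats))   # (B, 1, H/4, W/4)
        depth = torch.relu(self.depth_head(feats))      # (B, 1, H/4, W/4)
        return heatmap, depth


def lift_to_bev(heatmap, depth, fx, cx, score_thresh=0.5):
    """Lift detected 2D centers to BEV points using the predicted depth.

    Assumes a simple pinhole camera: a detected pixel column u with depth z
    maps to a ground-plane point (x, z) via x = (u - cx) * z / fx.
    """
    points = []
    h = heatmap[0, 0]
    d = depth[0, 0]
    for v, u in torch.nonzero(h > score_thresh):
        z = d[v, u]
        x = (u.float() - cx) * z / fx
        points.append((x.item(), z.item()))
    return points  # list of (lateral, forward) BEV coordinates


if __name__ == "__main__":
    net = JointDetectionDepthNet()
    img = torch.randn(1, 3, 256, 256)
    heatmap, depth = net(img)
    bev_points = lift_to_bev(heatmap, depth, fx=500.0, cx=32.0)
    print(f"{len(bev_points)} BEV points")
```

In the approach described above, supervision is weak (no 3D or BEV labels); a sketch like this would be trained only with 2D detection and self- or weakly supervised depth losses, with the BEV lifting used downstream for track modeling.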