
I attended the International Conference on 3D Vision (3DV) in March. 3DV is a conference that brings together researchers in 3D computer vision and graphics. At the conference, I presented our paper, Spann3R: 3D Reconstruction with Spatial Memory, which was accepted to 3DV and nominated as an award candidate.
Our work addresses the problem of dense 3D reconstruction from images alone, without any prior knowledge of camera parameters or depth. Because inferring 3D structure from 2D images is inherently ambiguous, traditional dense reconstruction pipelines typically decompose the task into a series of minimal problems that are then chained together. This approach usually requires substantial manual tuning and engineering effort. In contrast, our approach directly maps a set of 2D images to 3D with a neural network, enabling fully feed-forward dense reconstruction.
The key idea of our method is to maintain a spatial memory that stores the states of previously processed frames, and to learn to query this memory when reconstructing the geometry of each subsequent frame. Our results demonstrate a proof of concept for feed-forward 3D reconstruction and open up exciting directions for fully learning-based approaches to Structure-from-Motion (SfM) and Simultaneous Localization and Mapping (SLAM). Interestingly, as a by-product, our method can perform dense reconstruction without explicitly estimating camera poses, a concept that some later works have referred to as pose-free reconstruction.
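To make the idea concrete, here is a minimal sketch of what such an incremental, memory-conditioned feed-forward loop might look like in PyTorch. The module names, the attention-style memory read, and the simple linear layers are illustrative assumptions for exposition only; they are not the actual Spann3R architecture.

```python
# Hypothetical sketch of a feed-forward reconstruction loop with a spatial
# memory. All names and layer choices are illustrative assumptions; they do
# not reproduce the actual Spann3R implementation.
import torch
import torch.nn as nn

class MemoryReconstructor(nn.Module):
    def __init__(self, dim: int = 256, res: int = 64):
        super().__init__()
        self.res = res
        self.encoder = nn.Linear(3 * res * res, dim)     # stand-in for an image encoder
        self.to_query = nn.Linear(dim, dim)               # learned query into the memory
        self.decoder = nn.Linear(2 * dim, 3 * res * res)  # stand-in for a pointmap decoder

    def forward(self, frames):
        keys, values = [], []          # the spatial memory: states of past frames
        pointmaps = []
        for frame in frames:           # frames are processed sequentially, feed-forward only
            feat = self.encoder(frame.flatten())
            query = self.to_query(feat)
            if keys:
                # Read from memory: attention over previously stored states.
                k, v = torch.stack(keys), torch.stack(values)
                attn = torch.softmax(k @ query / k.shape[-1] ** 0.5, dim=0)
                context = attn @ v
            else:
                context = torch.zeros_like(feat)
            # Decode a per-pixel 3D pointmap in a common coordinate frame,
            # so no explicit camera pose is ever estimated.
            xyz = self.decoder(torch.cat([feat, context])).view(3, self.res, self.res)
            pointmaps.append(xyz)
            # Write the current state back into the memory for later frames.
            keys.append(feat.detach())
            values.append(feat.detach())
        return pointmaps

# Example: five frames in, five aligned pointmaps out.
frames = [torch.randn(3, 64, 64) for _ in range(5)]
pointmaps = MemoryReconstructor()(frames)
```

A real system would of course use much richer encoders, decoders, and memory representations; the sketch only aims to convey the frame-by-frame read-then-write structure described above.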
Beyond my own presentation, I attended several insightful talks, including those by Jon Barron, Noah Snavely, and Fei-Fei Li. In Jon’s talk, he raised a particularly interesting question: “Why care about 3D?” He discussed various perspectives on the role of 3D in computer vision and robotics.
Jon challenged the traditional assumption that a robot must reconstruct its environment in 3D in order to navigate it effectively. With the progress of deep learning, robots can map observations directly to control signals without explicit 3D modeling, mirroring trends in autonomous driving. Likewise, the idea that realistic image or video generation requires full 3D reconstruction is being challenged by the rapid progress of video diffusion models.
Jon offered several interesting arguments for why 3D still matters. One reason is efficiency—once a 3D scene is reconstructed, it can be rendered from many viewpoints at low cost. More fundamentally, humans live in a 3D world—so if we expect AI systems to perceive and interact like us, they must reason in 3D too.
Yet the community has not reached a consensus on how to enable AI models to learn 3D efficiently. While vast amounts of 2D image data are readily available thanks to smartphones and the Internet, acquiring large-scale, high-quality 3D data remains costly. Hopefully, progress in 3D foundation models will unlock new ways to scale up the learning of 3D representations from web-scale casual videos without ground-truth 3D, and bridge the gap between 2D and 3D through unified multi-modal representations.
Further information on Hengyi’s project can be found at https://hengyiwang.github.io/projects/spanner.