Space-time Neural Irradiance Fields for Free-Viewpoint Video

Wenqi Xian

Cornell Tech

Jia-Bin Huang

Virginia Tech

Johannes Kopf


Changil Kim



We present a method that learns a spatiotemporal neural irradiance field for dynamic scenes from a single video. Our learned representation enables free-viewpoint rendering of the input video. Our method builds upon recent advances in implicit representations. Learning a spatiotemporal irradiance field from a single video poses significant challenges because the video contains only one observation of the scene at any point in time. The 3D geometry of a scene can be legitimately represented in numerous ways since varying geometry (motion) can be explained with varying appearance and vice versa. We address this ambiguity by constraining the time-varying geometry of our dynamic scene representation using the scene depth estimated from video depth estimation methods, aggregating contents from individual frames into a single global representation. We provide an extensive quantitative evaluation and demonstrate compelling free-viewpoint rendering results.

Paper (arXiv)

Code (Coming Soon)


Input Videos are retimed to match the speed of the result videos. Some are slowed down to better reveal details.

Meshes results are rendered from textured meshes, each constructed from a single input depth map.

For Inpainted Meshes, the holes revealed by disocclusions in the mesh renderings are inpainted using a state-of-the-art video inpainting method. Because of memory constraints, only every 4th frame is used for video inpainting, and the resulting videos are played back 4 times slower to compensate for the missing frames.

NeRF + T refers to the original NeRF extended with a time dimension but trained without our losses. Viewing directions are not used to train these models.
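To make the NeRF + T description concrete, here is a minimal NumPy sketch of a field queried at a space-time sample (x, y, z, t) that returns color and density without a viewing direction. The function names, layer sizes, number of encoding frequencies, and random weights are all illustrative assumptions, not the architecture used for the results on this page.

```python
import numpy as np

def positional_encoding(p, num_freqs=4):
    """Map each coordinate to [p, sin(2^k p), cos(2^k p)] for k < num_freqs."""
    feats = [p]
    for k in range(num_freqs):
        feats.append(np.sin((2.0 ** k) * p))
        feats.append(np.cos((2.0 ** k) * p))
    return np.concatenate(feats, axis=-1)

def init_mlp(in_dim, hidden=32, seed=0):
    """Random weights for a tiny two-layer MLP (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, 4)), "b2": np.zeros(4),
    }

def query_field(params, xyz, t, num_freqs=4):
    """Return (rgb, sigma) for N space-time samples; no viewing direction is used."""
    xyzt = np.concatenate([xyz, t], axis=-1)           # (N, 4): time is a fourth input
    h = positional_encoding(xyzt, num_freqs)           # (N, 4 * (1 + 2 * num_freqs))
    h = np.maximum(h @ params["w1"] + params["b1"], 0.0)  # ReLU hidden layer
    out = h @ params["w2"] + params["b2"]
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))            # colors squashed into (0, 1)
    sigma = np.maximum(out[:, 3], 0.0)                 # non-negative density
    return rgb, sigma

num_freqs = 4
params = init_mlp(in_dim=4 * (1 + 2 * num_freqs))
xyz = np.zeros((5, 3))
t = np.full((5, 1), 0.5)
rgb, sigma = query_field(params, xyz, t, num_freqs)
```

The only change relative to a static field in this sketch is the extra time coordinate in the input; the output depends on position and time alone.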

Ablation Studies

NeRF + T: The baseline NeRF model extended with a time dimension but trained with only the color reconstruction loss. Viewing directions are not used.

Ld: Trained with our depth reconstruction loss (Equation 5).

Le: Trained with our empty-space loss (Equation 7).

Ls: Trained with our static scene loss (Equation 8).
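As a rough illustration of how a color reconstruction loss and a depth reconstruction loss can both be computed from the same ray samples, the sketch below uses the standard volume-rendering quadrature to obtain a rendered color and an expected depth. The squared-error forms and all function names are assumptions made for illustration; the exact definitions are Equations 5, 7, and 8 of the paper, which are not reproduced on this page, and the empty-space and static scene losses are not sketched here.

```python
import numpy as np

def render_ray(sigma, rgb, z):
    """Standard volume-rendering quadrature along one ray.

    sigma: (S,) densities, rgb: (S, 3) colors, z: (S,) increasing sample depths.
    Returns the rendered color, the expected depth, and the per-sample weights.
    """
    delta = np.diff(z, append=z[-1] + 1e10)      # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)          # opacity of each segment
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    depth = (weights * z).sum()
    return color, depth, weights

def color_loss(color, target_rgb):
    """Squared error against the observed pixel color (assumed form)."""
    return float(np.sum((color - target_rgb) ** 2))

def depth_loss(depth, target_depth):
    """Squared error against the estimated video depth (assumed form)."""
    return float((depth - target_depth) ** 2)

# A ray that hits an opaque red surface at depth 3.
sigma = np.array([0.0, 0.0, 1e3, 0.0])
rgb = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
z = np.array([1.0, 2.0, 3.0, 4.0])
color, depth, weights = render_ray(sigma, rgb, z)
```

In this example all of the weight falls on the third sample, so the rendered color is that sample's color and the expected depth equals its depth; a depth target of 3 therefore gives zero depth error.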


