Abstract
We introduce 4D Motion Scaffolds (MoSca), a neural information processing system designed to reconstruct and
synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. To address
such a challenging and ill-posed inverse problem, we leverage prior knowledge from foundational vision
models and lift the video data to a novel Motion Scaffold (MoSca) representation, which compactly and smoothly
encodes the underlying motions/deformations. The scene geometry and appearance are then disentangled from the deformation field and encoded by Gaussians anchored onto the MoSca, which are globally fused and optimized
via Gaussian Splatting. Additionally, camera poses can be seamlessly initialized and refined during the
dynamic rendering process, without the need for external pose estimation tools. Experiments demonstrate
state-of-the-art performance on dynamic rendering benchmarks. The code will be released no later than acceptance of this paper.