Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J. Mitra, Paul Guerrero
Abstract: Diffusion models currently achieve state-of-the-art performance for both conditional and unconditional image generation. However, so far, image diffusion models do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction. In this paper, we present RenderDiffusion as the first diffusion model for 3D generation and inference that can be trained using only monocular 2D supervision. At the heart of our method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of a scene in each denoising step. This imposes a strong inductive structure on the diffusion process that gives us a 3D-consistent representation while only requiring 2D supervision. The resulting 3D representation can be rendered from any viewpoint. We evaluate RenderDiffusion on the ShapeNet and CLEVR datasets and show competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images. Additionally, our diffusion-based approach allows us to use 2D inpainting to edit 3D scenes. We believe that our work promises to enable full 3D generation at scale when trained on massive image collections, thus circumventing the need to have large-scale 3D model collections for supervision.
Our method builds on the successful training and generation setup of 2D image diffusion models, which are trained to denoise input images that have various amounts of added noise. At test time, novel images are generated by applying the model in multiple steps to progressively recover an image, starting from pure noise samples. We keep this training and generation setup, but modify the architecture of the main denoiser to encode the noisy input image into a 3D representation of the scene that is volumetrically rendered to obtain the denoised output image. This introduces an inductive bias that favors 3D scene consistency, and allows us to render the 3D representation from novel viewpoints. The figure below shows an overview of our architecture.
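As a rough illustration of what "volumetrically rendered" means, the sketch below composites samples along a single camera ray using the standard emission-absorption quadrature common to NeRF-style renderers. It is a minimal sketch under stated assumptions, not the renderer used in RenderDiffusion: the function name `render_ray` and its parameters `densities`, `colors`, and `deltas` are hypothetical, and the step that would query the 3D representation for per-sample density and color is omitted.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite per-sample colors along one ray (hypothetical helper).

    densities: non-negative volume density at each sample along the ray
    colors:    (r, g, b) tuple at each sample
    deltas:    length of the ray segment that each sample represents
    """
    transmittance = 1.0  # fraction of light that still reaches this sample
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha            # light left for later samples
    # Return the pixel color and the accumulated opacity along the ray.
    return pixel, 1.0 - transmittance


# Usage: empty space, then an effectively opaque red surface that hides
# the blue sample behind it.
pixel, opacity = render_ray(
    densities=[0.0, 1e6, 5.0],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
```

Because every operation here is differentiable with respect to the densities and colors, a loss on the rendered 2D image can in principle supervise the underlying 3D representation, which is consistent with the 2D-only supervision described above.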
We evaluate RenderDiffusion on three tasks: monocular 3D reconstruction, unconditional generation, and 3D-aware inpainting.
Unlike existing 2D diffusion models, we can use RenderDiffusion to reconstruct 3D scenes from 2D images. To reconstruct the scene shown in an input image
Using a 3D-aware denoiser allows us to reconstruct a 3D scene from noisy images, where information that is lost to the noise is filled in with generated content. By adding more noise, we can generalize to input images that are increasingly out-of-distribution, at the cost of reconstruction fidelity. In the figure below, we show 3D reconstructions from photos that have significantly different backgrounds and materials than the images seen at training time. We see that results with added noise (
Below we show qualitative results for unconditional generation.
Lastly, we apply our trained model to the task of inpainting masked 2D regions of an image while simultaneously reconstructing the 3D shape it shows. We follow an approach similar to RePaint, but using our 3D denoiser instead of their 2D architecture. Specifically, we condition the denoising iterations on the known regions of the image, by setting
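The general idea of RePaint-style conditioning can be sketched as follows: after each reverse step, pixels in the known region are overwritten with the known image noised to the current level, while unknown pixels keep the model's sample. This is a minimal sketch of that general masking step, not necessarily the exact update used in RenderDiffusion. The helper names `noise_to_level` and `merge_known`, the flat-list image layout, and the `alpha_bar_prev` parameter (the cumulative signal fraction at the previous timestep, as in standard DDPM notation) are all assumptions made for illustration.

```python
import math
import random


def noise_to_level(x0, alpha_bar, rng):
    """Forward-diffuse clean pixel values x0 to the level given by alpha_bar.

    Uses the standard DDPM form: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


def merge_known(x_prev_sampled, image, known_mask, alpha_bar_prev, rng):
    """Combine the model's sample with the known image (hypothetical helper).

    Where known_mask is 1, use the known image noised to the same level as
    the sample; where it is 0, keep the value sampled by the denoiser.
    """
    known_noised = noise_to_level(image, alpha_bar_prev, rng)
    return [
        known if m else sampled
        for m, known, sampled in zip(known_mask, known_noised, x_prev_sampled)
    ]


# Usage on a 3-pixel "image": pixels 0 and 2 are known, pixel 1 is masked out.
rng = random.Random(0)
merged = merge_known(
    x_prev_sampled=[9.0, 9.0, 9.0],  # stand-in for the denoiser's sample
    image=[0.1, 0.2, 0.3],
    known_mask=[1, 0, 1],
    alpha_bar_prev=0.5,
    rng=rng,
)
```

In a full sampler this merge would run once per denoising iteration, so the masked region is generated while the visible region stays anchored to the input.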
To aid reproducibility, we will soon release our datasets, code, and checkpoints.
Check out related prior and concurrent work:
- PixelNeRF is a non-generative method for inference of implicit 3D representations.
- EG3D is a generative 3D model based on GANs with a triplane representation.
- Concurrently, GAUDI presents a diffusion model for generation of 3D camera paths and up to 300 scenes. However, unlike ours, it requires two stages of training, with the diffusion model operating only on a latent space. In contrast, our diffusion model is defined directly over pixels, which allows exciting applications such as refinement of generated images and inpainting.
@article{anciukevicius2022renderdiffusion,
title = {{RenderDiffusion}: Image Diffusion for {3D} Reconstruction, Inpainting and Generation},
author = {Titas Anciukevicius and Zexiang Xu and Matthew Fisher and Paul Henderson and Hakan Bilen and Mitra, Niloy J. and Paul Guerrero},
year = 2022,
journal = {arXiv}
}