Oral Session

Oral 4A: Neural 3D Rendering

Gold Room

Moderators: Martin R. Oswald · Gim Hee Lee

Wed 2 Oct 4:30 a.m. PDT — 6:30 a.m. PDT

Wed 2 Oct. 4:30 - 4:40 PDT

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

Basile Van Hoorick · Rundi Wu · Ege Ozguroglu · Kyle Sargent · Ruoshi Liu · Pavel Tokmakov · Achal Dave · Changxi Zheng · Carl Vondrick

Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups and significantly restricting their utility in the wild as well as in embodied AI applications. In this paper, we propose GCD, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input and does not explicitly model 3D scene geometry; instead, it performs end-to-end video-to-video translation to achieve its goal efficiently. Despite being trained only on synthetic multi-view video data, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality. Project webpage: https://gcd.cs.columbia.edu/
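
As a rough illustration of the "relative camera pose parameters" that condition the generation, the sketch below computes a relative pose between a source and target camera and encodes it as a compact vector. It is not the authors' code; the 4x4 camera-to-world convention and the axis-angle encoding are assumptions made for illustration.

```python
import numpy as np

def camera_to_world(R, t):
    """Assemble a 4x4 camera-to-world matrix from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(cam_src, cam_tgt):
    """Pose of the target camera expressed in the source camera's frame:
    T_rel = inv(C_src) @ C_tgt maps target-camera coordinates to source-camera coordinates."""
    return np.linalg.inv(cam_src) @ cam_tgt

def pose_to_vector(T_rel):
    """Encode the relative pose as a 6-D conditioning vector: translation (3) + axis-angle rotation (3)."""
    R, t = T_rel[:3, :3], T_rel[:3, 3]
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        axis = np.zeros(3)
    else:
        axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
        axis /= (2.0 * np.sin(angle))
    return np.concatenate([t, axis * angle])

# Example: target camera rotated 30 degrees about the y-axis and shifted sideways.
theta = np.deg2rad(30.0)
R_y = np.array([[np.cos(theta), 0, np.sin(theta)],
                [0, 1, 0],
                [-np.sin(theta), 0, np.cos(theta)]])
src = camera_to_world(np.eye(3), np.zeros(3))
tgt = camera_to_world(R_y, np.array([0.5, 0.0, 0.0]))
print(pose_to_vector(relative_pose(src, tgt)))  # 6-D conditioning vector
```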

Wed 2 Oct. 4:40 - 4:50 PDT

Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering

Antoine Guedon · Vincent Lepetit

We propose Gaussian Frosting, a novel mesh-based representation for high-quality rendering and editing of complex 3D effects in real-time. Our approach builds on the recent 3D Gaussian Splatting framework, which optimizes a set of 3D Gaussians to approximate a radiance field from images. We propose first extracting a base mesh from the Gaussians during optimization, then building and refining an adaptive layer of Gaussians with a variable thickness around the mesh to better capture the fine details and volumetric effects near the surface, such as hair or grass. We call this layer Gaussian Frosting, as it resembles a coating of frosting on a cake. The fuzzier the material, the thicker the frosting. We also introduce a parameterization of the Gaussians that constrains them to stay inside the frosting layer and automatically adjusts their parameters when the mesh is deformed, rescaled, edited, or animated. Our representation allows for efficient rendering using Gaussian splatting, as well as editing and animation by modifying the base mesh. We demonstrate the effectiveness of our method on various synthetic and real scenes, and show that it outperforms existing surface-based approaches. We will release our code and a web-based viewer as additional contributions.
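
The key idea above is a parameterization that ties each Gaussian to the base mesh and keeps it inside a layer of spatially varying thickness. The minimal sketch below is a simplification and not the paper's exact scheme: it assumes a per-Gaussian triangle index, barycentric coordinates, and a bounded offset along the interpolated normal.

```python
import numpy as np

def gaussian_center(verts, normals, tri, bary, u, inner, outer):
    """Place a Gaussian tied to triangle `tri` of a mesh.

    verts, normals : (V, 3) vertex positions and unit normals (updated when the mesh deforms)
    tri            : (3,) vertex indices of the supporting triangle
    bary           : (3,) barycentric coordinates, >= 0 and summing to 1
    u              : unbounded scalar; tanh(u) keeps the point strictly inside the layer
    inner, outer   : layer bounds along the normal at this Gaussian (inner <= outer)
    """
    p = bary @ verts[tri]            # point on the triangle
    n = bary @ normals[tri]
    n /= np.linalg.norm(n)           # interpolated unit normal
    # Map u in (-inf, inf) to an offset in (inner, outer): the Gaussian can never leave the layer.
    s = 0.5 * (np.tanh(u) + 1.0)
    return p + (inner + s * (outer - inner)) * n

# Toy mesh: a single triangle in the z = 0 plane with upward normals.
verts = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0]])
normals = np.tile(np.array([0.0, 0, 1]), (3, 1))
c = gaussian_center(verts, normals, tri=np.array([0, 1, 2]),
                    bary=np.array([0.2, 0.3, 0.5]), u=0.0, inner=-0.02, outer=0.10)
print(c)  # center follows the mesh if `verts` / `normals` are edited or animated
```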

Wed 2 Oct. 4:50 - 5:00 PDT

Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration

Zhihao Liang · Qi Zhang · Wenbo Hu · Ying Feng · Lei Zhu · Kui Jia

3D Gaussian Splatting (3DGS) has recently gained popularity by combining the advantages of both primitive-based and volumetric 3D representations, resulting in improved quality and efficiency for 3D scene rendering. However, 3DGS is not alias-free, and its rendering at varying resolutions can produce severe blurring or jaggies. This is because 3DGS treats each pixel as an isolated, single point rather than as an area, causing insensitivity to changes in the footprints of pixels. Consequently, this discrete sampling scheme inevitably results in aliasing, owing to the restricted sampling bandwidth. In this paper, we derive an analytical solution to address this issue. More specifically, we use a conditioned logistic function as the analytic approximation of the cumulative distribution function (CDF) of a one-dimensional Gaussian signal and calculate the Gaussian integral by subtracting the CDFs. We then introduce this approximation into the two-dimensional pixel shading and present Analytic-Splatting, which analytically approximates the Gaussian integral within the 2D pixel window to better capture the intensity response of each pixel. Moreover, we use the approximated pixel-window response in the transmittance calculation of volume rendering, making Analytic-Splatting sensitive to changes in the pixel footprint at different resolutions. Experiments on various datasets validate that our approach has better anti-aliasing capability, yielding finer details and better fidelity.
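
The core computation described above is approximating the integral of a Gaussian over a pixel window by differencing an analytic CDF surrogate. The sketch below is a numeric illustration, not the paper's formulation: it uses the standard logistic approximation of the normal CDF, Phi(x) ~= sigmoid(1.702 x), rather than the paper's conditioned logistic, and assumes an axis-aligned 2D Gaussian so the window integral factorizes.

```python
import math

def logistic_cdf(x, scale=1.702):
    """Logistic approximation to the standard normal CDF, Phi(x) ~= sigmoid(1.702 * x)."""
    return 1.0 / (1.0 + math.exp(-scale * x))

def exact_cdf(x):
    """Exact standard normal CDF via the error function, for comparison."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_mass_1d(lo, hi, mu, sigma, cdf=logistic_cdf):
    """Integral of N(mu, sigma^2) over [lo, hi], computed as a CDF difference."""
    return cdf((hi - mu) / sigma) - cdf((lo - mu) / sigma)

def pixel_response(px, py, mu, sigma):
    """Mass of an axis-aligned 2D Gaussian inside the unit pixel whose lower corner is (px, py),
    replacing the single-point evaluation at the pixel center used by vanilla splatting."""
    return (gaussian_mass_1d(px, px + 1.0, mu[0], sigma[0]) *
            gaussian_mass_1d(py, py + 1.0, mu[1], sigma[1]))

# Compare the logistic approximation against the exact erf-based mass for one pixel.
mu, sigma = (3.3, 3.6), (0.4, 0.4)
approx = pixel_response(3, 3, mu, sigma)
exact = (gaussian_mass_1d(3, 4, mu[0], sigma[0], exact_cdf) *
         gaussian_mass_1d(3, 4, mu[1], sigma[1], exact_cdf))
print(f"analytic-window approx = {approx:.4f}, exact mass = {exact:.4f}")
```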

Wed 2 Oct. 5:00 - 5:10 PDT

FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information

Wen Jiang · Boshu Lei · Kostas Daniilidis

This study addresses the challenging problem of active view selection and uncertainty quantification within the domain of Radiance Fields. Neural Radiance Fields (NeRF) have greatly advanced image rendering and reconstruction, but the cost of acquiring images makes it necessary to select the most informative viewpoints efficiently. Existing approaches depend on modifying the model architecture or on a hypothetical perturbation field to indirectly approximate the model uncertainty. However, selecting views from indirect approximation does not guarantee optimal information gain for the model. By leveraging Fisher Information, we directly quantify observed information on the parameters of Radiance Fields and select candidate views by maximizing the Expected Information Gain (EIG). Our method achieves state-of-the-art results on multiple tasks, including view selection, active mapping, and uncertainty quantification, demonstrating its potential to advance the field of Radiance Fields. With a 3D Gaussian Splatting backend, our method can perform view selection at 70 fps.
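
The selection criterion above scores each candidate view by the information its pixels carry about the model parameters. The toy sketch below illustrates that idea under assumptions of my own: Gaussian pixel noise so the Fisher information is J^T J, a diagonal approximation of the Hessian, and random matrices standing in for real rendering Jacobians.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_pixels, n_candidates = 500, 200, 8

# Diagonal Fisher information accumulated from already-observed views, plus a small damping prior.
J_train = rng.normal(size=(n_pixels * 4, n_params))
h_train = (J_train ** 2).sum(axis=0) + 1e-2

def info_gain(J_view, h_prior):
    """Diagonal surrogate of the information gain of one candidate view:
    log det(I + H_prior^{-1} H_view) = sum_i log(1 + h_view_i / h_prior_i) under the diagonal approx."""
    h_view = (J_view ** 2).sum(axis=0)
    return np.sum(np.log1p(h_view / h_prior))

# Score every candidate view (Jacobian of its rendered pixels w.r.t. the parameters), pick the best.
candidates = [rng.normal(scale=rng.uniform(0.5, 2.0), size=(n_pixels, n_params))
              for _ in range(n_candidates)]
scores = [info_gain(J, h_train) for J in candidates]
best = int(np.argmax(scores))
print(f"selected view {best} with score {scores[best]:.1f}")
```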

Wed 2 Oct. 5:10 - 5:20 PDT

RaFE: Generative Radiance Fields Restoration

Zhongkai Wu · Ziyu Wan · Jing Zhang · Jing Liao · Dong Xu

NeRF (Neural Radiance Fields) has demonstrated tremendous potential in novel view synthesis and 3D reconstruction, but its performance is sensitive to input image quality, and it struggles to achieve high-fidelity rendering when provided with low-quality, sparse input views. Previous methods for NeRF restoration are tailored to a specific degradation type, ignoring the generality of restoration. To overcome this limitation, we propose a generic radiance fields restoration pipeline, named RaFE, which applies to various types of degradations, such as low resolution, blurriness, noise, JPEG compression artifacts, or their combinations. Our approach leverages the success of off-the-shelf 2D restoration methods to recover the multi-view images individually. Instead of reconstructing a blurred NeRF by averaging inconsistencies, we introduce a novel approach using Generative Adversarial Networks (GANs) for NeRF generation to better accommodate the geometric and appearance inconsistencies present in the multi-view images. Specifically, we adopt a two-level triplane architecture, where the coarse level remains fixed to represent the low-quality NeRF, while a fine-level residual triplane, added to the coarse level, is modeled as a distribution with a GAN to capture potential variations in restoration. We validate our proposed method on both synthetic and real cases for various restoration tasks, demonstrating superior performance in both quantitative and qualitative evaluations, surpassing other 3D restoration methods specific to single tasks.
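
The representation described above keeps a fixed coarse triplane for the low-quality NeRF and adds a GAN-sampled fine residual triplane. The compact sketch below shows only the feature-query side, with simplifications of my own: nearest-neighbor plane lookup instead of bilinear interpolation, and a random tensor standing in for the generator's output.

```python
import numpy as np

rng = np.random.default_rng(0)
R, C = 64, 16          # plane resolution and channels per plane

# Coarse triplane fitted to the degraded inputs (kept frozen) plus a residual sampled from the GAN.
coarse = rng.normal(size=(3, C, R, R)) * 0.1
residual = rng.normal(size=(3, C, R, R)) * 0.05   # in RaFE this would come from a generator G(z)
planes = coarse + residual

def query_triplane(planes, xyz):
    """Fetch a feature for 3D points in [-1, 1]^3 by projecting onto the XY, XZ, YZ planes
    and summing the three per-plane features (nearest-neighbor lookup for brevity)."""
    def grid(a, b):
        i = np.clip(((a + 1) * 0.5 * (R - 1)).round().astype(int), 0, R - 1)
        j = np.clip(((b + 1) * 0.5 * (R - 1)).round().astype(int), 0, R - 1)
        return i, j
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    feats = 0
    for plane, (a, b) in zip(planes, [(x, y), (x, z), (y, z)]):
        i, j = grid(a, b)
        feats = feats + plane[:, i, j].T          # (N, C) contribution per plane
    return feats                                  # decoded to density/color by a small MLP in practice

pts = rng.uniform(-1, 1, size=(5, 3))
print(query_triplane(planes, pts).shape)          # (5, 16)
```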

Wed 2 Oct. 5:20 - 5:30 PDT

Watch Your Steps: Local Image and Scene Editing by Text Instructions

Ashkan Mirzaei · Tristan T Aumentado-Armstrong · Marcus A Brubaker · Jonathan Kelly · Alex Levinshtein · Konstantinos Derpanis · Igor Gilitschenski

The success of denoising diffusion models in generating and editing images has sparked interest in using diffusion models for editing 3D scenes represented via neural radiance fields (NeRFs). However, current 3D editing methods lack a way to both pinpoint the edit location and limit changes to the desired volumetric region. Consequently, these methods often over-edit, altering irrelevant parts of the scene. We introduce a new task, 3D edit localization, to automatically identify the relevant region for an editing task and restrict the edit accordingly. To achieve this goal, we initially tackle 2D edit localization, and then lift it to multiple views to address the 3D localization challenge. For 2D localization, we leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. We refer to this discrepancy as the relevance map. The relevance map conveys the importance of changing each pixel to achieve an edit, and guides downstream modifications, ensuring that pixels irrelevant to the edit remain unchanged. With the relevance maps of multi-view posed images, we can define the relevance field, which specifies the 3D region within which modifications should be made. This enables us to improve the quality of text-guided 3D NeRF scene editing by performing iterative updates on the training views, guided by renders from the relevance field. Our method achieves state-of-the-art performance on both NeRF and image editing tasks. We will make the code available.
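
One concrete way to read the 2D localization step above: run IP2P twice, with and without the instruction, and turn the per-pixel discrepancy between the two predictions into a mask that gates the edit. The sketch below uses placeholder arrays standing in for the actual IP2P outputs; the normalization and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64

# Stand-ins for IP2P predictions on the same image (e.g., predicted noise or edited RGB),
# one conditioned on the text instruction and one without it.
pred_with_instruction = rng.normal(size=(H, W, 3))
pred_without_instruction = pred_with_instruction.copy()
pred_without_instruction[20:40, 20:40] += 1.5        # pretend the edit only affects this region

# Relevance map: per-pixel discrepancy, averaged over channels and normalized to [0, 1].
relevance = np.abs(pred_with_instruction - pred_without_instruction).mean(axis=-1)
relevance = (relevance - relevance.min()) / (relevance.max() - relevance.min() + 1e-8)

# Gate the edit: pixels deemed irrelevant keep their original values.
mask = (relevance > 0.5).astype(np.float32)[..., None]
original = rng.uniform(size=(H, W, 3))
edited = rng.uniform(size=(H, W, 3))                 # stand-in for the full IP2P edit
localized_edit = mask * edited + (1.0 - mask) * original
print(f"edited pixels: {int(mask.sum())} / {H * W}")
```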

Wed 2 Oct. 5:30 - 5:40 PDT

MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

Yuedong Chen · Haofei Xu · Chuanxia Zheng · Bohan Zhuang · Marc Pollefeys · Andreas Geiger · Tat-Jen Cham · Jianfei Cai

We propose MVSplat, an efficient feed-forward 3D Gaussian Splatting model learned from sparse multi-view images. To accurately localize the Gaussian centers, we propose to build a cost volume representation via plane sweeping in 3D space, where the cross-view feature similarities stored in the cost volume provide valuable geometry cues for depth estimation. We learn the Gaussian primitives' opacities, covariances, and spherical harmonics coefficients jointly with the Gaussian centers while relying only on photometric supervision. We demonstrate the importance of the cost volume representation in learning feed-forward Gaussian Splatting models via extensive experimental evaluations. On the large-scale RealEstate10K and ACID benchmarks, our model achieves state-of-the-art performance with the fastest feed-forward inference speed (22 fps). Compared to the latest state-of-the-art method pixelSplat, our model uses 10 times fewer parameters and infers more than 2 times faster while providing higher appearance and geometry quality as well as better cross-dataset generalization.
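
The cost volume above stores cross-view feature similarity for a sweep of depth candidates. The toy sketch below makes strong simplifying assumptions: rectified cameras so a depth plane reduces to a horizontal pixel shift, dot-product similarity, and a soft-argmin over candidates; the actual model operates on learned features with full plane-sweep warps.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, D = 8, 32, 48, 16                      # feature channels, image size, depth candidates

feat_ref = rng.normal(size=(C, H, W))           # reference-view features
feat_src = np.roll(feat_ref, shift=5, axis=2)   # source view = reference shifted by 5 px (toy scene)

# Sweep disparity candidates: shift source features back and correlate with the reference.
disparities = np.arange(D)                      # proxy for depth planes (disparity ~ 1 / depth)
cost_volume = np.stack([(feat_ref * np.roll(feat_src, -d, axis=2)).sum(axis=0)
                        for d in disparities])  # (D, H, W) cross-view similarity

# Soft-argmin over candidates gives a per-pixel disparity (depth) estimate.
prob = np.exp(cost_volume - cost_volume.max(axis=0, keepdims=True))
prob /= prob.sum(axis=0, keepdims=True)
disparity = (prob * disparities[:, None, None]).sum(axis=0)
print(f"median estimated disparity: {np.median(disparity):.2f} (ground truth 5)")
```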

Wed 2 Oct. 5:40 - 5:50 PDT

RPBG: Towards Robust Neural Point-based Graphics in the Wild

Qingtian Zhu · Zizhuang Wei · Zhongtian Zheng · Yifan Zhan · Zhuyu Yao · Jiawang Zhang · Kejian Wu · Yinqiang Zheng

Point-based representations have recently gained popularity in novel view synthesis for their unique advantages, e.g., intuitive geometric representation, simple manipulation, and faster convergence. However, based on our observations, these point-based neural re-rendering methods perform well only under ideal conditions; they suffer from noisy, patchy points and unbounded scenes, which are challenging to handle but de facto common in real applications. To this end, we revisit one such influential method, known as Neural Point-based Graphics (NPBG), as our baseline, and propose Robust Point-based Graphics (RPBG). We analyze in depth the factors that prevent NPBG from achieving satisfactory renderings on generic datasets, and accordingly reform the pipeline to make it more robust to varying in-the-wild datasets. Inspired by the practices in image restoration, we greatly enhance the neural renderer to enable the attention-based correction of point visibility and the inpainting of incomplete rasterization, with only acceptable overheads. We also seek a simple and lightweight alternative for environment modeling and an iterative method to alleviate the problem of poor geometry. Through thorough evaluation on a wide range of datasets with different shooting conditions and camera trajectories, RPBG stably outperforms the baseline by a large margin and exhibits greater robustness than state-of-the-art NeRF-based variants. Code is provided in the Supplementary Material.
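
For context on the NPBG-style baseline being reformed: per-point learned descriptors are rasterized into image-space feature maps with a z-buffer, and a neural renderer then converts those sparse, hole-ridden maps into RGB, which is where RPBG's attention-based visibility correction and inpainting act. The sketch below shows only the descriptor rasterization step under my own simplifications (pinhole projection, a single scale, nearest-pixel splatting; all shapes are illustrative).

```python
import numpy as np

def rasterize_descriptors(points, descriptors, K, H, W):
    """Splat per-point neural descriptors into a (D, H, W) feature image with a z-buffer.
    points      : (N, 3) in camera coordinates (z > 0 in front of the camera)
    descriptors : (N, D) learned per-point features
    K           : (3, 3) pinhole intrinsics
    """
    D = descriptors.shape[1]
    feat = np.zeros((D, H, W))
    zbuf = np.full((H, W), np.inf)
    uvw = (K @ points.T).T                        # project to the image plane
    u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
    for n in np.argsort(-points[:, 2]):           # far-to-near so nearer points overwrite farther ones
        if 0 <= u[n] < W and 0 <= v[n] < H and points[n, 2] > 0:
            feat[:, v[n], u[n]] = descriptors[n]
            zbuf[v[n], u[n]] = points[n, 2]
    return feat, zbuf                             # many pixels stay empty -> inpainted by the renderer

rng = np.random.default_rng(0)
pts = rng.uniform([-1, -1, 2], [1, 1, 4], size=(2000, 3))
desc = rng.normal(size=(2000, 8))
K = np.array([[100.0, 0, 64], [0, 100.0, 64], [0, 0, 1]])
feat, zbuf = rasterize_descriptors(pts, desc, K, 128, 128)
print(f"covered pixels: {np.isfinite(zbuf).sum()} / {128 * 128}")
```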

Wed 2 Oct. 5:50 - 6:00 PDT

Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields

Yonggan Fu · Huaizhi Qu · Zhifan Ye · Chaojian Li · Kevin Zhao · Yingyan Lin

Recent breakthroughs in Neural Radiance Fields (NeRFs) have sparked significant demand for their integration into real-world 3D applications. However, the varied functionalities required by different 3D applications often necessitate diverse NeRF models with various pipelines, leading to tedious NeRF training for each target task and cumbersome trial-and-error experiments. Drawing inspiration from the generalization capability and adaptability of emerging foundation models, our work aims to develop one general-purpose NeRF for handling diverse 3D tasks. We achieve this by proposing a framework called Omni-Recon, which is capable of (1) generalizable 3D reconstruction and zero-shot multitask scene understanding, and (2) adaptability to diverse downstream 3D applications such as real-time rendering and scene editing. Our key insight is that an image-based rendering pipeline, with accurate geometry and appearance estimation, can lift 2D image features into their 3D counterparts, thus extending widely explored 2D tasks to the 3D world in a generalizable manner. Specifically, our Omni-Recon features a general-purpose NeRF model using image-based rendering with two decoupled branches: one complex transformer-based branch that progressively fuses geometry and appearance features for accurate geometry estimation, and one lightweight branch for predicting blending weights of source views. This design achieves state-of-the-art (SOTA) generalizable 3D surface reconstruction quality with blending weights reusable across diverse tasks for zero-shot multitask scene understanding. In addition, it can enable real-time rendering after baking the complex geometry branch into meshes, swift adaptation to achieve SOTA generalizable 3D understanding performance, and seamless integration with 2D diffusion models for text-guided 3D editing. All code will be released upon acceptance.
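
The lightweight branch above predicts per-source-view blending weights, which is the classic image-based-rendering step: the appearance of a surface point is a weighted combination of samples taken from the source views. The tiny sketch below shows just that blending step with placeholder arrays; in the real pipeline the weights come from the lightweight branch and the samples from the estimated geometry.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_views = 1024, 6

# Colors sampled from each source view at the projections of the 3D surface points,
# and the blending logits a lightweight branch would predict for those views.
sampled_colors = rng.uniform(size=(n_points, n_views, 3))
blend_logits = rng.normal(size=(n_points, n_views))

# Softmax over source views, then blend: c(x) = sum_i w_i(x) * c_i(x).
w = np.exp(blend_logits - blend_logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
blended = (w[..., None] * sampled_colors).sum(axis=1)        # (n_points, 3)

# Because the weights depend on the source views rather than on a specific task, the same
# weights can also lift other per-view 2D predictions (e.g., segmentation logits) into 3D.
sampled_logits = rng.normal(size=(n_points, n_views, 21))    # e.g., 21-class 2D logits per view
lifted_semantics = (w[..., None] * sampled_logits).sum(axis=1)
print(blended.shape, lifted_semantics.shape)
```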

Wed 2 Oct. 6:00 - 6:10 PDT

Learning 3D-aware GANs from Unposed Images with Template Feature Field

Xinya Chen · Hanlei Guo · Yanrui Bin · Shangzhan Zhang · Yuanbo Yang · Yujun Shen · Yue Wang · Yiyi Liao

Collecting accurate camera poses for training images has been shown to benefit the learning of 3D-aware generative adversarial networks (GANs), yet it can be quite expensive in practice. This work targets learning 3D-aware GANs from unposed images, for which we propose to perform on-the-fly pose estimation of training images with a learned template feature field (TEFF). Concretely, in addition to a generative radiance field as in previous approaches, we ask the generator to also learn a feature field from 2D semantic features while sharing the density of the radiance field. Such a framework allows us to acquire a canonical 3D feature template leveraging the dataset mean discovered by the generative model, and further efficiently estimate the pose parameters on real data. Experimental results on various challenging datasets demonstrate the superiority of our approach over state-of-the-art alternatives from both the qualitative and the quantitative perspectives. Code and models will be made public.
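
The on-the-fly pose estimation above amounts to rendering the canonical template feature field from candidate poses and picking the pose whose rendered features best match the 2D semantic features of the real image. The toy sketch below uses placeholder feature maps, a coarse grid search over azimuth, and L2 matching; these are illustrative assumptions, and the real system is differentiable and far more efficient.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 16, 16, 32
azimuths = np.linspace(0.0, 2 * np.pi, 36, endpoint=False)

# Stand-ins for renders of the template feature field (TEFF) from each candidate azimuth.
template_renders = rng.normal(size=(len(azimuths), H, W, C))

# Stand-in for the 2D semantic features of a real image whose true azimuth is candidate 7.
real_features = template_renders[7] + 0.1 * rng.normal(size=(H, W, C))

def estimate_pose(real_feat, renders, azimuths):
    """Pick the candidate pose minimizing the L2 distance between the image's 2D features
    and the rendered template features."""
    errors = ((renders - real_feat[None]) ** 2).sum(axis=(1, 2, 3))
    best = int(np.argmin(errors))
    return azimuths[best], errors[best]

azimuth, err = estimate_pose(real_features, template_renders, azimuths)
print(f"estimated azimuth: {np.degrees(azimuth):.1f} deg (true: {np.degrees(azimuths[7]):.1f} deg)")
```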

Wed 2 Oct. 6:10 - 6:20 PDT

MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition

Aggelina Chatziagapi · Grigorios Chrysos · Dimitris Samaras

We introduce MIGS (multi-identity Gaussian splatting), a novel method that learns a single neural representation for multiple identities, using only monocular videos. Recent 3D Gaussian Splatting (3DGS) approaches for human avatars require per-identity optimization. However, learning a multi-identity representation presents advantages in robustly animating humans under arbitrary poses. We propose to construct a high-order tensor that combines all the learnable parameters of our 3DGS representation for all the training identities. By factorizing the tensor, we model the complex rigid and non-rigid deformations of multiple human subjects in a unified network using a reduced number of parameters. Our proposed approach leverages information from all the training identities, enabling robust animation under challenging unseen poses, outperforming existing approaches. We also demonstrate how it can be extended to learn unseen identities.
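
The parameter sharing above comes from stacking every identity's 3DGS parameters into one high-order tensor and factorizing it. The small sketch below illustrates the counting argument with a CP-style factorization; the shapes, rank, and the choice of CP over other decompositions are assumptions made for illustration.

```python
import numpy as np

n_identities, n_gaussians, n_params = 10, 50_000, 59   # e.g., position, rotation, scale, opacity, SH
rank = 32

# CP-style factors: the full tensor T (identities x Gaussians x per-Gaussian params)
# is approximated as T[i, g, p] = sum_r A[i, r] * B[g, r] * C[p, r].
rng = np.random.default_rng(0)
A = rng.normal(size=(n_identities, rank))    # identity factors
B = rng.normal(size=(n_gaussians, rank))     # per-Gaussian factors (structure shared across people)
C = rng.normal(size=(n_params, rank))        # per-parameter factors

def params_for_identity(i):
    """Reconstruct the (n_gaussians, n_params) 3DGS parameter matrix of identity i."""
    return np.einsum('r,gr,pr->gp', A[i], B, C)

theta_3 = params_for_identity(3)
full = n_identities * n_gaussians * n_params
factored = A.size + B.size + C.size
print(theta_3.shape)                                   # (50000, 59)
print(f"parameters: factored {factored:,} vs. unshared {full:,} ({full / factored:.1f}x reduction)")
```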