Oral Session
Oral 7C: Optimization And Theory
Silver Room
Moderators: Qixing Huang · Vladislav Golyanik
A Direct Approach to Viewing Graph Solvability
Federica Arrigoni · Andrea Fusiello · Tomas Pajdla
The viewing graph is a useful way to represent uncalibrated cameras and their geometric relationships: nodes correspond to cameras and edges represent fundamental matrices. By analyzing this graph, it is possible to establish if the problem is "solvable" in the sense that there exists a unique (up to a single projective transformation) set of cameras that are compliant with the given fundamental matrices. In this paper, we take several steps forward in the study of viewing graph solvability: we propose a new formulation of the problem that is more direct than previous literature, based on a formula that explicitly links pairs of cameras via their fundamental matrix; we introduce the new concept of "infinitesimal solvability", demonstrating its usefulness in understanding real structure from motion graphs; we propose an algorithm for testing infinitesimal solvability and extracting components of unsolvable cases, that is more efficient than previous work; we set up an open research question on the connection between infinitesimal solvability and solvability.
Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees
Robin Kenis · Emanuel Laude · Panagiotis Patrinos
While neural network models have garnered significant attention in the imaging community, their application remains limited in important settings where optimality certificates are required or in the absence of extensive datasets. In such cases, classical models like (continuous) Markov Random Fields (MRFs) remain preferable. However, the associated optimization problem is nonconvex, and therefore very challenging to solve globally. This difficulty is further exacerbated in the case of nonconvex state spaces, such as the unit sphere. To address this, we propose a convex Semidefinite Programming (SDP) relaxation to provide lower bounds for these optimization challenges. Our relaxation provably approximates a certain infinite-dimensional convex lifting in measure spaces. Notably, our approach furnishes a certificate of (near) optimality when the relaxation (closely) approximates the unlifted problem. Our experiments show that our relaxation outperforms popular linear relaxations for many interesting problems.
Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering
Benjamin Attal · Dor Verbin · Ben Mildenhall · Peter Hedman · Jonathan T Barron · Matthew O'Toole · Pratul Srinivasan
State-of-the-art techniques for 3D reconstruction are largely based on volumetric scene representations, which require sampling multiple points to compute the color arriving along a ray. Using these representations for more general inverse rendering --- reconstructing geometry, materials, and lighting from observed images --- is challenging because recursively path-tracing such volumetric representations is prohibitively expensive. Recent works alleviate this issue through the use of radiance caches: data structures that store the steady-state, infinite-bounce radiance arriving at any point from any direction. However, these solutions rely on approximations that introduce bias into the renderings and, more importantly, into the gradients used for optimization. We present a method that avoids these approximations, while remaining computationally efficient. In particular, we leverage two techniques to reduce variance for unbiased estimators of the rendering equation: (1) a trainable, occlusion-aware importance sampler for incoming illumination and (2) a fast cache architecture that can be used as a control variate for the radiance from a high-quality, but more expensive, volumetric cache. We show that our approach, and removing these biases, improves the quality of recovered geometry and materials, especially in the presence of effects like specular reflections.
A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
Tahmina Khanam · Mohammed Bennamoun · Guan Wang · Guanjin Wang · Ferdous Sohel · Farid Boussaid · Anuj Srivastava · Hamid Laga
We propose the first comprehensive approach for modeling and analyzing the spatiotemporal shape variability in tree-like 4D objects, i.e., 3D objects whose shapes bend, stretch and change in their branching structure over time as they deform, grow, and interact with their environment. Our key contribution is the representation of tree-like 3D shapes using Square Root Velocity Function Trees (SRVFT). By solving the spatial registration in the SRVFT space, which is equipped with an $\ltwo$ metric, 4D tree-shaped structures become time-parameterized trajectories in this space. This reduces the problem of modeling and analyzing 4D tree-like shapes to that of modeling and analyzing elastic trajectories in the SRVFT space, where elasticity refers to time warping. In this paper, we propose a novel mathematical representation of the shape space of such trajectories, a Riemannian metric on that space, and computational tools for fast and accurate spatiotemporal registration and geodesics computation between 4D tree-shaped structures. Leveraging these building blocks, we develop a full framework for modelling the spatiotemporal variability using statistical models and generating novel 4D tree-like structures from a set of exemplars. We demonstrate and validate the proposed framework using real 4D plant data. The code will be available on Github.
Physics-Based Interaction with 3D Objects via Video Generation
Tianyuan Zhang · Hong-Xing Yu · Rundi Wu · Brandon Y Feng · Changxi Zheng · Noah Snavely · Jiajun Wu · William Freeman
Realistic object interactions are crucial for creating immersive virtual experiences, yet synthesizing realistic 3D object dynamics in response to novel interactions remains a significant challenge. Unlike unconditional or text-conditioned dynamics generation, action-conditioned dynamics requires perceiving the physical material properties of objects and grounding the 3D motion prediction on these properties, such as object stiffness. However, estimating physical material properties is an open problem due to the lack of material ground-truth data, as measuring these properties for real objects is highly difficult. We present PhysDreamer, a physics-based approach that endows static 3D objects with interactive dynamics by leveraging the object dynamics priors learned by video generation models. By distilling these priors, PhysDreamer enables the synthesis of realistic object responses to novel interactions, such as external forces or agent manipulations. We demonstrate our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study. PhysDreamer takes a step towards more engaging and realistic virtual experiences by enabling static 3D objects to dynamically respond to interactive stimuli in a physically plausible manner.
Shape from Heat Conduction
Sriram Narayanan · Mani Ramanagopal · Mark Sheinin · Aswin C. Sankaranarayanan · Srinivasa G. Narasimhan
Thermal cameras measure the temperature of objects based on radiation emitted in the infrared spectrum. In this work, we propose a novel shape recovery approach that exploits the properties of heat transport, specifically heat conduction, induced on objects when illuminated using simple light bulbs. While the resulting heat transport occurs in the entirety of an object's volume, we show a surface approximation that enables shape recovery and empirically analyze its validity for objects with varying thicknesses. We develop an algorithm that solves a linear system of equations to estimate the intrinsic shape Laplacian for the first time from thermal videos along with several properties including heat capacity, convection coefficient, and absorbed heat flux under uncalibrated lighting of arbitrary shapes. Further, we propose a novel shape from Laplacian objective that aims to resolve the inherent shape ambiguities by drawing insights from absorbed heat flux images using two unknown lights sources. Finally, we devise a coarse-to-fine refinement strategy that faithfully recovers low- and high-frequency shape details. We validate our method by showing accurate reconstructions, to within an error of 1-2 mm, in both simulations and from noisy thermal videos of real-world objects with complex shapes and materials.
Rasterized Edge Gradients: Handling Discontinuities Differentially
Stanislav Pidhorskyi · Tomas Simon · Gabriel Schwartz · He Wen · Yaser Sheikh · Jason Saragih
Computing the gradients of a rendering process is paramount for diverse applications in computer vision and graphics. However, accurate computation of these gradients is challenging due to discontinuities and rendering approximations, particularly for surface-based representations and rasterization-based rendering. We present a novel method for computing gradients at visibility discontinuities for rasterization-based differentiable renderers. Our method elegantly simplifies the traditionally complex problem through a carefully designed approximation strategy, allowing for a straightforward, effective, and performant solution. We introduce a novel concept of micro-edges, which allows us to treat the rasterized images as outcomes of a differentiable, continuous process aligned with the inherently non-differentiable, discrete-pixel rasterization. This technique eliminates the necessity for rendering approximations or other modifications to the forward pass, preserving the integrity of the rendered image, which makes it applicable to rasterized masks, depth, and normals images where filtering is prohibitive. Utilizing micro-edges simplifies gradient interpretation at discontinuities and enables handling of geometry intersections, offering an advantage over the prior art. We showcase our method in dynamic human head scene reconstruction, demonstrating effective handling of camera images and segmentation masks.
ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
Denis Zavadski · Johann-Friedrich Feiden · Carsten Rother
The field of image synthesis has made tremendous strides forward in the last years. Besides defining the desired output image with text-prompts, an intuitive approach is to additionally use spatial guidance in form of an image, such as a depth map. In state-of-the-art approaches, this guidance is realized by a separate controlling model that controls a pre-trained image generation network, such as a latent diffusion model. Understanding this process from a control system perspective shows that it forms a feedback-control system, where the control module receives a feedback signal from the generation process and sends a corrective signal back. When analysing existing systems, we observe that the feedback signals are timely sparse and have a small number of bits. As a consequence, there can be long delays between newly generated features and the respective corrective signals for these features. It is known that this delay is the most unwanted aspect of any control system. In this work, we take an existing controlling network (ControlNet) and change the communication between the controlling network and the generation process to be of high-frequency and with large-bandwidth. By doing so, we are able to considerably improve the quality of the generated images, as well as the fidelity of the control. Also, the controlling network needs noticeably fewer parameters and hence is about twice as fast during inference and training time. Another benefit of small-sized models is that they help to democratise our field and are likely easier to understand. We call our proposed network ControlNet-XS. When comparing with the state-of-the-art approaches, we outperform them for pixel-level guidance, such as depth, canny-edges, and semantic segmentation, and are on a par for loose keypoint-guidance of human poses. All code and pre-trained models will be made publicly available.
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
Seung Hyun Lee · Yinxiao Li · Junjie Ke · Innfarn Yoo · Han Zhang · Jiahui Yu · Qifei Wang · Fei Deng · Glenn Entis · Junfeng He · Gang Li · Sangpil Kim · Irfan Essa · Feng Yang
Recent reinforcement learning (RL) works have demonstrated that using multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights poses challenges and may cause over-optimization in certain metrics. To solve this, we propose Parrot, which addresses the issue through multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto optimal. Utilizing batch-wise Pareto optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, resulting in significant improvement of image quality and also allow to control the trade-off of different rewards using a reward related prompt in inference. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
Model Stock: All we need is just a few fine-tuned models
Dong-Hwan Jang · Sangdoo Yun · Dongyoon Han
This paper introduces a novel fine-tuning method for large pre-trained models, offering strong performance with further efficiency. Breaking away from traditional practices that average a multitude of fine-tuned models for accuracy improvements, our approach uses significantly fewer models to optimize final weights yet achieve superior accuracy. Based on the crucial observations of the dynamics in fine-tuned models' weight space, our novel layer-wise averaging technique could surpass state-of-the-art model averaging methods such as Model Soup only with just two fine-tuned models. This strategy can be more aptly coined like Model Stock, reflecting its reliance on selecting very few models to draw a more optimized-averaged model. We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures, achieving remarkable performance on both in-distribution (ID) and out-of-distribution (OOD) tasks on the standard benchmarks, all while barely bringing extra computational demands. Our code and pre-trained models will be made publicly available.