

Poster Session

Poster Session 2

Exhibition Area
Tue 1 Oct 7:30 a.m. PDT — 9:30 a.m. PDT


# 172
Strong Double Blind
Zero-Shot Detection of AI-Generated Images

Davide Cozzolino · Giovanni Poggi · Matthias Niessner · Luisa Verdoliva

Detecting AI-generated images has become an extraordinarily difficult challenge as new generative architectures emerge on a daily basis with more and more capabilities and unprecedented realism. New versions of many commercial tools, such as DALL·E, Midjourney, and Stable Diffusion, have been released recently, and it is impractical to continually update and retrain supervised forensic detectors to handle such a large variety of models. To address this challenge, we propose a zero-shot entropy-based detection method (ZSdet) that neither needs AI-generated training data nor relies on knowledge of generative architectures to artificially synthesize their artifacts. Inspired by recent works on machine-generated text detection, our idea is to measure how surprising the image under analysis is compared to a model of real images. To this end, we rely on a lossless image encoder that is able to estimate the probability distribution of each pixel given its context. To ensure computational efficiency, the encoder has a multi-resolution architecture and contexts comprise mostly pixels of the lower-resolution version of the image. Since only real images are needed to learn the model, the detector is independent of generator architectures and synthetic training data. Using a single discriminative feature, the proposed detector achieves state-of-the-art performance. On a wide variety of generative models it achieves an average improvement of more than 3% over the SoTA in terms of accuracy.
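
As a concrete illustration of the decision rule described above (not the authors' implementation), the sketch below scores an image by its coding cost under a density model trained only on real images and thresholds that single scalar. The density model, reference statistics, and threshold are placeholders.

```python
# Minimal sketch: score an image by its average negative log-likelihood (bits per
# pixel) under a real-image density model, then threshold that single feature.
import numpy as np

def bits_per_pixel(per_pixel_probs: np.ndarray) -> float:
    """Average negative log2-likelihood, given the model's probability of each observed pixel."""
    eps = 1e-12
    return float(-np.log2(np.clip(per_pixel_probs, eps, 1.0)).mean())

def is_ai_generated(per_pixel_probs: np.ndarray, real_bpp: float = 2.5, margin: float = 1.0) -> bool:
    """Flag images whose coding cost deviates markedly from what the real-image model
    assigns to typical real images; real_bpp and margin would be calibrated on real data only."""
    return abs(bits_per_pixel(per_pixel_probs) - real_bpp) > margin

# Toy usage: these probabilities would come from the lossless encoder's predicted
# per-pixel distributions evaluated at the observed pixel values.
rng = np.random.default_rng(0)
well_predicted = rng.uniform(0.1, 0.3, size=(256, 256))      # plausible for a real image
poorly_predicted = rng.uniform(0.005, 0.05, size=(256, 256))  # surprising under the model
print(is_ai_generated(well_predicted), is_ai_generated(poorly_predicted))  # False True
```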


# 22
MobileNetV4: Universal Models for the Mobile Ecosystem

Danfeng Qin · Chas Leichner · Manolis Delakis · Marco Fornoni · Shixin Luo · Fan Yang · Weijun Wang · Colby Banbury · Chengxi Ye · Berkin Akin · Vaibhav Aggarwal · Tenghui Zhu · Daniele Moro · Andrew Howard

We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe was also crafted to improve MNv4 search effectiveness. The integration of UIB, Mobile MQA and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, as well as specialized accelerators like Apple Neural Engine and Google Pixel EdgeTPU - a characteristic not found in any other models tested. Our approach emphasizes simplicity, utilizing standard components and a straightforward attention mechanism to ensure broad hardware compatibility. To further boost efficiency, we finally introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers impressive 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8ms.
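
A rough PyTorch sketch of the Universal Inverted Bottleneck idea follows: an inverted bottleneck whose two optional depthwise convolutions let one block instantiate IB, ConvNext-like, FFN, and Extra-Depthwise variants. Layer sizes and ordering are illustrative, not the released MNv4 implementation.

```python
import torch
import torch.nn as nn

class UIB(nn.Module):
    def __init__(self, c_in, c_out, expand=4, start_dw=True, mid_dw=True, k=3):
        super().__init__()
        c_mid = c_in * expand
        # Optional depthwise conv before expansion (ConvNext-like placement).
        self.start_dw = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in) if start_dw else nn.Identity()
        self.expand = nn.Sequential(nn.Conv2d(c_in, c_mid, 1), nn.BatchNorm2d(c_mid), nn.ReLU())
        # Optional depthwise conv after expansion (classic IB placement); enabling
        # both depthwise convs gives the "ExtraDW" variant, disabling both gives an FFN.
        self.mid_dw = nn.Conv2d(c_mid, c_mid, k, padding=k // 2, groups=c_mid) if mid_dw else nn.Identity()
        self.project = nn.Sequential(nn.Conv2d(c_mid, c_out, 1), nn.BatchNorm2d(c_out))
        self.skip = c_in == c_out

    def forward(self, x):
        y = self.project(self.mid_dw(self.expand(self.start_dw(x))))
        return x + y if self.skip else y

block = UIB(64, 64)                     # ExtraDW-style: both depthwise convs enabled
print(block(torch.randn(1, 64, 32, 32)).shape)
```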


# 146
Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

Nina Weng · Paraskevas Pegios · Eike Petersen · Aasa Feragen · Siavash Bigdeli

Shortcut learning occurs when a model -- e.g. a cardiac disease classifier -- exploits correlations between the target label and a spurious shortcut feature, e.g. a pacemaker, to predict the target label based on the shortcut rather than real discriminative features. This is common in medical imaging, where treatment and clinical annotations correlate with disease labels, making them easy shortcuts to predict disease. We propose a novel method for detecting and quantifying the impact of potential shortcut features via diffusion-based counterfactual image generation that can synthetically remove or add shortcuts. Via a novel self-optimized masking scheme we spatially limit the changes made with no extra inference step, encouraging the removal of spatially constrained shortcut features while ensuring that the shortcut-free counterfactuals preserve their remaining image features to a high degree. Using these, we assess how shortcut features influence model predictions. This is enabled by our second contribution: an efficient diffusion-based counterfactual generation method with a significant inference speed-up at comparable counterfactual quality to the state of the art. We confirm our performance on two large chest X-ray datasets, a skin lesion dataset, and CelebA.


# 24
Strong Double Blind
Adaptive Parametric Activation

Konstantinos P Alexandridis · Jiankang Deng · Anh Nguyen · Shan Luo

The activation function plays a crucial role in model optimisation, yet the optimal choice remains unclear. For example, the Sigmoid activation is the de-facto activation in balanced classification tasks; however, in imbalanced classification it proves inappropriate due to bias towards frequent classes. In this work, we delve deeper into this phenomenon by performing a comprehensive statistical analysis in the classification and intermediate layers of both balanced and imbalanced networks, and we empirically show that aligning the activation function with the data distribution enhances performance in both balanced and imbalanced tasks. To this end, we propose the Adaptive Parametric Activation (APA) function, a novel and versatile activation function that unifies most common activation functions under a single formula. APA can be applied in both intermediate layers and attention layers, significantly outperforming the state-of-the-art on several imbalanced benchmarks such as ImageNet-LT, iNaturalist2018, Places-LT, CIFAR100-LT and LVIS, and balanced benchmarks such as ImageNet1K, COCO and V3DET. Code is provided in the supplementary material.
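
To make the idea of a data-adaptive, sigmoid-like activation concrete, here is a hedged sketch using a generalized-logistic form, (lambda * exp(-kappa * z) + 1)^(-1/lambda), which recovers the standard Sigmoid at lambda = kappa = 1. Treat this form as an assumption about APA's shape; the exact APA formula is defined in the paper.

```python
import torch
import torch.nn as nn

class AdaptiveSigmoid(nn.Module):
    """Learnable sigmoid-like activation (illustrative stand-in for APA)."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.ones(1))    # controls asymmetry / tail behaviour
        self.kappa = nn.Parameter(torch.ones(1))  # controls slope

    def forward(self, z):
        lam = torch.clamp(self.lam, min=1e-3)     # keep the exponent well-defined
        return torch.pow(lam * torch.exp(-self.kappa * z) + 1.0, -1.0 / lam)

act = AdaptiveSigmoid()
z = torch.linspace(-3, 3, 5)
print(act(z))   # at init (lam = kappa = 1) this matches torch.sigmoid(z)
```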


# 77
Strong Double Blind
CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

Wuyang Li · Xinyu Liu · Jiayi Ma · Yixuan Yuan

Open-vocabulary object detection (OVD) utilizes image-level cues to expand the linguistic space of region proposals, thereby facilitating the detection of diverse novel classes. Recent works adapt CLIP embedding by minimizing the object-image and object-text discrepancy combinatorially in a discriminative paradigm. However, they ignore the underlying distribution and the disagreement between the image and text objective, leading to the misaligned distribution between the vision and language sub-space. To address the deficiency, we explore the advanced generative paradigm with distribution perception and propose a novel framework based on the diffusion model, coined Continual Latent Diffusion (CLIFF), which formulates a continual distribution transfer among the object, image, and text latent space probabilistically. CLIFF consists of a Variational Latent Sampler (VLS) enabling the probabilistic modeling and a Continual Diffusion Module (CDM) for the distribution transfer. Specifically, in VLS, we first establish a probabilistic object space with region proposals by estimating distribution parameters. Then, the object-centric noise is sampled from the estimated distribution to generate text embedding for OVD. To achieve this generation process, CDM conducts a short-distance object-to-image diffusion from the sampled noise to generate image embedding as the medium, which guides the long-distance diffusion to generate text embedding. Extensive experiments verify that CLIFF can significantly surpass state-of-the-art methods on benchmarks. Codes will be released to bring insights to the community.


# 145
Dataset Enhancement with Instance-Level Augmentations

Orest Kupyn · Christian Rupprecht

We present a method for expanding a dataset by incorporating knowledge from the wide distribution of pre-trained latent diffusion models. Data augmentations typically incorporate inductive biases about the image formation process into the training (e.g. translation, scaling, colour changes, etc.). Here, we go beyond simple pixel transformations and introduce the concept of instance-level data augmentation by repainting parts of the image at the level of object instances. The method combines a conditional diffusion model with depth and edge maps control conditioning to seamlessly repaint individual objects inside the scene, being applicable to any segmentation or detection dataset. Used as a data augmentation method, it improves the performance and generalization of the state-of-the-art salient object detection, semantic segmentation and object detection models. By redrawing all privacy-sensitive instances (people, license plates, etc.), the method is also applicable for data anonymization. We also release fully synthetic and anonymized expansions for popular datasets: COCO, Pascal VOC and DUTS. All datasets and the code will be released.


# 27
Strong Double Blind
Efficient Bias Mitigation Without Privileged Information

Mateo Espinosa Zarlenga · Sankaranarayanan · Jerone Andrews · Zohreh Shams · Mateja Jamnik · Alice Xiang

Deep neural networks trained via empirical risk minimisation often exhibit significant performance disparities across groups, particularly when group and task labels are spuriously correlated (e.g., "grassy background" and "cows"). Existing bias mitigation methods that aim to address this issue often either rely on group labels for training or validation, or require an extensive hyperparameter search. Such data and computational requirements hinder the practical deployment of these methods, especially when datasets are too large to be group-annotated, computational resources are limited, and models are trained through already complex pipelines. In this paper, we propose Targeted Augmentations for Bias Mitigation (TAB), a simple hyperparameter-free framework that leverages the entire training history of a helper model to identify spurious samples and generate a group-balanced training set from which a robust model can be trained. We show that TAB improves worst-group performance without any group information or model selection, outperforming existing methods while maintaining overall accuracy.
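
A heavily simplified sketch of mining pseudo-groups from a helper model's training history follows: samples whose loss stays low throughout training are treated as bias-aligned, and the remainder are upsampled to form a more group-balanced training set. The exact statistic and split rule used by TAB are in the paper; this is only illustrative.

```python
import numpy as np

def split_by_training_history(loss_history: np.ndarray):
    """loss_history: array of shape (num_epochs, num_samples) from the helper model."""
    area_under_loss = loss_history.mean(axis=0)          # low = learned early/easily
    threshold = np.median(area_under_loss)
    easy = np.where(area_under_loss <= threshold)[0]     # presumed bias-aligned
    hard = np.where(area_under_loss > threshold)[0]      # presumed bias-conflicting
    return easy, hard

def balanced_indices(easy, hard, rng=np.random.default_rng(0)):
    """Resample so both pseudo-groups contribute equally to the new training set."""
    n = max(len(easy), len(hard))
    return np.concatenate([rng.choice(easy, n, replace=True),
                           rng.choice(hard, n, replace=True)])

history = np.abs(np.random.default_rng(1).normal(size=(20, 1000)))  # toy stand-in
easy, hard = split_by_training_history(history)
print(len(balanced_indices(easy, hard)))
```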


# 148
On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines

Selim Kuzucu · Kemal Oksuz · Jonathan Sadeghi · Puneet Dokania

Building calibrated object detectors is a crucial challenge to address for their reliable usage in safety-critical applications. Recent approaches towards this involve (1) designing new loss functions to obtain calibrated detectors by training them from scratch, and (2) post-hoc Temperature Scaling (TS) that learns to scale the likelihood of a trained detector to output calibrated predictions. These approaches are then evaluated based on a combination of Detection Expected Calibration Error (D-ECE) and Average Precision. In this work, via extensive analysis and insights, we highlight that these recent evaluation frameworks, evaluation metrics, and the use of TS have significant drawbacks leading to incorrect conclusions. As a remedy, we propose a principled evaluation framework to jointly measure calibration and accuracy of object detectors. We also tailor efficient and easy-to-use post-hoc calibration approaches, Platt Scaling and Isotonic Regression, specifically to object detection. As opposed to the common notion, our experiments show that, once designed and evaluated properly, post-hoc calibrators, which are extremely cheap to build and use, are much more powerful and effective than the recent train time calibration methods. To illustrate, D-DETR with our post-hoc Isotonic Regression calibrator outperforms the state-of-the-art Cal-DETR by more than 7 D-ECE on the COCO dataset. We also provide improved versions of Localization-aware ECE and show the efficacy of our method on these metrics as well. Code will be made public.
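
A minimal sketch of class-agnostic post-hoc calibration of detection scores with isotonic regression, in the spirit described above: fit a monotone mapping from raw confidences to the empirical probability of a detection being a true positive on a held-out set, then apply it at test time. Matching detections to ground truth (to obtain the true-positive labels) is assumed to have been done already; the data here are synthetic placeholders.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
val_scores = rng.uniform(0, 1, 5000)                    # raw detector confidences (held-out set)
val_is_tp = (rng.uniform(0, 1, 5000) < val_scores**2)   # toy "matched to ground truth" labels

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(val_scores, val_is_tp.astype(float))

test_scores = np.array([0.2, 0.5, 0.9])
print(calibrator.predict(test_scores))                  # calibrated confidences
```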


# 26
Strong Double Blind
Momentum Auxiliary Network for Supervised Local Learning

Junhao Su · Changpeng Cai · Feiyu Zhu · Chenghao He · Xiaojie Xu · Dongzhi Guan · Chenyang Si

Deep neural networks conventionally employ end-to-end backpropagation for their training process, which lacks biological credibility and triggers a locking dilemma during network parameter updates, leading to significant GPU memory use. Supervised local learning instead segments the network into multiple local blocks, each updated by an independent auxiliary network. However, these methods cannot replace end-to-end training due to lower accuracy, as gradients only propagate within their local block, creating a lack of information exchange between blocks. To address this issue and establish information transfer across blocks, we propose a Momentum Auxiliary Network (MAN) that establishes a dynamic interaction mechanism. The MAN leverages an exponential moving average (EMA) of the parameters from adjacent local blocks to enhance information flow. This auxiliary network, updated through EMA, helps bridge the informational gap between blocks. Nevertheless, we observe that directly applying EMA parameters has certain limitations due to feature discrepancies among local blocks. To overcome this, we introduce learnable biases, further boosting performance. We have validated our method on four image classification datasets (CIFAR-10, STL-10, SVHN, ImageNet), attaining superior performance and substantial memory savings. Notably, our method can reduce GPU memory usage by more than 45% on the ImageNet dataset compared to end-to-end training, while achieving higher performance. The Momentum Auxiliary Network thus offers a new perspective for supervised local learning.
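
A schematic PyTorch sketch of the EMA coupling described above: the auxiliary head attached to a local block keeps an exponential moving average of the adjacent block's parameters, plus a small learnable bias to absorb feature discrepancies. Module shapes and wiring are illustrative only, not the authors' architecture.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(aux_module: nn.Module, next_block: nn.Module, momentum: float = 0.999):
    """Pull the auxiliary module's weights toward the adjacent block's weights."""
    for p_aux, p_next in zip(aux_module.parameters(), next_block.parameters()):
        p_aux.mul_(momentum).add_(p_next.detach(), alpha=1.0 - momentum)

class AuxiliaryHead(nn.Module):
    """EMA copy of the next block, a learnable bias, and a local classifier."""
    def __init__(self, block_template: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.ema_block = block_template                    # updated only via ema_update()
        for p in self.ema_block.parameters():
            p.requires_grad_(False)
        self.bias = nn.Parameter(torch.zeros(feat_dim))    # learnable correction term
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, local_features):
        h = self.ema_block(local_features) + self.bias
        return self.classifier(h)

block_k_plus_1 = nn.Linear(128, 128)                       # the adjacent local block
aux = AuxiliaryHead(nn.Linear(128, 128), feat_dim=128, num_classes=10)
ema_update(aux.ema_block, block_k_plus_1)
print(aux(torch.randn(4, 128)).shape)
```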


# 14
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

Maan Qraitem · Kate Saenko · Bryan Plummer

Visual recognition models are prone to learning spurious correlations induced by a biased training set where certain conditions $B$ (e.g., Indoors) are over-represented in certain classes $Y$ (e.g., Big Dogs). Synthetic data from off-the-shelf large-scale generative models offers a promising direction to mitigate this issue by augmenting underrepresented subgroups in the real dataset. However, by using a mixed distribution of real and synthetic data, we introduce another source of bias due to distributional differences between synthetic and real data (e.g., synthetic artifacts). As we will show, prior methods that use synthetic data to resolve the model's bias toward $B$ do not correct the model's bias toward the pair $(B, G)$, where $G$ denotes whether the sample is real or synthetic. Thus, the model could simply learn signals based on the pair $(B, G)$ (e.g., Synthetic Indoors) to make predictions about $Y$ (e.g., Big Dogs). To address this issue, we propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real (FFR). The first step of FFR pre-trains a model on balanced synthetic data to learn robust representations across subgroups. In the second step, FFR fine-tunes the model on real data using ERM or common loss-based bias mitigation methods. By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data and thus avoids the issue of bias toward the pair $(B, G)$. Our experiments show that FFR improves worst-group accuracy over the state of the art by up to 20% across three datasets (CelebA, UTK-Face, and SpuCO Animals).
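
A compact sketch of the two-step recipe described above: pre-train on group-balanced synthetic data, then fine-tune on real data, never mixing the two in one batch. The data loaders, model, and loss below are toy placeholders, not the authors' training setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_stage(model, loader, epochs, loss_fn, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def ffr(model, synthetic_balanced_loader, real_loader):
    # Step 1: learn subgroup-robust representations from balanced synthetic data.
    model = train_one_stage(model, synthetic_balanced_loader, epochs=10, loss_fn=nn.CrossEntropyLoss())
    # Step 2: fine-tune on real data with ERM (or any loss-based bias-mitigation objective).
    model = train_one_stage(model, real_loader, epochs=5, loss_fn=nn.CrossEntropyLoss())
    return model

# Toy usage with random tensors standing in for the two datasets.
toy = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8)
print(ffr(nn.Linear(16, 2), toy, toy))
```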


# 169
Strong Double Blind
Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

Zeyang Zhao · Qilong Xue · Yifan Bai · Yuhang He · Xing Wei · Yihong Gong

This paper introduces the Point-Axis representation for oriented objects in aerial images, as depicted in Figure 1, emphasizing its flexibility and geometrically intuitive nature with two key components: points and axes. 1) Points delineate the spatial extent and contours of objects, providing detailed shape descriptions. 2) Axes define the primary directionalities of objects, providing essential orientation cues crucial for precise detection. The point-axis representation decouples location and rotation, addressing the loss discontinuity issues commonly encountered in traditional bounding box based approaches. For effective optimization without introducing additional annotations, we propose the max-projection loss to supervise point set learning and the cross-axis loss for robust axis representation learning. Further, leveraging this representation, we present the Oriented DETR model, seamlessly integrating the DETR framework for precise point-axis prediction and end-to-end detection. Experimental results demonstrate effectiveness in main datasets, showing significant performance improvements in aerial-oriented object detection tasks. The code will be released to the community.


# 147
Strong Double Blind
Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

Xiuquan Hou · Meiqin Liu · Senlin Zhang · Ping Wei · Badong Chen · Xuguang Lan

This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem in Transformers from a new perspective, suggesting that it arises from the self-attention that introduces no structural bias over inputs. To address this issue, we explore incorporating a position relation prior as attention bias to augment object detection, following the verification of its statistical significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR, introduces an encoder to construct position relation embeddings for progressive attention refinement, which further extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to address the conflict between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a significant improvement (+2.0% AP) on COCO 2017 compared to DINO, the previous highly optimized DETR detector, and achieves state-of-the-art performance (51.7% AP at 12 epochs and 52.1% AP at 24 epochs with a ResNet-50 backbone). Moreover, Relation-DETR exhibits a remarkably fast convergence speed, achieving over 40% AP with only 2 training epochs on COCO 2017 using the basic ResNet-50 backbone, surpassing existing DETR detectors under the same settings. Furthermore, the proposed relation encoder serves as a universal plug-and-play component, bringing clear improvements to, in principle, any DETR-like method. The code will be available.
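
An illustrative sketch of injecting an explicit pairwise position-relation prior into attention, in the spirit of the description above: geometric relations between predicted boxes are embedded and added to the attention logits as a per-head bias. The exact relation features and embedding used by Relation-DETR are in the paper; this is not the authors' encoder.

```python
import torch
import torch.nn as nn

def box_relation_features(boxes):
    """boxes: (N, 4) as (cx, cy, w, h). Returns scale-invariant pairwise relations (N, N, 4)."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = (cx[:, None] - cx[None, :]).abs() / w[:, None]
    dy = (cy[:, None] - cy[None, :]).abs() / h[:, None]
    dw = w[:, None] / w[None, :]
    dh = h[:, None] / h[None, :]
    return torch.log(torch.stack([dx + 1e-3, dy + 1e-3, dw, dh], dim=-1))

class RelationBias(nn.Module):
    """Embed pairwise box relations and add them to multi-head attention logits."""
    def __init__(self, num_heads=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, num_heads))

    def forward(self, boxes, attn_logits):
        """attn_logits: (num_heads, N, N); returns logits biased by box relations."""
        bias = self.mlp(box_relation_features(boxes)).permute(2, 0, 1)
        return attn_logits + bias

boxes = torch.rand(5, 4) + 0.1
logits = torch.zeros(8, 5, 5)
print(RelationBias()(boxes, logits).shape)
```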


# 153
Strong Double Blind
ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images

Xiaoshuai Zhang · Zhicheng Wang · Howard Zhou · Soham Ghosh · Danushen L Gnanapragasam · Varun Jampani · Hao Su · Leonidas Guibas

To advance the state of the art in the creation of 3D foundation models, this paper introduces the ConDense framework for 3D pre-training utilizing existing pre-trained 2D networks and large-scale multi-view datasets. We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline, where 2D-3D feature consistency is enforced through a NeRF-like volume-rendering ray-marching process. Using dense per-pixel features, we are able to 1) directly distill the learned priors from 2D models to 3D models and create useful 3D backbones, 2) extract more consistent and less noisy 2D features, 3) formulate a consistent embedding space where 2D, 3D, and other modalities of data (e.g., natural language prompts) can be jointly queried. Furthermore, besides dense features, ConDense can be trained to extract sparse features (e.g., key points), also with 2D-3D consistency -- condensing 3D NeRF representations into compact sets of decorated key points. We demonstrate that our pre-trained model provides good initialization for various 3D tasks including 3D classification and segmentation, outperforming other 3D pre-training methods by a significant margin. It also enables, by exploiting our sparse features, additional useful downstream tasks, such as matching 2D images to 3D scenes, detecting duplicate scenes, and querying a repository of 3D scenes through natural language -- all quite efficiently and without any per-scene fine-tuning.


# 182
Strong Double Blind
ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation

Hao Tang · Weiyao Wang · Pierre Gleize · Matt Feiszli

Recovering camera poses from a set of images is a foundational task in 3D computer vision, which powers key applications such as 3D scene/object reconstructions. Classic methods often depend on feature correspondence, such as keypoints, which require the input images to have large overlap and small viewpoint changes. Such requirements present considerable challenges in scenarios with sparse views. Recent data-driven approaches aim to directly output camera poses, either through regressing the 6DoF camera poses or formulating rotation as a probability distribution. However, each approach has its limitations. On one hand, directly regressing the camera poses can be ill-posed, since it assumes a single mode, which is not true under symmetry and leads to sub-optimal solutions. On the other hand, probabilistic approaches are capable of modeling the symmetry ambiguity, yet they sample the entire space of rotation uniformly by brute-force. This leads to an inevitable trade-off between high sample density, which improves model precision, and sample efficiency that determines the runtime. In this paper, we propose ADen to unify the two frameworks by employing a generator and a discriminator: the generator is trained to output multiple hypotheses of 6DoF camera pose to represent a distribution and handle multi-mode ambiguity, and the discriminator is trained to identify the hypothesis that best explains the data. This allows ADen to combine the best of both worlds, achieving substantially higher precision as well as lower runtime than previous methods in empirical evaluations.


# 181
COMO: Compact Mapping and Odometry

Eric Dexheimer · Andrew Davison

We present COMO, a real-time monocular mapping and odometry system that encodes dense geometry via a compact set of 3D anchor points. Decoding anchor point projections into dense geometry via per-keyframe depth covariance functions guarantees that depth maps are joined together at visible anchor points. The representation enables joint optimization of camera poses and dense geometry, intrinsic 3D consistency, and efficient second-order inference. To maintain a compact yet expressive map, we introduce a frontend that leverages the covariance function for tracking and initializing potentially visually indistinct 3D points across frames. Altogether, we introduce a real-time system capable of estimating accurate poses and consistent geometry.


# 193
Strong Double Blind
Camera Calibration using a Collimator System

Shunkun Liang · Banglei Guan · Zhenbao Yu · Pengju Sun · Yang Shang

Camera calibration is a crucial step in photogrammetry and 3D vision applications. In practical application scenarios where the camera has a long working distance to cover a wide area, target-based calibration methods become complicated and inflexible due to site limitations. This paper introduces a novel camera calibration method using a designed collimator system, which allows the camera to clearly observe the calibration pattern at close range. Based on the optical geometry of the collimator system, we prove that the motion of the calibration target conforms to a spherical motion model with respect to the camera. The spherical motion constraint reduces the original 6DOF motion to 3DOF pure rotation. Moreover, a closed-form solver for multiple views and a minimal solver for two views are proposed for camera calibration. The performance of the proposed method is tested in both synthetic and real-world experiments, which demonstrate that its calibration accuracy is superior to that of state-of-the-art methods.


# 186
Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection

Kohei Yamashita · Vincent Lepetit · Ko Nishino

Computer vision has long relied on two kinds of correspondences: pixel correspondences in images and 3D correspondences on object surfaces. Is there another kind, and if there is, what can they do for us? In this paper, we introduce correspondences of the third kind we call reflection correspondences and show that they can help estimate camera pose by just looking at objects without relying on the background. Reflection correspondences are point correspondences in the reflected world, i.e., the scene reflected by the object surface. The object geometry and reflectance alter the scene geometrically and radiometrically, respectively, causing incorrect pixel correspondences. Geometry recovered from each image is also hampered by distortions, namely generalized bas-relief ambiguity, leading to erroneous 3D correspondences. We show that reflection correspondences can resolve the ambiguities arising from these distortions. We introduce a neural correspondence estimator and a RANSAC algorithm that fully leverages all three kinds of correspondences for robust and accurate joint camera pose and object shape estimation just from the object appearance. The method expands the horizon of numerous downstream tasks, including camera pose estimation for appearance modeling (e.g., NeRF) and motion estimation of reflective objects (e.g., cars on the road), to name a few, as it removes the requirement of an overlapping background.


# 222
Strong Double Blind
Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition

Satoshi Ikehata · Yuta Asano

In this paper, we present a groundbreaking spectrally multiplexed photometric stereo approach for recovering surface normals of dynamic surfaces without the need for calibrated lighting or sensors, a notable advancement in the field traditionally hindered by stringent prerequisites and spectral ambiguity. By embracing spectral ambiguity as an advantage, our technique enables the generation of training data without specialized multispectral rendering frameworks. We introduce a unique, physics-free network architecture, SpectraM-PS, that effectively processes multiplexed images to determine surface normals across a wide range of conditions and material types, without relying on specific physically-based knowledge. Additionally, we establish the first benchmark dataset, SpectraM14, for spectrally multiplexed photometric stereo, facilitating comprehensive evaluations against existing calibrated methods. Our contributions significantly enhance the capabilities for dynamic surface recovery, particularly in uncalibrated setups, marking a pivotal step forward in the application of photometric stereo across various domains.


# 176
SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

Niklas Gard · Anna Hilsmann · Peter Eisert

In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image and requires minimal scene-specific prior knowledge and no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. The panoramas are rendered from an untextured 3D reference model, which comprises only approximate structural information about room shapes, along with door and window annotations. We demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama and ultimately image-to-model matching. Through a viewport classification score, we rank reference panoramas and select the best match for the query image. Then, a 6D relative pose is estimated between the chosen panorama and query image. Our experiments demonstrate that this approach not only efficiently bridges the domain gap but also generalizes well to previously unseen scenes that are not part of the training data. Moreover, it achieves superior localization accuracy compared to state-of-the-art methods and also estimates more degrees of freedom of the camera pose.


# 210
Strong Double Blind
Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss

Alex Rich · Noah Stier · Pradeep Sen · Tobias Hollerer

Despite significant progress in unsupervised multi-view stereo (MVS), the core loss formulation has remained largely unchanged since its introduction. However, we identify fundamental limitations to this core loss and propose three major changes to improve the modeling of depth priors, occlusion, and view-dependent effects. First, we eliminate prominent stair-stepping and edge artifacts in predicted depth maps using a clamped depth-smoothness constraint. Second, we propose a learned view-synthesis approach to generate an image for photometric loss, avoiding the use of hand-coded heuristics for handling view-dependent effects. Finally, we sample additional views for supervision beyond those used as MVS input, challenging the network to predict depth that matches unseen views. Together, these contributions form an improved supervision strategy we call DIV loss. The key advantage of our DIV loss is that it can be easily dropped into existing unsupervised MVS training pipelines, resulting in significant improvements on competitive reconstruction benchmarks and drastically better qualitative performance around object boundaries for minimal training cost.
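
A hedged sketch of what a clamped depth-smoothness term could look like: a standard edge-aware first-order smoothness penalty whose per-pixel contribution is clamped, so that large, legitimate depth discontinuities are no longer penalized (avoiding over-smoothed, stair-stepped depth). The exact form of the DIV loss term is defined in the paper; the constants below are placeholders.

```python
import torch

def clamped_smoothness_loss(depth, image, clamp_val=0.05):
    """depth: (B, 1, H, W); image: (B, 3, H, W) used for edge-aware weighting."""
    d_dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    d_dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    w_x = torch.exp(-(image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True))
    w_y = torch.exp(-(image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True))
    # Clamp the penalty so true depth edges are not pushed toward zero gradient.
    return (torch.clamp(d_dx, max=clamp_val) * w_x).mean() + (torch.clamp(d_dy, max=clamp_val) * w_y).mean()

print(clamped_smoothness_loss(torch.rand(1, 1, 8, 8), torch.rand(1, 3, 8, 8)))
```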


# 183
Six-Point Method for Multi-Camera Systems with Reduced Solution Space

Banglei Guan · Ji Zhao · Laurent Kneip

Relative pose estimation using point correspondences (PC) is a widely used technique. A minimal configuration of six PCs is required for generalized cameras. In this paper, we present several minimal solvers that use six PCs to compute the 6DOF relative pose of a multi-camera system, including a minimal solver for the generalized camera and two minimal solvers for the practical configuration of two-camera rigs. The equation construction is based on the decoupling of rotation and translation. Rotation is represented by Cayley or quaternion parametrization, and translation can be eliminated by using the hidden variable technique. Ray bundle constraints are found and proven when a subset of PCs relate the same cameras across two views. This is the key to reducing the number of solutions and generating numerically stable solvers. Moreover, all configurations of six-point problems for multi-camera systems are enumerated. Extensive experiments demonstrate that our solvers are more accurate than the state-of-the-art six-point methods, while also being more efficient.
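
A small sketch of the Cayley rotation parametrization mentioned above: a rotation matrix is expressed through three unconstrained parameters, which is what allows the rotation/translation decoupling and hidden-variable elimination to yield polynomial systems. This is standard geometry, not the authors' solver.

```python
import numpy as np

def cayley_to_rotation(c):
    """Map Cayley parameters c = (c1, c2, c3) to a rotation R = (I - [c]x)^-1 (I + [c]x)."""
    c1, c2, c3 = c
    skew = np.array([[0.0, -c3, c2],
                     [c3, 0.0, -c1],
                     [-c2, c1, 0.0]])
    eye = np.eye(3)
    return np.linalg.solve(eye - skew, eye + skew)

R = cayley_to_rotation([0.1, -0.2, 0.05])
print(np.allclose(R @ R.T, np.eye(3)), np.linalg.det(R))   # orthogonal, det = +1
```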


# 185
Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

Eric Brachmann · Jamie Wynn · Shuai Chen · Tommaso Cavallari · Áron Monszpart · Daniyar Turmukhambetov · Victor Adrian Prisacariu

We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Different from other learning-based reconstruction methods, we do not require pose priors nor sequential inputs, and we optimize efficiently over thousands of images. Our method, ACE0, estimates camera poses to an accuracy comparable to feature-based SfM, as demonstrated by novel view synthesis.


# 170
Grounding Image Matching in 3D with MASt3R

Vincent Leroy · Yohann Cabon · Jerome Revaud

Image Matching is a core component of all best-performing algorithms and pipelines in 3D vision. Yet despite matching being fundamentally a 3D problem, intrinsically linked to camera pose and scene geometry, it is typically treated as a 2D problem. This makes sense as the goal of matching is to establish correspondences between 2D pixel fields, but also seems like a potentially hazardous choice. In this work, we take a different stance and propose to cast matching as a 3D task with DUSt3R, a recent and powerful 3D reconstruction framework based on Transformers. Based on pointmap regression, this method displayed impressive robustness in matching views with extreme viewpoint changes, yet with limited accuracy. We aim here to improve the matching capabilities of such an approach while preserving its robustness. We thus propose to augment the DUSt3R network with a new head that outputs dense local features, trained with an additional matching loss. We further address the issue of quadratic complexity of dense matching, which becomes prohibitively slow for downstream applications if not treated carefully. We introduce a fast reciprocal matching scheme that not only accelerates matching by orders of magnitude, but also comes with theoretical guarantees and, lastly, yields improved results. Extensive experiments show that our approach, coined MASt3R, significantly outperforms the state of the art on multiple matching tasks. In particular, it beats the best published methods by 230% (relative improvement) in VCRE Precision on the extremely challenging Map-free localization dataset.
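
For reference, here is a simple sketch of reciprocal (mutual) nearest-neighbour matching between two dense descriptor maps, i.e. the brute-force baseline that the fast scheme above accelerates. Descriptors are assumed L2-normalized; this is not MASt3R's optimized implementation.

```python
import numpy as np

def reciprocal_matches(desc1, desc2):
    """desc1: (N, D), desc2: (M, D). Returns index pairs that are mutual nearest neighbours."""
    sim = desc1 @ desc2.T                      # cosine similarity for unit-norm descriptors
    nn12 = sim.argmax(axis=1)                  # best match in image 2 for each point in image 1
    nn21 = sim.argmax(axis=0)                  # best match in image 1 for each point in image 2
    idx1 = np.arange(desc1.shape[0])
    keep = nn21[nn12] == idx1                  # keep only mutually consistent pairs
    return np.stack([idx1[keep], nn12[keep]], axis=1)

rng = np.random.default_rng(0)
d1 = rng.normal(size=(500, 64)); d1 /= np.linalg.norm(d1, axis=1, keepdims=True)
d2 = rng.normal(size=(600, 64)); d2 /= np.linalg.norm(d2, axis=1, keepdims=True)
print(reciprocal_matches(d1, d2).shape)
```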


# 151
EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

Shuai Tan · Bin Ji · Mengxiao Bi · Ye Pan

Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment of the talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal inputs—both aspects often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk.


# 158
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

Jingye Chen · Yupan Huang · Tengchao Lv · Lei Cui · Qifeng Chen · Furu Wei

The diffusion model has been proven a powerful generative model in recent years, yet it remains a challenge in generating visual text. Although existing work has endeavored to enhance the accuracy of text rendering, these methods still suffer from several drawbacks, such as (1) limited flexibility and automation, (2) constrained capability of layout prediction, and (3) restricted diversity. In this paper, we present TextDiffuser-2, aiming to unleash the power of language models for text rendering while taking these three aspects into account. Firstly, we fine-tune a large language model for layout planning. The large language model is capable of automatically generating keywords and placing the text in optimal positions for text rendering. Secondly, we utilize the language model within the diffusion model to encode the position and content of keywords at the line level. Unlike previous methods that employed tight character-level guidance, our approach generates more diverse text images. We conduct extensive experiments and incorporate user studies involving human participants and GPT-4V, validating TextDiffuser-2's capacity to achieve a more rational text layout and generation with enhanced diversity. Furthermore, the proposed methods are compatible with existing text rendering techniques, such as TextDiffuser and GlyphControl, serving to enhance automation and diversity, as well as augment the rendering accuracy. For instance, by using the proposed layout planner, TextDiffuser is capable of rendering text with more aesthetically pleasing line breaks and alignment, meanwhile obviating the need for explicit keyword specification. Furthermore, GlyphControl can leverage the layout planner to achieve diverse layouts without the necessity for user-specified glyph images, and the rendering F-measure can be boosted by 6.51% when using the proposed layout encoding training technique. The code and model will be available to promote future research.


# 149
Accelerating Image Generation with Sub-path Linear Approximation Model

Chen Xu · Tianhui Song · Weixin Feng · Xubin Li · Tiezheng Ge · Bo Zheng · Limin Wang

Diffusion models have significantly advanced the state of the art in image, audio, and video generation tasks. However, their applications in practical scenarios are hindered by slow inference speed. Drawing inspiration from the approximation strategies utilized in consistency models, we propose the Sub-path Linear Approximation Model (SLAM), which accelerates diffusion models while maintaining high-quality image generation. SLAM treats the PF-ODE trajectory as a series of PF-ODE sub-paths divided by sampled points, and harnesses sub-path linear (SL) ODEs to form a progressive and continuous error estimation along each individual PF-ODE sub-path. The optimization on such SL-ODEs allows SLAM to construct denoising mappings with smaller cumulative approximated errors. An efficient distillation method is also developed to facilitate the incorporation of more advanced diffusion models, such as latent diffusion models. Our extensive experimental results demonstrate that SLAM achieves an efficient training regimen, requiring only 6 A100 GPU days to produce a high-quality generative model capable of 2 to 4-step generation with high performance. Comprehensive evaluations on LAION, MS COCO 2014, and MS COCO 2017 datasets also illustrate that SLAM surpasses existing acceleration methods in few-step generation tasks, achieving state-of-the-art performance both on FID and the quality of the generated images.


# 152
SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation

Heyuan Li · Ce Chen · Tianhao Shi · Yuda Qiu · Sizhe An · Guanying Chen · XIAOGUANG HAN

While recent advances in 3D-aware Generative Adversarial Networks (GANs) have aided the development of near-frontal view human face synthesis, the challenge of comprehensively synthesizing a full 3D head viewable from all angles still persists. Although PanoHead proves the possibility of using a large-scale dataset with images of both frontal and back views for full-head synthesis, it often causes artifacts for back views. Based on our in-depth analysis, we found the reasons are mainly twofold. First, from the network-architecture perspective, we found that each plane in the utilized tri-plane/tri-grid representation space tends to confuse the features from both sides, causing "mirroring" artifacts (e.g., the glasses appear in the back). Second, from the data-supervision aspect, we found that existing discriminator training in 3D GANs only focuses on the quality of the rendered image itself, and does not care about its plausibility with respect to the perspective from which it was rendered. This makes it possible to generate a "face" in a non-frontal view, since such images easily fool the discriminator. In response, we propose SphereHead, a novel tri-plane representation in the spherical coordinate system that fits the human head's geometric characteristics and efficiently mitigates many of the generated artifacts. We further introduce a view-image consistency loss for the discriminator to emphasize the correspondence of the camera labels and the images. The combination of these efforts results in visually superior outcomes with significantly fewer artifacts. Our code and dataset are publicly available at https://lhyfst.github.io/spherehead/.


# 150
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Changan Chen · Puyuan Peng · Ami Baid · Zihui Xue · Wei-Ning Hsu · David Harwath · Kristen Grauman

Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals---resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.


# 157
LLMGA: Multimodal Large Language Model based Generation Assistant

Bin Xia · Shiyin Wang · Yingfan Tao · Yitong Wang · Jiaya Jia

In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting & outpainting, and instruction-based editing. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during inpainting and outpainting. Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications in an interactive manner.


# 159
FlashTex: Fast Relightable Mesh Texturing with LightControlNet

Kangle Deng · Timothy Omernick · Alexander B Weiss · Deva Ramanan · Jun-Yan Zhu · Tinghui Zhou · Maneesh Agrawala

Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. Our method introduces LightControlNet, a new text-to-image model based on the ControlNet architecture, that allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. We show that this pipeline is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures.


# 154
Strong Double Blind
Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture

ShahRukh Athar · Shunsuke Saito · Stanislav Pidhorskyi · Zhengyu Yang · Chen Cao

Creating photorealistic avatars for individuals traditionally involves extensive capture sessions with complex and expensive studio devices like the LightStage system. While recent strides in neural representations have enabled the generation of photorealistic and animatable 3D avatars from quick phone scans, they have baked-in capture-time lighting, lack facial details and have missing regions in areas such as the back of the ears. Thus, they lag in quality compared to studio-captured avatars. In this paper, we propose a method to bridge this gap by generating studio-like illuminated texture maps from short, monocular phone captures. We do this by parameterizing the phone texture maps using the W+ space of a StyleGAN2, enabling near-perfect reconstruction. Then, we finetune a StyleGAN2 by sampling in the W+ parameterized space using a very small set of studio-captured textures as an adversarial training signal. To further enhance the realism and accuracy of facial details, we super-resolve the output of the StyleGAN2 using a diffusion model conditioned on image gradients of the phone-captured texture map. Once trained, our method excels at producing studio-like facial texture maps from casual monocular smartphone videos. Demonstrating its capabilities, we showcase the generation of photorealistic, uniformly lit, complete avatars from monocular phone captures.


# 156
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation

Yufei Liu · Junwei Zhu · Junshu Tang · Shijie Zhang · Jiangning Zhang · Weijian Cao · Chengjie Wang · Yunsheng Wu · Dongjin Huang

Texturing 3D humans with semantic UV maps remains a challenge due to the difficulty of acquiring reasonably unfolded UV. Despite recent text-to-3D advancements in supervising multi-view renderings using large text-to-image (T2I) models, issues persist with generation speed, text consistency, and texture quality, resulting in data scarcity among existing datasets. We present TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture generation model. Utilizing an efficient texture adaptation finetuning strategy, we adapt a large T2I model to a semantic UV structure while preserving its original generalization capability. Leveraging a novel feature translator module, the trained model is capable of generating high-fidelity 3D human textures from either text or image within seconds. Furthermore, we introduce ArTicuLated humAn textureS (ATLAS), the largest high-resolution (1,024 × 1,024) 3D human texture dataset, which contains 50k high-fidelity textures with text descriptions. Our dataset and model will be available for research purposes.


# 283
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Linrui Tian · Qi Wang · Bang Zhang · Liefeng Bo

In this work, we tackle the challenge of enhancing the realism and expressiveness in talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques that often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. Our method ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations. Experimental results demonstrate that EMO is able to produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism.


# 287
Strong Double Blind
EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Qianyun He · Xinya Ji · Yicheng Gong · Yuanxun Lu · Zhengyu Diao · Linjia Huang · Yao Yao · Siyu Zhu · Zhan Ma · Xu Songcen · Xiaofei Wu · Zixiao Zhang · Xun Cao · Hao Zhu

We present a novel approach for synthesizing emotion-controllable 3D talking heads, featuring enhanced lip synchronization and rendering quality. Despite significant progress in the field, prior methods still suffer from a lack of multi-view consistency and emotional expressiveness. To address these issues, we collect the EmoTalk3D dataset with calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. By training on the EmoTalk3D dataset, we propose a 'Speech-to-Geometry-to-Appearance' mapping framework that first predicts a faithful 3D geometry sequence from the audio features; then the appearance of a 3D talking head, represented by 4D Gaussians, is synthesized from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, learned from multi-view videos, and fused to render free-view talking head animation. Moreover, our model extracts emotion labels from the input speech and enables controllable emotion in the generated talking heads. Our method exhibits improved rendering quality and stability in lip motion generation while capturing dynamic facial details such as wrinkles and subtle expressions. Experiments demonstrate the effectiveness of our approach in generating high-fidelity and emotion-controllable 3D talking heads. The code and EmoTalk3D dataset will be publicly released upon publication.


# 290
Strong Double Blind
3D Gaussian Parametric Head Model

Yuelang Xu · Lizhen Wang · Zerong Zheng · Zhaoqi Su · Yebin Liu

Creating high-fidelity 3D human head avatars is crucial for applications in VR/AR, telepresence, digital human interfaces, and film production. Recent advances have leveraged morphable face models to generate animated head avatars from easily accessible data, representing varying identities and expressions within a low-dimensional parametric space. However, existing methods often struggle with modeling complex appearance details, e.g., hairstyles and accessories, and suffer from low rendering quality and efficiency. This paper introduces a novel approach, 3D Gaussian Parametric Head Model, which employs 3D Gaussians to accurately represent the complexities of the human head, allowing precise control over both identity and expression. Additionally, it enables seamless face portrait interpolation and the reconstruction of detailed head avatars from a single image. Unlike previous methods, the Gaussian model can handle intricate details, enabling realistic representations of varying appearances and complex expressions. Additionally, this paper presents a well-designed training framework to ensure smooth convergence, providing a robust guarantee for learning the rich content. Our method achieves high-quality, photo-realistic rendering with real-time efficiency, making it a valuable contribution to the field of parametric head models.


# 286
Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos

Ekta Prashnani · Koki Nagano · Shalini De Mello · David P Luebke · Orazio Gallo

Modern avatar generators allow anyone to synthesize photorealistic real-time talking avatars, ushering in a new era of avatar-based human communication, such as with immersive AR/VR interactions or videoconferencing with limited bandwidths. Their safe adoption, however, requires a mechanism to verify if the rendered avatar is trustworthy: does it use the appearance of an individual without their consent? We term this task avatar fingerprinting. To tackle it, we first introduce a large-scale dataset of real and synthetic videos of people interacting on a video call, where the synthetic videos are generated using the facial appearance of one person and the expressions of another. We verify the identity driving the expressions in a synthetic video, by learning motion signatures that are independent of the facial appearance shown. Our solution, the first in this space, achieves an average AUC of 0.85. Critical to its practical use, it also generalizes to new generators never seen in training (average AUC of 0.83).


# 294
Strong Double Blind
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

Bowen Zhang · Yiji Cheng · Chunyu Wang · Ting Zhang · Jiaolong Yang · Yansong Tang · Feng Zhao · DONG CHEN · Baining Guo

We address the task of generating high-fidelity 3D avatars, represented as triplanes, from a frontal view portrait image. Existing methods struggle to capture intricate details such as cloth textures and hairstyles, which we tackle in this paper. Specifically, we first identify an overlooked problem of catastrophic forgetting that arises when fitting triplanes sequentially on a large number of avatars, caused by the MLP decoder sharing scheme. To overcome this issue, we introduce a novel data scheduling strategy called task replay and a weight consolidation regularization term, which effectively improves the decoder's capability of rendering sharper details and unleashes the full power of triplanes for high-fidelity generation. Additionally, we maximize the guiding effect of the conditional portrait image by computing a finer-grained hierarchical representation that captures rich 2D texture cues, and injecting them into the 3D diffusion model at multiple layers via cross-attention. When trained on 46K avatars with a noise schedule optimized for triplanes, the resulting model is capable of generating 3D avatars with notably better details than previous methods and can generalize to in-the-wild portrait input. See the paper's teaser figure for examples.


# 293
PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations

Yang Zheng · Qingqing Zhao · Guandao Yang · Wang Yifan · Donglai Xiang · Florian Dubost · Dmitry Lagun · Thabo Beeler · Federico Tombari · Leonidas Guibas · Gordon Wetzstein

Modeling and rendering photorealistic avatars is of crucial importance in many applications. Existing methods that build a 3D avatar from visual observations, however, struggle to reconstruct clothed humans. We introduce PhysAvatar, a novel framework that combines inverse rendering with inverse physics to automatically estimate the shape and appearance of a human from multi-view video data along with the physical parameters of the fabric of their clothes. For this purpose, we adopt a mesh-aligned 4D Gaussian technique for spatio-temporal mesh tracking as well as a physically based inverse renderer to estimate the intrinsic material properties. PhysAvatar integrates a physics simulator to estimate the physical parameters of the garments using gradient-based optimization in a principled manner. These novel capabilities enable PhysAvatar to create high-quality novel-view renderings of avatars dressed in loose-fitting clothes under motions and lighting conditions not seen in the training data. This marks a significant advancement towards modeling photorealistic digital humans using physically based inverse rendering with physics in the loop.


# 292
Strong Double Blind
COMPOSE: Comprehensive Portrait Shadow Editing

Andrew Hou · Zhixin Shu · Xuaner Zhang · He Zhang · Yannick Hold-Geoffroy · Jae Shin Yoon · Xiaoming Liu

Existing portrait relighting methods struggle with precise control over facial shadows, particularly when handling hard shadows from directional light sources or adjusting shadows while remaining in harmony with existing lighting conditions. In many situations, completely altering the input lighting is undesirable for portrait retouching applications: one may want to preserve some authenticity of the captured environment. Existing shadow editing methods typically restrict their application to the facial region and often offer limited lighting control options, such as shadow softening or rotation. In this paper, we introduce COMPOSE: a novel shadow editing pipeline for human portraits, offering precise control over shadow attributes such as shape, intensity, and position, all while preserving the original environmental illumination of the portrait. This level of disentanglement and controllability is obtained thanks to a novel decomposition of the environment map representation into ambient light and an editable Gaussian dominant light source. COMPOSE is a four-stage pipeline consisting of light estimation and editing, light diffusion, shadow synthesis, and finally shadow editing. We define facial shadows as the result of a dominant light source, encoded using our novel Gaussian environment map representation. Utilizing an OLAT dataset, we have trained models to: (1) predict this light source representation from images, and (2) generate realistic shadows using this representation. We also demonstrate comprehensive and intuitive shadow editing with our pipeline. Extensive quantitative and qualitative evaluations demonstrate the robust shadow-editing capability of our system.


# 234
Strong Double Blind
GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval

Han Zhou · Wei Dong · Xiaohong Liu · Shuaicheng Liu · Xiongkuo Min · Guangtao Zhai · Jun Chen

The majority of existing Low-light Image Enhancement (LLIE) methods attempt to directly learn the mapping from Low-Light (LL) to Normal-Light (NL) images or rely on semantic or illumination maps to guide such learning. However, the inherent ill-posed nature of LLIE, coupled with the additional difficulty of retrieving semantic information from significantly impaired inputs, compromises the quality of the enhanced outputs, especially in extremely low-light environments. To address this issue, we present a new LLIE network via Generative LAtent feature based codebook REtrieval (GLARE) to improve the visibility of LL images. The codebook prior is derived from undegraded NL images using a Vector Quantization (VQ) strategy. However, simply adopting the codebook prior does not necessarily ensure alignment between LL and NL features. Therefore, we develop an Invertible Latent Normalizing Flow (I-LNF) to generate features aligned with NL latent representations, guaranteeing correct code matching in the codebook. In addition, a novel Adaptive Feature Transformation (AFT) module, which contains an Adaptive Mix-up Block (AMB) and a dual-decoder architecture, is devised to further improve fidelity while maintaining the realistic details provided by the codebook prior. Extensive experiments verify the superior performance of GLARE. More importantly, its application to low-light object detection demonstrates the effectiveness of our method as a pre-processing tool for high-level vision tasks. Code will be released upon publication.
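
As an aside, the codebook retrieval at the heart of GLARE-style pipelines boils down to a nearest-neighbour lookup in a learned VQ codebook. The sketch below illustrates only this generic lookup step; the `retrieve_codes` helper and all tensor shapes are illustrative assumptions rather than the authors' implementation, and GLARE's I-LNF alignment and AFT decoding are not shown.

```python
import torch

def retrieve_codes(latent, codebook):
    """Nearest-neighbour lookup of latent vectors in a VQ codebook.

    latent:   (N, D) features produced for a low-light image
    codebook: (K, D) code vectors learned from normal-light images
    Returns the quantized features and the selected code indices.
    """
    dists = torch.cdist(latent, codebook)   # (N, K) pairwise Euclidean distances
    idx = dists.argmin(dim=1)               # (N,) index of the closest code
    quantized = codebook[idx]               # (N, D) retrieved NL-domain features
    return quantized, idx

# Toy usage: 4096 latent vectors matched against a 1024-entry, 256-d codebook.
latent = torch.randn(4096, 256)
codebook = torch.randn(1024, 256)
quantized, idx = retrieve_codes(latent, codebook)
```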


# 233
Optimizing Illuminant Estimation in Dual-Exposure HDR Imaging

Mahmoud Afifi · Zhenhua Hu · Liang Liang

High dynamic range (HDR) imaging involves capturing a series of frames of the same scene, each with different exposure settings, to broaden the dynamic range of light. This can be achieved through burst capturing or by using staggered HDR sensors that capture long and short exposures simultaneously in the camera image signal processor (ISP). Within the camera ISP pipeline, illuminant estimation is a crucial step that aims to estimate the color of the global illuminant in the scene. This estimate is used in the camera ISP's white-balance module to remove undesirable color casts from the final image. Despite the multiple frames captured in the HDR pipeline, conventional illuminant estimation methods often rely on only a single frame of the scene. In this paper, we explore leveraging information from frames captured with different exposure times. Specifically, we introduce a simple feature extracted from dual-exposure images to guide illuminant estimators, referred to as the dual-exposure feature (DEF). To validate the efficiency of DEF, we employ two illuminant estimators built on it: 1) a multilayer perceptron network (MLP), referred to as the exposure-based MLP (EMLP), and 2) a modified version of convolutional color constancy (CCC) that integrates our DEF, which we call ECCC. Both EMLP and ECCC achieve promising results, in some cases surpassing prior methods that require hundreds of thousands or millions of parameters, with only a few hundred parameters for EMLP and a few thousand for ECCC.
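
To give a sense of how small such an estimator can be, the sketch below builds a toy MLP with roughly 160 parameters that maps a dual-exposure descriptor to a unit-norm illuminant RGB estimate. The `dual_exposure_feature` helper is a hypothetical placeholder (per-channel means of the two frames), not the paper's DEF, and `TinyEMLP` is not the authors' EMLP architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder stand-in for the dual-exposure feature (DEF); the paper's exact
# feature is not reproduced here. We simply concatenate per-channel means of
# the short- and long-exposure frames into a 6-D descriptor.
def dual_exposure_feature(short_img, long_img):
    return torch.cat([short_img.mean(dim=(-2, -1)), long_img.mean(dim=(-2, -1))], dim=-1)

# A tiny illuminant estimator in the spirit of EMLP: a few hundred parameters
# mapping the descriptor to a unit-norm illuminant RGB vector.
class TinyEMLP(nn.Module):
    def __init__(self, in_dim=6, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, feat):
        return F.normalize(self.net(feat), dim=-1)

short = torch.rand(3, 64, 64)   # short-exposure RGB frame
long_ = torch.rand(3, 64, 64)   # long-exposure RGB frame
model = TinyEMLP()
illuminant = model(dual_exposure_feature(short, long_))
print(sum(p.numel() for p in model.parameters()), illuminant)   # ~160 parameters
```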


# 225
Strong Double Blind
Holodepth: Programmable Depth-Varying Projection via Computer-Generated Holography

Dorian Chan · Matthew O'Toole · Sizhuo Ma · Jian Wang

Typical projectors are designed to programmably display 2D content at a single depth. In this work, we explore how to engineer a depth-varying projector system that is capable of forming desired patterns at multiple depths. To this end, we leverage a holographic approach, but a naive implementation of such a system is limited in its depth programmability. Inspired by recent work in near-eye displays, we add a lens array to a holographic projector to maximize the depth variation of the projected content, for which we propose an optimization-driven calibration method. We demonstrate a number of applications using this system, including novel 3D interfaces for future wearables, privacy-preserving projection, depth sensing, and light curtains.


# 230
Strong Double Blind
BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream

Wenpu Li · Pian Wan · Peng Wang · Jinghang Li · Yi Zhou · Peidong Liu

Neural implicit representations of visual scenes have attracted much attention in recent computer vision and graphics research. Most prior methods focus on reconstructing a 3D scene representation from a set of images. In this work, we demonstrate the possibility of recovering neural radiance fields (NeRF) from a single blurry image and its corresponding event stream. We model the camera motion with a cubic B-spline in SE(3) space. Both the blurry image and the brightness changes within a time interval can then be synthesized from the 3D scene representation, given the 6-DoF poses interpolated from the cubic B-spline. Our method jointly learns the implicit neural scene representation and recovers the camera motion by minimizing the differences between the synthesized data and the real measurements, without pre-computed camera poses from COLMAP. We evaluate the proposed method on both synthetic and real datasets. The experimental results demonstrate that we are able to render view-consistent latent sharp images from the learned NeRF and bring a blurry image alive in high quality.


# 208
Strong Double Blind
VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors

Sungwon Hwang · Min-Jung Kim · Taewoong Kang · Jayeon Kang · Choo Jaegul

Neural rendering-based urban scene reconstruction methods commonly rely on images collected from driving vehicles with cameras facing and moving forward. Although these methods can successfully synthesize views similar to the training camera trajectory, directing the novel view outside the training camera distribution does not guarantee on-par performance. In this paper, we tackle the Extrapolated View Synthesis (EVS) problem by evaluating reconstructions on views such as looking left, right, or downward with respect to the training camera distribution. To improve rendering quality for EVS, we initialize our model by constructing a dense LiDAR map, and propose leveraging prior scene knowledge such as a surface normal estimator and a large-scale diffusion model. Qualitative and quantitative comparisons demonstrate the effectiveness of our method on EVS. To the best of our knowledge, we are the first to address the EVS problem in urban scene reconstruction. We will release the code upon acceptance.


# 206
Strong Double Blind
G3R: Gradient Guided Generalizable Reconstruction

Yun Chen · Jingkang Wang · Ze Yang · Sivabalan Manivasagam · Raquel Urtasun

Large scale 3D scene reconstruction is important for applications such as virtual reality and simulation. Existing neural rendering approaches (e.g., NeRF, 3DGS) have achieved realistic reconstructions on large scenes, but optimize per scene, which is expensive and slow, and exhibit noticeable artifacts under large view changes due to overfitting. Generalizable approaches are fast, but primarily work for small scenes/objects and often produce lower quality rendering results. In this work, we introduce G3R, a generalizable reconstruction approach that can efficiently predict high-quality 3D scene representations for large scenes. We propose to learn a reconstruction network that takes the gradient feedback signals from differentiable rendering to iteratively update a 3D scene representation, combining the benefits of high photorealism from per-scene optimization with data-driven priors from fast feed-forward prediction methods. Experiments on large-scale urban-driving and drone datasets show that G3R accelerates the reconstruction process by at least 10x while achieving comparable or better realism compared to 3DGS, and also being more robust to large view changes.


# 214
Strong Double Blind
Efficient NeRF Optimization - Not All Samples Remain Equally Hard

Juuso Korhonen · Goutham Rangu · Hamed Rezazadegan Tavakoli · Juho Kannala

We propose an application of online hard sample mining for efficient training of Neural Radiance Fields (NeRF). NeRF models produce state-of-the-art quality for many 3D reconstruction and rendering tasks but require substantial computational resources. The encoding of scene information within the NeRF network parameters necessitates stochastic sampling. We observe that during training, a major part of the compute time and memory usage is spent on processing already learnt samples, which no longer affect the model update significantly. We identify the backward pass on the stochastic samples as the computational bottleneck of the optimization. We thus perform the first forward pass in inference mode as a relatively low-cost search for hard samples. This is followed by building the computational graph and updating the NeRF network parameters using only the hard samples. To demonstrate the effectiveness of the proposed approach, we apply our method to Instant-NGP, resulting in significant improvements in view-synthesis quality over the baseline (a 1 dB average improvement at equal training time, or a 2x speedup to reach the same PSNR level), along with 40% memory savings from building the computational graph with only the hard samples. As our method only interfaces with the network module, we expect it to be widely applicable.
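
The training-step logic described above can be summarized in a few lines: a gradient-free forward pass ranks the rays by photometric error, and only the hardest subset participates in the backward pass. The sketch below is a generic PyTorch illustration under assumed names (`hard_sample_step`, a stand-in MLP instead of Instant-NGP), not the authors' code.

```python
import torch
import torch.nn as nn

def hard_sample_step(model, rays, targets, optimizer, keep_ratio=0.25):
    """One training step that backpropagates only through the hardest rays.

    rays: (N, D) ray encodings, targets: (N, 3) ground-truth colors.
    """
    # 1) Cheap search pass: no graph is built, so memory and compute stay low.
    with torch.no_grad():
        per_ray_loss = (model(rays) - targets).pow(2).mean(dim=-1)   # (N,)

    # 2) Keep only the hardest rays (highest photometric error).
    k = max(1, int(keep_ratio * rays.shape[0]))
    hard_idx = per_ray_loss.topk(k).indices

    # 3) Full forward + backward pass on the hard subset only.
    optimizer.zero_grad()
    loss = (model(rays[hard_idx]) - targets[hard_idx]).pow(2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a stand-in MLP in place of a real NeRF network.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
rays, targets = torch.randn(8192, 32), torch.rand(8192, 3)
hard_sample_step(model, rays, targets, opt)
```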


# 229
BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling

Cheng Peng · Yutao Tang · Yifan Zhou · Nengyu Wang · Xijun Liu · Deming Li · Rama Chellappa

Recent efforts in using 3D Gaussians for scene reconstruction and novel view synthesis can achieve impressive results on curated benchmarks; however, images captured in real life are often blurry. In this work, we analyze the robustness of Gaussian-Splatting-based methods against various types of image blur, such as motion blur, defocus blur, and downscaling blur. Under these degradations, Gaussian-Splatting-based methods tend to overfit and produce worse results than Neural-Radiance-Field-based methods. To address this issue, we propose Blur Agnostic Gaussian Splatting (BAGS). BAGS introduces additional 2D modeling capacity such that a 3D-consistent and high-quality scene can be reconstructed despite image-wise blur. Specifically, we model blur by estimating per-pixel convolution kernels from a Blur Proposal Network (BPN). The BPN is designed to consider spatial, color, and depth variations of the scene to maximize modeling capacity. Additionally, the BPN proposes a quality-assessing mask, which indicates regions where blur occurs. Finally, we introduce a coarse-to-fine kernel optimization scheme; this scheme is fast and avoids sub-optimal solutions due to the sparse point cloud initialization that often occurs when applying Structure-from-Motion to blurry images. We demonstrate that BAGS achieves photorealistic renderings under various challenging blur conditions and imaging geometries, while significantly improving upon existing approaches.
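
The per-pixel blur modeling can be pictured as a spatially varying convolution applied to the sharp rendering. The sketch below shows one plausible way to apply such kernels with `torch.nn.functional.unfold`; the kernel values here are random placeholders rather than BPN predictions, and the function name is an assumption.

```python
import torch
import torch.nn.functional as F

def apply_per_pixel_kernels(image, kernels):
    """Blur a rendered image with spatially varying kernels.

    image:   (1, C, H, W) sharp rendering from the 3D model
    kernels: (1, k*k, H, W) per-pixel kernels (assumed non-negative, summing to 1)
    """
    k2 = kernels.shape[1]
    k = int(k2 ** 0.5)
    C = image.shape[1]
    # Extract a k x k neighborhood around every pixel.
    patches = F.unfold(image, kernel_size=k, padding=k // 2)   # (1, C*k*k, H*W)
    patches = patches.view(1, C, k2, -1)                       # (1, C, k*k, H*W)
    w = kernels.view(1, 1, k2, -1)                             # (1, 1, k*k, H*W)
    blurred = (patches * w).sum(dim=2)                         # (1, C, H*W)
    return blurred.view_as(image)

image = torch.rand(1, 3, 64, 64)
kernels = torch.softmax(torch.randn(1, 25, 64, 64), dim=1)     # random 5x5 kernels
blurred = apply_per_pixel_kernels(image, kernels)
```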


# 219
Strong Double Blind
SlotLifter: Slot-guided Feature Lifting for Learning Object-Centric Radiance Fields

Yu Liu · Baoxiong Jia · Yixin Chen · Siyuan Huang

The ability to distill object-centric abstractions from intricate visual scenes underpins human-level generalization. Despite the significant progress in object-centric learning methods, learning object-centric representations in the 3D physical world remains a crucial challenge. In this work, we propose SlotLifter, a novel object-centric radiance model that aims to address the challenges of scene reconstruction and decomposition via slot-guided feature lifting. Such a design unites object-centric learning representations and image-based rendering methods, offering state-of-the-art performance in scene decomposition and novel-view synthesis on four challenging synthetic and four complex real-world datasets, outperforming existing 3D object-centric learning methods by a large margin. Through extensive ablative studies, we showcase the efficacy of each design in SlotLifter, shedding light on key insights for potential future directions.


# 226
Strong Double Blind
RS-NeRF: Neural Radiance Fields from Rolling Shutter Images

Muyao Niu · Tong Chen · Yifan Zhan · Zhuoxiao Li · Xiang Ji · zheng yinqiang

Neural Radiance Fields (NeRF) have become increasingly popular for their ability to reconstruct 3D scenes and create new viewpoints with outstanding quality. However, their effectiveness is hindered by rolling shutter (RS) effects, commonly found in most camera systems. To solve this, we present RS-NeRF, a method designed to synthesize normal images from novel views using inputs with RS distortions. This involves a physical model that replicates the image formation process under RS conditions and jointly optimizes NeRF parameters and the camera extrinsics for each image row. We further address the inherent shortcomings of the basic RS-NeRF model by delving into RS characteristics and developing algorithms to enhance its functionality. First, we impose a smoothness regularization to better estimate trajectories and improve synthesis quality, in line with the camera movement prior. We also identify and address a fundamental flaw in the vanilla RS model by introducing a multi-sampling algorithm. This new approach greatly improves the model's performance by comprehensively exploiting the RGB data across different rows for each intermediate camera pose. Through rigorous experimentation, we demonstrate that RS-NeRF surpasses previous methods in both synthetic and real-world scenarios, proving its ability to correct RS-related distortions effectively.


# 217
Strong Double Blind
Geometry Fidelity for Spherical Images

Anders Christensen · Nooshin Mojab · Khushman Patel · Karan Ahuja · Zeynep Akata · Ole Winther · Mar Gonzalez Franco · Andrea Colaco

Spherical or omnidirectional images offer an immersive visual format appealing to a wide range of computer vision applications. However, the geometric properties of spherical images pose a major challenge for models and metrics designed for ordinary 2D images. For image generation, this poses a full-stack problem, from smaller datasets to limited access to tools that can evaluate model quality. Specifically, we show that direct application of the established evaluation metric Fréchet Inception Distance (FID) is insufficient for quantifying geometric fidelity in spherical images. We introduce two quantitative metrics accounting for geometric constraints, namely Omnidirectional FID (OmniFID) and Discontinuity Score (DS). OmniFID is an extension of FID, tailored to additionally capture field-of-view requirements of the spherical format by leveraging cubemap projections. DS is a kernel-based seam alignment score measuring continuity across the borders of 2D representations of spherical images. In experiments, OmniFID and DS quantify geometry fidelity issues that go undetected by FID.
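
For intuition only, the snippet below measures continuity across the left/right wrap-around border of an equirectangular image, which is the kind of seam misalignment DS is designed to expose. It is a rough illustration, not the paper's kernel or normalization, and `seam_discontinuity` is an assumed name.

```python
import numpy as np

def seam_discontinuity(equirect, border=1):
    """Rough illustration of the seam-continuity idea behind DS.

    Measures the mean absolute difference between the leftmost and rightmost
    columns of an equirectangular image (H, W, C): a perfectly continuous
    spherical image should have matching columns where the 2D representation
    wraps around. This is NOT the paper's exact kernel or normalization.
    """
    left = equirect[:, :border, :].astype(np.float64)
    right = equirect[:, -border:, :].astype(np.float64)
    return np.abs(right - left).mean()

img = np.random.rand(256, 512, 3)   # a synthetic equirectangular image
print(seam_discontinuity(img))
```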


# 215
Strong Double Blind
CPT-VR: Improving Surface Rendering via Closest Point Transform with View-Reflection Appearance

Zhipeng Hu · Yongqiang Zhang · Chen Liu · Lincheng Li · Sida Peng · Xiaowei Zhou · Changjie Fan · Xin Yu

Differentiable surface rendering has significantly advanced 3D reconstruction. Existing surface rendering methods assume that the local surface is planar, and thus employ a linear approximation based on Signed Distance Field (SDF) values to predict the point on the surface. However, this assumption overlooks the inherently irregular and non-planar nature of object surfaces in the real world. Consequently, the approximated points tend to deviate from the zero-level set, affecting the fidelity of the reconstructed shape. In this paper, we propose a novel surface rendering method termed CPT-VR, which leverages the Closest Point Transform (CPT) and view and reflection direction vectors to enhance reconstruction quality. Specifically, leveraging the physical property of the CPT that accurately projects points near the surface onto the zero-level set, we correct the deviated points, thus achieving an accurate geometry representation. Based on this accurate geometry representation, incorporating the reflection vector into our method facilitates appearance modeling of specular regions. Moreover, to remove any dependence on prior knowledge of the background, we present a background model that learns the background appearance. Compared to previous state-of-the-art methods, CPT-VR achieves better surface reconstruction quality, even for cases with complex structures and specular highlights.


# 221
MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Guoxing Sun · Rishabh Dabral · Pascal Fua · Christian Theobalt · Marc Habermann

Faithful human performance capture and free-view rendering from sparse RGB observations is a long-standing problem in vision and graphics. The main challenges are the lack of observations and the inherent ambiguities of the setting, e.g. occlusions and depth ambiguity. As a result, radiance fields, which have shown great promise in capturing high-frequency appearance and geometry details in dense setups, perform poorly when naively supervised on sparse camera views, as the field simply overfits to the sparse input views. To address this, we propose MetaCap, a method for efficient and high-quality geometry recovery and novel view synthesis given very sparse or even a single view of the human. Our key idea is to meta-learn radiance field weights solely from multi-view videos, which can serve as a prior when fine-tuning them on sparse imagery depicting the human. Due to the articulated structure of the human body and motion-induced surface deformations, learning such a prior is non-trivial. Therefore, we propose to meta-learn the field weights in a pose-canonical space, which reduces the spatial feature range and makes feature learning more effective. Consequently, our field parameters can be fine-tuned to quickly generalize to unseen poses, novel illumination conditions, and novel and sparse (even monocular) camera views. For evaluating our method under different scenarios, we collect a new dataset, WildDynaCap, which contains subjects captured in both a dense camera dome and in-the-wild sparse camera rigs, and demonstrate superior results compared to recent state-of-the-art methods on both public datasets and WildDynaCap.


# 218
Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis

Yuanhao Cai · Yixun Liang · Jiahao Wang · Angtian Wang · Yulun Zhang · Xiaokang Yang · Zongwei Zhou · Alan Yuille

X-ray is widely applied for transmission imaging due to its stronger penetration than natural light. When rendering novel-view X-ray projections, existing methods, mainly based on NeRF, suffer from long training times and slow inference. In this paper, we propose a 3D Gaussian splatting-based framework, namely X-Gaussian, for X-ray novel view synthesis. Firstly, we redesign a radiative Gaussian point cloud model inspired by the isotropic nature of X-ray imaging. Our model excludes the influence of view direction when learning to predict the radiation intensity of 3D points. Based on this model, we develop a Differentiable Radiative Rasterization (DRR) with a CUDA implementation. Secondly, we customize an Angle-pose Cuboid Uniform Initialization (ACUI) strategy that directly uses the parameters of the X-ray scanner to compute the camera information and then uniformly samples point positions within a cuboid enclosing the scanned object. Experiments show that our X-Gaussian outperforms state-of-the-art methods by 6.5 dB while requiring less than 15% of their training time and offering over 73x faster inference. The application to sparse-view CT reconstruction also reveals the practical value of our method. Code and models will be released to the public.
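
The ACUI idea of deriving cameras from scanner parameters and seeding points uniformly in an enclosing cuboid can be sketched as below. The scanner geometry (a circular source trajectory of radius `src_radius`, an axis-aligned cuboid) and the function name are assumptions for illustration; camera orientations and the full projection model are omitted.

```python
import numpy as np

def acui_init(num_views, num_points, src_radius, cuboid_half_extent):
    """Sketch of an ACUI-style initialization from scanner parameters.

    Assumes the X-ray source moves on a circle of radius `src_radius` around
    the object, and the object fits inside an axis-aligned cuboid with the
    given half extents (x, y, z). Not the paper's exact implementation.
    """
    # Camera (source) centers computed directly from the scan angles.
    angles = np.linspace(0.0, 2 * np.pi, num_views, endpoint=False)
    cam_centers = np.stack([src_radius * np.cos(angles),
                            src_radius * np.sin(angles),
                            np.zeros_like(angles)], axis=1)          # (V, 3)

    # Uniformly sample initial point positions inside the enclosing cuboid.
    half = np.asarray(cuboid_half_extent, dtype=np.float64)
    points = np.random.uniform(-half, half, size=(num_points, 3))    # (P, 3)
    return cam_centers, points

cams, pts = acui_init(num_views=100, num_points=50_000,
                      src_radius=575.0, cuboid_half_extent=(64, 64, 64))
```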


# 203
GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time

Hao Li · Yuanyuan Gao · Dingwen Zhang · Chenming Wu · YALUN DAI · Chen Zhao · Haocheng Feng · Errui Ding · Jingdong Wang · Junwei Han

This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, the complexity of processing high-resolution images, and lengthy optimization processes, thus facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in real-world scenarios. Specifically, we design a novel joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model. With the joint learning mechanism, the proposed framework can inherently estimate robust relative pose information from the image observations and thus largely alleviate the requirement for real camera poses. Moreover, we implement a deferred back-propagation mechanism that enables high-resolution training and inference, overcoming the resolution constraints of previous methods. To enhance speed and efficiency, we further introduce a progressive Gaussian cache module that dynamically adjusts during training and inference. As the first pose-free generalizable 3D-GS framework, GGRt achieves inference at $\ge$ 5 FPS and real-time rendering at $\ge$ 100 FPS. Through extensive experimentation, we demonstrate that our method outperforms existing NeRF-based pose-free techniques in terms of inference speed and effectiveness, and approaches the performance of 3D-GS methods that rely on real camera poses. Our contributions provide a significant leap forward for the integration of computer vision and computer graphics into practical applications, offering state-of-the-art results on the LLFF, KITTI, and Waymo Open datasets and enabling real-time rendering for immersive experiences.


# 291
Strong Double Blind
Neural graphics texture compression supporting random access

Farzad Farhadzadeh · Qiqi Hou · Hoang Le · Amir Said · Randall R Rauwendaal · Alex Bourd · Fatih Porikli

Advances in rendering have led to tremendous growth in texture assets, including resolution, complexity, and novel texture components, but this growth in data volume has not been matched by advances in compression. Meanwhile, Neural Image Compression (NIC) has advanced significantly and shown promising results, but existing methods cannot be directly adapted to neural texture compression. First, texture compression requires on-demand and real-time decoding with random access during parallel rendering (e.g. block texture decompression on GPUs). Additionally, NIC does not support multi-resolution reconstruction (mip levels), nor can it efficiently jointly compress different sets of texture channels. In this work, we introduce a novel approach to texture set compression that integrates traditional GPU texture representations and NIC techniques, designed to enable random access and support many-channel texture sets. To achieve this goal, we propose an asymmetric auto-encoder framework that employs a convolutional encoder to capture detailed information in a bottleneck latent space; at the decoder side we use a fully connected network whose inputs are sampled latent features plus positional information for a given texture coordinate and mip level. The latent representation is laid out so that multi-resolution data can be accessed simply by changing the scanning strides. Experimental results demonstrate that this approach provides much better results than conventional texture compression and a significant improvement over the latest neural-network-based method.


# 220
GS2Mesh: Surface Reconstruction from Gaussian Splatting via Novel Stereo Views

Yaniv Wolf · Amit Bracha · Ron Kimmel

Recently, 3D Gaussian Splatting (3DGS) has emerged as an efficient approach for accurately representing scenes. However, despite its superior novel view synthesis capabilities, extracting the geometry of the scene directly from the Gaussian properties remains a challenge, as those are optimized based on a photometric loss. While some concurrent models have tried adding geometric constraints during the Gaussian optimization process, they still produce noisy, unrealistic surfaces. We propose a novel approach for bridging the gap between the noisy 3DGS representation and the smooth 3D mesh representation, by injecting real-world knowledge into the depth extraction process. Instead of extracting the geometry of the scene directly from the Gaussian properties, we instead extract the geometry through a pre-trained stereo-matching model. We render stereo-aligned pairs of images corresponding to the original training poses, feed the pairs into a stereo model to get a depth profile, and finally fuse all of the profiles together to get a single mesh. The resulting reconstruction is smoother, more accurate and shows more intricate details compared to other methods for surface reconstruction from Gaussian Splatting, while only requiring a small overhead on top of the fairly short 3DGS optimization process. We performed extensive testing of the proposed method on in-the-wild scenes, obtained using a smartphone, showcasing its superior reconstruction abilities. Additionally, we tested the method on the Tanks and Temples and DTU benchmarks, achieving state-of-the-art results.


# 224
A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis

Kai Katsumata · Duc Minh Vo · Hideki Nakayama

3D Gaussian Splatting (3DGS) has shown remarkable success in synthesizing novel views given multiple views of a static scene. Yet, 3DGS faces challenges when applied to dynamic scenes because 3D Gaussian parameters need to be updated per timestep, requiring a large amount of memory and at least a dozen observations per timestep. To address these limitations, we present a compact dynamic 3D Gaussian representation that models positions and rotations as functions of time with few-parameter approximations, while keeping the other properties of 3DGS, including scale, color, and opacity, invariant. Our method dramatically reduces memory usage and relaxes the strict multi-view assumption. In our experiments on monocular and multi-view scenarios, we show that our method not only matches the rendering quality of state-of-the-art methods, which often come with slower rendering speeds, but also significantly surpasses them by achieving a rendering speed of 118 frames per second (FPS) at a resolution of 1,352x1,014 on a single GPU.
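
One way to picture the compact representation is to store a handful of polynomial coefficients per Gaussian for position and rotation while keeping all other attributes static, as in the sketch below. The class name, polynomial basis, and degree are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompactDynamicGaussians(nn.Module):
    """Sketch of time-parameterized Gaussian motion (not the paper's exact model).

    Each Gaussian's position and rotation are low-order polynomials of time,
    while scale, color, and opacity stay constant, so per-timestep storage
    collapses into a handful of coefficients per Gaussian.
    """
    def __init__(self, num_gaussians, degree=2):
        super().__init__()
        # Polynomial coefficients: (N, degree+1, 3) for xyz, (N, degree+1, 4) for quaternions.
        self.pos_coef = nn.Parameter(torch.randn(num_gaussians, degree + 1, 3) * 0.01)
        self.rot_coef = nn.Parameter(torch.randn(num_gaussians, degree + 1, 4) * 0.01)
        self.scale = nn.Parameter(torch.zeros(num_gaussians, 3))     # static
        self.color = nn.Parameter(torch.rand(num_gaussians, 3))      # static
        self.opacity = nn.Parameter(torch.zeros(num_gaussians, 1))   # static
        self.degree = degree

    def forward(self, t):
        # Monomial basis [1, t, t^2, ...] evaluated at normalized time t in [0, 1].
        basis = torch.stack([t ** i for i in range(self.degree + 1)])      # (D+1,)
        pos = torch.einsum('d,ndc->nc', basis, self.pos_coef)              # (N, 3)
        rot = F.normalize(torch.einsum('d,ndc->nc', basis, self.rot_coef), dim=-1)
        return pos, rot, self.scale.exp(), self.color, self.opacity.sigmoid()

model = CompactDynamicGaussians(num_gaussians=10_000)
pos_t, rot_t, scale, color, opacity = model(torch.tensor(0.5))
```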


# 201
Strong Double Blind
Click-Gaussian: Interactive Segmentation to Any 3D Gaussians

Seokhun Choi · Hyeonseop Song · Jaechul Kim · Taehyeong Kim · Hoseok Do

Interactive segmentation of 3D Gaussians opens a great opportunity for real-time manipulation of 3D scenes, thanks to the real-time rendering capability of 3D Gaussian Splatting. However, current methods suffer from time-consuming post-processing to deal with noisy segmentation output. They also struggle to provide detailed segmentation, which is important for fine-grained manipulation of 3D scenes. In this study, we propose Click-Gaussian, which learns distinguishable feature fields of two-level granularity, facilitating segmentation without time-consuming post-processing. We delve into the challenges stemming from inconsistently learned feature fields resulting from 2D segmentations obtained independently of the 3D scene. 3D segmentation accuracy deteriorates when 2D segmentation results across views, the primary cues for 3D segmentation, are in conflict. To overcome these issues, we propose Global Feature-guided Learning (GFL). GFL constructs clusters of global feature candidates from noisy 2D segments across views, which smooths out noise when learning the features of 3D Gaussians. Our method runs in 10 ms per click, 15 to 130 times faster than previous methods, while also significantly improving segmentation accuracy.


# 200
Strong Double Blind
McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction

Daxuan Ren · Hezi Shi · Jianmin Zheng · Jianfei Cai

Iso-surface extraction from an implicit field is a fundamental process in various applications of computer vision and graphics. When dealing with geometric shapes with complicated geometric details, many existing algorithms suffer from high computational costs and memory usage. This paper proposes McGrids, a novel approach to improve the efficiency of iso-surface extraction. The key idea is to construct adaptive grids for iso-surface extraction rather than using a simple uniform grid as prior art does. Specifically, we formulate the construction of adaptive grids as a probability sampling problem, which we then solve with a Monte Carlo process. We demonstrate McGrids' capability with extensive experiments on both analytical SDFs computed from surface meshes and learned implicit fields from real multi-view images. The experimental results show that McGrids significantly reduces the number of implicit field queries, resulting in substantial memory reduction, while producing high-quality meshes with rich geometric details.
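
The adaptive-grid idea can be illustrated by sampling candidate grid points with an acceptance probability that decays with distance from the iso-surface, so samples concentrate where the surface actually is. The sketch below uses an analytic sphere SDF and an exponential acceptance rule purely as assumptions; McGrids' actual probability model and grid construction are more involved.

```python
import numpy as np

def sample_adaptive_points(sdf, num_samples, bound=1.0, tau=0.05, batch=4096):
    """Sketch of Monte Carlo sampling of grid points near an iso-surface.

    Proposes uniform points in [-bound, bound]^3 and accepts them with
    probability exp(-|sdf| / tau), so accepted points cluster near the zero
    level set. This is only the underlying intuition, not McGrids' scheme.
    """
    accepted = []
    while sum(len(a) for a in accepted) < num_samples:
        pts = np.random.uniform(-bound, bound, size=(batch, 3))
        p_accept = np.exp(-np.abs(sdf(pts)) / tau)
        keep = np.random.rand(batch) < p_accept
        accepted.append(pts[keep])
    return np.concatenate(accepted)[:num_samples]

# Analytic SDF of a radius-0.5 sphere as a toy implicit field.
sphere_sdf = lambda p: np.linalg.norm(p, axis=-1) - 0.5
grid_points = sample_adaptive_points(sphere_sdf, num_samples=20_000)
```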


# 207
latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction

Christopher Wewer · Kevin Raj · Eddy Ilg · Bernt Schiele · Jan Eric Lenssen

We present latentSplat, a method to predict variational Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture. Existing methods for generalizable 3D reconstruction either do not enable fast inference of high-resolution novel views due to slow volume rendering, or are limited to interpolation of close input views, even in simpler settings with a single central object where 360-degree generalization is possible. In this work, we combine a regression-based approach with a generative model, moving towards both of these capabilities within the same method, trained purely on readily available real video data. The core of our method is a set of variational 3D Gaussians, a representation that efficiently encodes varying uncertainty within a latent space of 3D feature Gaussians. From these Gaussians, specific instances can be sampled and rendered via efficient Gaussian splatting and a fast, generative decoder network. We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.


# 223
Strong Double Blind
Non-parametric Sensor Noise Modeling and Synthesis

Ali Mosleh · Luxi Zhao · Atin Vikram Singh · Jaeduk Han · Abhijith Punnappurath · Marcus A Brubaker · Jihwan Choe · Michael S Brown

We introduce a novel non-parametric sensor noise model that directly constructs probability mass functions per intensity level from captured images. We show that our noise model provides a more accurate fit to real sensor noise than existing models. We detail the capture procedure for deriving our non-parametric noise model and introduce an interpolation method that reduces the number of ISO levels that need to be captured. In addition, we propose a method to synthesize noise on existing noisy images when noise-free images are not available. Our noise model is straightforward to calibrate and provides notable improvements over competing noise models on downstream tasks.
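
A toy version of the per-intensity PMF construction might look like the following: histogram the observed noise separately for every clean intensity level, then sample per-pixel noise from the PMF matching each pixel's intensity. Function names, the paired clean/noisy capture, and the noise support are assumptions for illustration, not the calibrated procedure described in the paper.

```python
import numpy as np

def build_noise_pmfs(clean, noisy, levels=256, noise_range=64):
    """Build per-intensity probability mass functions of sensor noise.

    clean, noisy: paired uint8 captures of the same scene. For every clean
    intensity level we histogram the observed noise (noisy - clean), yielding
    one empirical PMF per level.
    """
    noise = noisy.astype(np.int32) - clean.astype(np.int32)
    bins = np.arange(-noise_range, noise_range + 2) - 0.5
    pmfs = np.zeros((levels, len(bins) - 1))
    for lvl in range(levels):
        vals = noise[clean == lvl]
        if len(vals):
            hist, _ = np.histogram(np.clip(vals, -noise_range, noise_range), bins=bins)
            pmfs[lvl] = hist / hist.sum()
    return pmfs, np.arange(-noise_range, noise_range + 1)

def synthesize_noise(clean, pmfs, support):
    """Sample per-pixel noise from the PMF of each pixel's clean intensity."""
    out = clean.astype(np.int32).copy()
    for lvl in np.unique(clean):
        mask = clean == lvl
        if pmfs[lvl].sum() > 0:
            out[mask] += np.random.choice(support, size=mask.sum(), p=pmfs[lvl])
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy paired capture: a clean image and a noisy observation of it.
clean = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
noisy = np.clip(clean.astype(np.int32) + np.random.randint(-5, 6, clean.shape), 0, 255).astype(np.uint8)
pmfs, support = build_noise_pmfs(clean, noisy)
resampled = synthesize_noise(clean, pmfs, support)
```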


# 213
UpFusion: Novel View Diffusion from Unposed Sparse View Observations

Bharath Raj Nagoor Kani · Hsin-Ying Lee · Sergey Tulyakov · Shubham Tulsiani

We propose UpFusion, a system that can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images without corresponding pose information. Current sparse-view 3D inference methods typically rely on camera poses to geometrically aggregate information from input views, but are not robust in-the-wild when such information is unavailable/inaccurate. In contrast, UpFusion sidesteps this requirement by learning to implicitly leverage the available images as context in a conditional generative model for synthesizing novel views. We incorporate two complementary forms of conditioning into diffusion models for leveraging the input views: a) via inferring query-view aligned features using a scene-level transformer, b) via intermediate attentional layers that can directly observe the input image tokens. We show that this mechanism allows generating high-fidelity novel views while improving the synthesis quality given additional (unposed) images. We evaluate our approach on the Co3Dv2 and Google Scanned Objects datasets and demonstrate the benefits of our method over pose-reliant sparse-view methods as well as single-view methods that cannot leverage additional views. Finally, we also show that our learned model can generalize beyond the training categories and even allow reconstruction from self-captured images of generic objects in-the-wild.


# 199
MVDD: Multi-View Depth Diffusion Models

Zhen Wang · Qiangeng Xu · Feitong Tan · Menglei Chai · Shichen Liu · Rohit Pandey · Sean Fanello · Achuta Kadambi · Yinda Zhang

Denoising diffusion models have demonstrated outstanding results in 2D image generation, yet it remains a challenge to replicate their success in 3D shape generation. In this paper, we propose leveraging multi-view depth, which represents complex 3D shapes in a 2D data format that is easy to denoise. We pair this representation with a diffusion model, MVDD, that is capable of generating high-quality dense point clouds with 20K+ points with fine-grained details. To enforce 3D consistency in multi-view depth, we introduce an epipolar line segment attention that conditions the denoising step for a view on its neighboring views. Additionally, a depth fusion module is incorporated into the diffusion steps to further ensure the alignment of depth maps. When augmented with surface reconstruction, MVDD can also produce high-quality 3D meshes. Furthermore, MVDD stands out in other tasks such as depth completion and can serve as a 3D prior, significantly boosting downstream tasks such as GAN inversion. State-of-the-art results from extensive experiments demonstrate MVDD's excellent ability in 3D shape generation and depth completion, and its potential as a 3D prior for downstream tasks.


# 155
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Jiaxiang Tang · Zhaoxi Chen · Xiaokang Chen · Tengfei Wang · Gang Zeng · Ziwei Liu

3D content creation has achieved significant progress in terms of both quality and speed. Although current feed-forward models can produce 3D objects in seconds, their resolution is constrained by the intensive computation required during training. In this paper, we introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images. Our key insights are two-fold: 1) 3D Representation: We propose multi-view Gaussian features as an efficient yet powerful representation, which can then be fused together for differentiable rendering. 2) 3D Backbone: We present an asymmetric U-Net as a high-throughput backbone operating on multi-view images, which can be produced from text or single-view image input by leveraging multi-view diffusion models. Extensive experiments demonstrate the high fidelity and efficiency of our approach. Notably, we maintain the fast speed to generate 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.


# 322
Hypernetworks for Generalizable BRDF Representation

Fazilet Gokbudak · Alejandro Sztrajman · Chenliang Zhou · Fangcheng Zhong · Rafal Mantiuk · Cengiz Oztireli

In this paper, we introduce a technique to estimate measured BRDFs from a sparse set of samples. Our approach offers accurate BRDF reconstructions that generalize to new materials, opening the door to BRDF reconstruction from a variety of data sources. The success of our approach relies on the ability of hypernetworks to generate a robust representation of BRDFs and on a set encoder that allows us to feed inputs of different sizes to the architecture. The set encoder and the hypernetwork also enable the compression of densely sampled BRDFs. We evaluate our technique both qualitatively and quantitatively on the well-known MERL dataset of 100 isotropic materials. Our approach accurately 1) estimates the BRDFs of unseen materials even for extremely sparse sampling, and 2) compresses the measured BRDFs into very small embeddings, e.g., 7D.
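
The combination of a set encoder and a hypernetwork can be sketched as follows: an order-invariant (DeepSets-style) encoder pools a variable number of BRDF samples into an embedding, and a hypernetwork maps that embedding to the weights of a tiny per-material MLP. All layer sizes, the 9-D sample format, and class names are assumptions; the paper's architecture is only approximated here.

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Order-invariant encoder: embed each (directions, value) sample, then mean-pool."""
    def __init__(self, in_dim=9, emb=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, emb), nn.ReLU(), nn.Linear(emb, emb))

    def forward(self, samples):                 # samples: (N, in_dim), N can vary
        return self.phi(samples).mean(dim=0)    # (emb,)

class BRDFHypernetwork(nn.Module):
    """Maps a material embedding to the weights of a tiny per-material BRDF MLP."""
    def __init__(self, emb=64, hidden=32, dir_dim=6, out_dim=3):
        super().__init__()
        self.dir_dim, self.hidden, self.out_dim = dir_dim, hidden, out_dim
        n_weights = dir_dim * hidden + hidden + hidden * out_dim + out_dim
        self.head = nn.Linear(emb, n_weights)

    def forward(self, embedding, dirs):          # dirs: (M, dir_dim) query directions
        w = self.head(embedding)
        i = 0
        w1 = w[i:i + self.dir_dim * self.hidden].view(self.hidden, self.dir_dim); i += w1.numel()
        b1 = w[i:i + self.hidden]; i += self.hidden
        w2 = w[i:i + self.hidden * self.out_dim].view(self.out_dim, self.hidden); i += w2.numel()
        b2 = w[i:]
        # Evaluate the generated two-layer MLP on the query directions.
        h = torch.relu(dirs @ w1.T + b1)
        return h @ w2.T + b2                     # predicted BRDF values, (M, 3)

# Toy usage: 50 sparse (wi, wo, rgb) samples of one material, 1000 query directions.
samples = torch.rand(50, 9)
queries = torch.rand(1000, 6)
enc, hyper = SetEncoder(), BRDFHypernetwork()
brdf_pred = hyper(enc(samples), queries)
```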


# 205
Strong Double Blind
High-Fidelity 3D Textured Shapes Generation by Sparse Encoding and Adversarial Decoding

Qi Zuo · Xiaodong Gu · Yuan Dong · Zhengyi Zhao · Weihao Yuan · Lingteng Qiu · Liefeng Bo · Zilong Dong

3D vision is inherently characterized by sparse spatial structures, which motivates the need for an efficient paradigm tailored to 3D generation. Another issue is the limited amount of 3D training data, which inevitably affects generalization. To address these, we design a 3D generation framework that retains most of the building blocks of Stable Diffusion with minimal adaptations for textured shape generation. We design a Sparse Encoding Module for detail preservation and an Adversarial Decoding Module for better shape recovery. Moreover, we clean up the data and build a benchmark on the largest 3D dataset (Objaverse). We drop the concept of a specific class and treat 3D textured shape generation as an open-vocabulary problem. We first validate our network design on ShapeNetV2 with 55K samples on single-class unconditional generation and multi-class conditional generation tasks. We then report metrics on the processed Objaverse-clean set with 200K samples on the image-conditional generation task. Extensive experiments demonstrate that our approach outperforms SOTA methods and takes a further step towards open-vocabulary 3D generation.


# 227
Strong Double Blind
Structured-NeRF: Hierarchical Scene Graph with Neural Representation

Zhide Zhong · Jiakai Cao · songen gu · Sirui Xie · Liyi Luo · HAO ZHAO · Guyue Zhou · Haoang Li · Zike Yan

We present Structured Neural Radiance Field (Structured-NeRF) for indoor scene representation, based on a novel hierarchical scene graph structure that organizes the neural radiance field. Existing object-centric methods focus only on the inherent characteristics of objects, while overlooking the semantic and physical relationships between them. Our scene graph is adept at managing the complex real-world correlations between objects within a scene, enabling functionality beyond novel view synthesis, such as scene re-arrangement. Based on the hierarchical structure, we introduce an optimization strategy based on semantic and physical relationships, thus simplifying the operations involved in scene editing and ensuring both efficiency and accuracy. Moreover, we conduct shadow rendering on objects to further intensify the realism of the rendered images. Experimental results demonstrate that our structured representation not only achieves state-of-the-art (SOTA) performance in object-level and scene-level rendering, but also advances downstream applications in union with LLMs/VLMs, such as automatic and instruction/image-conditioned scene re-arrangement, thereby extending NeRF to convenient and controllable interactive editing.


# 209
3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing

Haoran Li · Long Ma · Haolin Shi · Yanbin Hao · Yong Liao · Lechao Cheng · Peng Yuan Zhou

Current GAN inversion methods typically can only edit the appearance and shape of a single object and the background, while overlooking spatial information. In this work, we propose a 3D editing framework, 3D-GOI, to enable multifaceted editing of affine information (scale, translation, and rotation) on multiple objects. 3D-GOI realizes this complex editing function by inverting the abundance of attribute codes (object shape/appearance/scale/rotation/translation, background shape/appearance, and camera pose) controlled by GIRAFFE, a renowned 3D GAN. Accurately inverting all the codes is challenging; 3D-GOI addresses this challenge in three main steps. First, we segment the objects and the background in a multi-object image. Second, we use a custom Neural Inversion Encoder to obtain coarse codes for each object. Finally, we use a round-robin optimization algorithm to obtain precise codes to reconstruct the image. To the best of our knowledge, 3D-GOI is the first framework to enable multifaceted editing on multiple objects. Both qualitative and quantitative experiments demonstrate that 3D-GOI holds immense potential for flexible, multifaceted editing in complex multi-object scenes. Our project code will be made publicly available.


# 302
Free-Editor: Zero-shot Text-driven 3D Scene Editing

Md Nazmul Karim · Hasan Iqbal · Umar Khalid · Chen Chen · Jing Hua

Text-to-Image (T2I) diffusion models have gained popularity recently due to their multipurpose and easy-to-use nature, e.g. image and video generation as well as editing. However, training a diffusion model specifically for 3D scene editing is not straightforward due to the lack of large-scale datasets. To date, editing 3D scenes requires either re-training the model to adapt to various 3D edited scenes or designing specific methods for each editing type. Furthermore, state-of-the-art (SOTA) methods require multiple synchronized edited images from the same scene to facilitate scene editing. Due to the current limitations of T2I models, it is very challenging to apply consistent editing effects across multiple images, i.e. multi-view inconsistency in editing, which in turn compromises 3D scene editing performance if these images are used. In our work, we propose a novel training-free 3D scene editing technique, \textsc{Free-Editor}, which allows users to edit 3D scenes without re-training the model at test time. Our proposed method avoids the \emph{multi-view style inconsistency} issue of SOTA methods with the help of a ``single-view editing'' scheme. Specifically, we show that editing a particular 3D scene can be performed by modifying only a single view. To this end, we introduce an \emph{Edit Transformer} that enforces intra-view consistency and inter-view style transfer via self-view and cross-view attention, respectively. Since it is no longer required to re-train the model and edit every view in a scene, the editing time and memory requirements are reduced significantly, e.g., the runtime is $\sim \textbf{20} \times$ faster than SOTA. We have conducted extensive experiments on a wide range of benchmark datasets and achieved diverse editing capabilities with our proposed technique.


# 296
Texture-GS: Disentangle the Geometry and Texture for 3D Gaussian Splatting Editing

Tian-Xing Xu · WENBO HU · Yu-Kun Lai · Ying Shan · Song-Hai Zhang

3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering. However, it couples the appearance and geometry of the scene within the Gaussian attributes, which hinders the flexibility of editing operations such as texture swapping. To address this issue, we propose a novel approach, namely Texture-GS, to disentangle the appearance from the geometry by representing it as a 2D texture, thereby facilitating appearance editing. Technically, the disentanglement is achieved by our proposed texture mapping module, which consists of a UV mapping MLP to learn the UV coordinates for the 3D Gaussian centers, a local Taylor expansion of the MLP to efficiently approximate the UV coordinates for the ray-Gaussian intersections, and a learnable texture to capture the fine-grained appearance. Extensive experiments on the DTU dataset demonstrate that our method not only facilitates high-fidelity appearance editing but also achieves real-time rendering on consumer-level devices, e.g. a single RTX 2080 Ti GPU.
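
The local Taylor expansion mentioned above amounts to linearizing the UV-mapping MLP around each Gaussian center, so ray-Gaussian intersections can reuse one MLP evaluation and one Jacobian instead of a full forward pass each. The sketch below demonstrates that first-order approximation with a stand-in MLP; it is an illustration of the idea, not the paper's implementation.

```python
import torch

# A stand-in UV-mapping MLP: 3D point -> 2D texture coordinate.
uv_mlp = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))

def taylor_uv(center, points):
    """First-order Taylor approximation of the UV mapping around a Gaussian center.

    uv(x) ~= uv(c) + J(c) (x - c), so per-intersection UVs need only one MLP
    evaluation and one Jacobian per Gaussian instead of one MLP call per ray.
    """
    uv_c = uv_mlp(center)                                        # (2,)
    J = torch.autograd.functional.jacobian(uv_mlp, center)       # (2, 3)
    return uv_c + (points - center) @ J.T                        # (M, 2)

center = torch.randn(3)                                # a 3D Gaussian center
intersections = center + 0.01 * torch.randn(128, 3)    # nearby ray-Gaussian hits
uv_exact = uv_mlp(intersections)
uv_approx = taylor_uv(center, intersections)
print((uv_exact - uv_approx).abs().max())              # small approximation error
```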


# 307
Strong Double Blind
VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

Shang Liu · Chaohui Yu · Chenjie Cao · Wen Qian · Fan Wang

Recent research on texture synthesis for 3D shapes benefits greatly from rapidly developing 2D text-to-image diffusion models, including inpainting-based and optimization-based approaches. However, these methods ignore the modal gap between 2D diffusion models and 3D objects: they primarily render 3D objects into 2D images and texture each image separately. In this paper, we revisit texture synthesis and propose a Variance alignment based 3D-2D Collaborative Denoising framework, dubbed VCD-Texture, to address these issues. Formally, we first unify 2D and 3D latent feature learning in diffusion self-attention modules with re-projected 3D attention receptive fields. Subsequently, the denoised multi-view 2D latent features are aggregated into 3D space and then rasterized back to form more consistent 2D predictions. However, the rasterization process suffers from an intractable variance bias, which we address theoretically via the proposed variance alignment, achieving high-fidelity texture synthesis. Moreover, we present an inpainting refinement to further improve details in conflicting regions. Notably, there is no publicly available benchmark to evaluate texture synthesis, which hinders its development. We therefore construct a new evaluation set built upon three open-source 3D datasets and propose four metrics to thoroughly validate texturing performance. Comprehensive experiments demonstrate that VCD-Texture achieves superior performance against other counterparts.


# 323
UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation

Zexiang Liu · Yangguang Li · Youtian Lin · Xin Yu · Sida Peng · Yanpei Cao · Qi Xiaojuan · Xiaoshui Huang · Ding Liang · Wanli Ouyang

Recent advancements in text-to-3D generation have significantly advanced the conversion of textual descriptions into imaginative 3D objects with sound geometry and fine textures. Despite these developments, a prevalent limitation arises from the use of RGB data in diffusion or reconstruction models, which often results in models with lighting and shadow effects baked in that detract from their realism, limiting their usability in applications that demand accurate relighting. To bridge this gap, we present UniDream, a text-to-3D generation framework that incorporates unified diffusion priors. Our approach consists of three main components: (1) a dual-phase training process to obtain albedo-normal-aligned multi-view diffusion and reconstruction models, (2) a progressive generation procedure for geometry and albedo textures based on Score Distillation Sampling (SDS) using the trained reconstruction and diffusion models, and (3) an innovative application of SDS for finalizing PBR generation while keeping the albedo fixed, based on the Stable Diffusion model. Extensive evaluations demonstrate that UniDream surpasses existing methods in generating 3D objects with clearer albedo textures, smoother surfaces, enhanced realism, and superior relighting capabilities.


# 334
Strong Double Blind
ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

Zhiyuan MA · Yuxiang WEI · Yabin Zhang · Xiangyu Zhu · Zhen Lei · Yabin Zhang

By leveraging the text-to-image diffusion prior, score distillation can synthesize 3D content without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have focused on learning a text-to-3D generative network that amortizes multiple text-3D relations and can synthesize 3D content in seconds. However, existing score distillation methods are hard to scale up to a large number of text prompts due to the difficulty of aligning the pretrained diffusion prior with the distribution of rendered images from various text prompts. Current state-of-the-art methods such as Variational Score Distillation finetune the pretrained diffusion model to minimize the noise prediction error and thus align the distributions, but this is unstable to train and impairs the model's comprehension of numerous text prompts. Based on the observation that diffusion models tend to have lower noise prediction errors at earlier timesteps, we propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones. ASD is stable to train and can scale up to 100k prompts. It reduces the noise prediction error without changing the weights of the pre-trained diffusion model, thus preserving its strong prompt comprehension. We conduct extensive experiments using different text-to-3D architectures, including Hyper-iNGP and 3DConv-Net. The results demonstrate ASD's effectiveness in stable 3D generator training, high-quality 3D content synthesis, and superior prompt consistency, especially with large prompt corpora. Code will be released.


# 318
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

Junkai Yan · Yipeng Gao · Qize Yang · Xihan Wei · Xuansong Xie · Ancong Wu · WEISHI ZHENG

Text-to-3D generation, which synthesizes 3D assets according to an overall text description, has progressed significantly. However, a challenge arises when specific appearances must be customized at designated viewpoints while only an overall description is available for generating the 3D object. For instance, ambiguity easily occurs when producing a T-shirt with distinct patterns on its front and back from a single overall text prompt. In this work, we propose DreamView, a text-to-image approach enabling multi-view customization while maintaining overall consistency by adaptively injecting view-specific and overall text guidance through a collaborative text guidance injection module, which can also be lifted to 3D generation via score distillation sampling. DreamView is trained with large-scale rendered multi-view images and their corresponding view-specific texts to learn to balance separate content manipulation in each view against the global consistency of the overall object, resulting in a dual achievement of customization and consistency. Consequently, DreamView empowers artists to design 3D objects creatively, fostering the creation of more innovative and diverse 3D assets.


# 342
Strong Double Blind
SceneTeller: Language-to-3D Scene Generation

Basak Melis Ocal · Maxim Tatarchenko · Sezer Karaoglu · Theo Gevers

Designing high-quality indoor 3D scenes is important in many practical applications, such as room planning or game development. Conventionally, this has been a time-consuming process that requires both artistic skill and familiarity with professional software, making it hardly accessible to lay users. However, recent advances in generative AI have established a solid foundation for democratizing 3D design. In this paper we propose a pioneering approach for text-based 3D room design. Given a prompt in natural language describing the object placement in the room, our method produces a high-quality 3D scene corresponding to it. With an additional text prompt, users can change the appearance of the entire scene or of individual objects in it. Built using in-context learning, CAD model retrieval, and 3D-Gaussian-Splatting-based stylization, our turnkey pipeline produces state-of-the-art 3D scenes while being easy to use even for novices.


# 340
Text to Layer-wise 3D Clothed Human Generation

Junting Dong · Qi Fang · Zehuan Huang · Xudong XU · Jingbo Wang · Sida Peng · Bo Dai

This paper addresses the task of 3D clothed human generation from textual descriptions. Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes clothes editing difficult and sacrifices fine-grained control over the generation process (e.g., specifying the layering order of clothes). To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothes-disentangled 3D human models while providing control over the generation process. The basic idea is to progressively generate a minimally clothed human body and then layer-wise clothes. During clothes generation, a novel stratified compositional rendering method is proposed to fuse multi-layer human models, and a new loss function is utilized to help decouple the clothes model from the human body. The proposed method achieves high-quality disentanglement, thereby providing an effective way for 3D garment generation. Extensive experiments demonstrate that our approach achieves better 3D clothed human generation than holistic modeling methods while also supporting cloth-editing applications such as virtual try-on.


# 341
ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model

Wenyu Li · Binghui Chen · Yifeng Geng · Xuansong Xie · Wangmeng Zuo

With the development of large-scale diffusion models, Artificial Intelligence Generated Content (AIGC) techniques have recently become popular. However, how to truly make them serve our daily lives remains an open question. To this end, in this paper, we focus on employing AIGC techniques in one field of e-commerce marketing, i.e., generating hyper-realistic advertising images that display user-specified shoes worn by humans. Specifically, we propose a shoe-wearing system, called \textbf{ShoeModel}, to generate plausible images of human legs interacting with the given shoes. It consists of three modules performed in sequential stages: (1) a shoe wearable-area detection module (WD), (2) a leg-pose synthesis module (LpS), and (3) a shoe-wearing image generation module (SW). Compared to baselines, ShoeModel is shown to generalize better to different types of shoes, to keep the ID-consistency of the given shoes, and to automatically produce reasonable interactions with humans. Extensive experiments show the effectiveness of our proposed shoe-wearing system.


# 336
Strong Double Blind
D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

Zhaotong Yang · Zicheng Jiang · Xinzhe Li · Huiyu Zhou · Junyu Dong · Huaidong Zhang · YONG DU

In this paper, we introduce D4-VTON, a novel solution for image-based virtual try-on that seamlessly replaces a person's original garments with target garments while preserving pose and identity. We address challenges encountered in prior works, such as inaccurate clothing parsers causing artifacts and failing to ensure faithful semantic alignment. Additionally, we tackle the difficulties faced by diffusion models in solving this specific task, which involves the composite tasks of inpainting and denoising. To achieve these goals, we employ two self-contained techniques. First, we propose a Dynamic Group Warping Module (DGWM) to disentangle semantic information and guide warping flows for authentic warped garments. Second, we deploy a Differential Noise Restoration Process (DNRP) to capture differential noise between incomplete try-on input and its complete counterpart, facilitating lifelike final results with negligible overhead. Extensive experiments demonstrate that D4-VTON surpasses state-of-the-art methods both quantitatively and qualitatively by a significant margin, showcasing its superiority in generating realistic images and precise semantic alignment.


# 212
Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

Yutong Chen · Yifan Zhan · Zhihang Zhong · Wei Wang · Xiao Sun · Yu Qiao · zheng yinqiang

Neural rendering techniques have significantly advanced 3D human body modeling. However, previous approaches often overlook dynamics induced by factors such as motion inertia, leading to challenges in scenarios like abrupt stops after rotation, where the pose remains static while the appearance changes. This limitation arises from reliance on a single pose as conditional input, resulting in ambiguity in mapping one pose to multiple appearances. In this study, we elucidate that variations in human appearance depend not only on the current frame's pose condition but also on past pose states. Therefore, we introduce Dyco, a novel method utilizing the delta pose sequence representation for non-rigid deformations and canonical space to effectively model temporal appearance variations. To prevent a decrease in the model's generalization ability to novel poses, we further propose low-dimensional global context to reduce unnecessary inter-body part dependencies and a quantization operation to mitigate overfitting of the delta pose sequence by the model. To validate the effectiveness of our approach, we collected a novel dataset named I3D-Human, with a focus on capturing temporal changes in clothing appearance under approximate poses. Through extensive experiments on both I3D-Human and existing datasets, our approach demonstrates superior qualitative and quantitative performance. In addition, our inertia-aware 3D human method can unprecedentedly simulate appearance changes caused by inertia at different velocities.


# 250
Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

Keqiang Sun · Dori Litvak · Yunzhi Zhang · Hongsheng LI · Jiajun Wu · Shangzhe Wu

We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for 3D motion synthesis, our model requires no pose annotations or parametric shape models for training; it learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features. At the core of our method is a video Photo-Geometric Auto-Encoding framework that decomposes each training video clip into a set of explicit geometric and photometric representations, including a rest-pose 3D shape, an articulated pose sequence, and texture, with the objective of re-rendering the input video via a differentiable renderer. This decomposition allows us to learn a generative model over the underlying articulated pose sequences akin to a Variational Auto-Encoding (VAE) formulation, but without requiring any external pose annotations. At inference time, we can generate new motion sequences by sampling from the learned motion VAE, and create plausible 4D animations of an animal automatically within seconds given a single input image.


# 216
Strong Double Blind
Temporal Residual Jacobians for Rig-free Motion Transfer

Sanjeev Muralikrishnan · Niladri Shekhar Dutt · Siddhartha Chaudhuri · Noam Aigerman · Vladimir Kim · Matthew Fisher · Niloy Mitra

We introduce Temporal Residual Jacobians as a novel representation to enable data-driven motion transfer. Our approach does not assume access to any rigging or intermediate shape keyframes, produces geometrically and temporally consistent motions, and can be used to transfer long motion sequences. Central to our approach are two dedicated neural networks that individually predict the local geometric and temporal changes that are subsequently integrated, spatially and temporally, to produce the final animated meshes. The two networks are jointly trained, complement each other in producing spatial and temporal signals, and are supervised directly with 3D positional information. During inference, in the absence of keyframes, our method essentially solves a motion extrapolation problem. We test our setup on diverse meshes (synthetic and scanned shapes) to demonstrate its effectiveness in generating realistic and natural-looking animations on unseen body shapes.


# 338
PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation

Jaejung Seol · Seojun Kim · Jaejun Yoo

Visual layout plays a critical role in graphic design fields such as advertising, posters, and web UI design. The recent trend towards content-aware layout generation through generative models has shown promise, yet it often overlooks the semantic intricacies of layout design by treating it as a simple numerical optimization. To bridge this gap, we introduce PosterLlama, a network designed for generating visually and textually coherent layouts by reformatting layout elements into HTML code and leveraging the rich design knowledge embedded within large language models. Furthermore, we enhance the robustness of our model with a unique depth-based poster augmentation strategy. This ensures our generated layouts remain not only semantically rich but also visually appealing, even with limited data. Our extensive evaluations across several benchmarks demonstrate that PosterLlama outperforms existing methods in producing authentic and content-aware layouts. It supports an unparalleled range of conditions, including unconditional layout generation, element-conditional layout generation, and layout completion, serving as a highly versatile user manipulation tool.
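
The abstract's central mechanism is reformatting layout elements into HTML so a language model can reason over them. Below is a minimal illustrative sketch of that idea; the tag names, attribute format, and canvas size are assumptions, not the authors' exact serialization scheme.

```python
# Hypothetical serialization of layout elements into HTML-like text for an LLM.
# Element categories, the <div>/style format, and the canvas size are assumed.

def layout_to_html(elements, canvas_w=512, canvas_h=512):
    """elements: list of dicts with a 'category' and a normalized box (x, y, w, h) in [0, 1]."""
    lines = [f'<body width="{canvas_w}" height="{canvas_h}">']
    for el in elements:
        x, y, w, h = el["box"]
        lines.append(
            f'  <div class="{el["category"]}" '
            f'style="left:{int(x * canvas_w)}px; top:{int(y * canvas_h)}px; '
            f'width:{int(w * canvas_w)}px; height:{int(h * canvas_h)}px"></div>'
        )
    lines.append("</body>")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        {"category": "logo", "box": (0.05, 0.05, 0.20, 0.10)},
        {"category": "text", "box": (0.10, 0.40, 0.80, 0.20)},
    ]
    print(layout_to_html(demo))
```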


# 343
Strong Double Blind
GroundUp: Rapid Sketch-Based 3D City Massing

Gizem Esra Unlu · Mohamed Sayed · Yulia Gryaditskaya · Gabriel Brostow

We propose GroundUp, the first sketch-based ideation tool for 3D city massing of urban areas. Architects start by making designs that create a balance between constructed masses and open spaces, but existing software is aimed at late-stage precise building details. GroundUp aims to allow architects to easily iterate earlier, as they try out and share their ideas, moving back and forth between 2D sketching and 3D. Inspired by feedback from architects and existing workflows, our user sketches the initial footprints of multiple buildings in a top-down view. Our proposed interface then enables them to preview and sketch in a second, perspective view. All the while, they can visualize and navigate a 3D model, inferred to match the two sketches, leading them to change or refine the designs. The model driving our interface has two main components. First, we propose a tailored depth prediction network for perspective sketches, which exploits the top-down view cues. Second, we integrate a state-of-the-art latent diffusion model to enable the generation of plausible-looking buildings. This diffusion model uses a heightfield representation, which allows us to construct the city "from the ground up". It is conditioned on the top-down sketch and partial depth cues derived from the perspective sketch. We will release the code, datasets, and interface upon acceptance.


# 173
DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching

Paul Roetzer · Ahmed Abbas · Dongliang Cao · Florian Bernard · Paul Swoboda

In this work we propose to combine the advantages of learning-based and combinatorial formalisms for 3D shape matching. While learning-based methods lead to state-of-the-art matching performance, they do not ensure geometric consistency, so that obtained matchings are locally non-smooth. On the contrary, axiomatic, optimisation-based methods allow geometric consistency to be taken into account by explicitly constraining the space of valid matchings. However, existing axiomatic formalisms do not scale to practically relevant problem sizes, and require user input for the initialisation of non-convex optimisation problems. We work towards closing this gap by proposing a novel combinatorial solver that combines a unique set of favourable properties: our approach (i) is initialisation free, (ii) is massively parallelisable and powered by a quasi-Newton method, (iii) provides optimality gaps, and (iv) delivers improved matching quality with decreased runtime and globally optimal results for many instances.


# 167
Strong Double Blind
FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation

Honghao Xu · Juzhan Xu · Zeyu Huang · Pengfei Xu · Hui Huang · Ruizhen Hu

In this paper, we introduce a novel method called FRI-Net for 2D floorplan reconstruction from a 3D point cloud. Existing methods typically rely on corner regression or box regression, which lack consideration for the global shapes of rooms. To address these issues, we propose a novel approach using a room-wise implicit representation to characterize the shapes of rooms in floorplans. By incorporating geometric priors of room layouts in floorplans into our training strategy, the generated room polygons are more geometrically regular. We conducted experiments on two challenging datasets, Structured3D and SceneCAD. Our method demonstrates improved performance compared to state-of-the-art methods, validating the effectiveness of our proposed representation for floorplan reconstruction.


# 163
PointNeRF++: A multi-scale, point-based Neural Radiance Field

Weiwei Sun · Eduard Trulls · Yang-Che Tseng · Sneha Sambandam · Gopal Sharma · Andrea Tagliasacchi · Kwang Moo Yi

Point clouds offer an attractive source of information to complement images in neural scene representations, especially when few images are available. Neural rendering methods based on point clouds do exist, but they do not perform well when the point cloud quality is low--e.g., sparse or incomplete, which is often the case with real-world data. We overcome these problems with a simple representation that aggregates point clouds at multiple scale levels with sparse voxel grids at different resolutions. To deal with point cloud sparsity, we average across multiple scale levels---but only among those that are valid, i.e., that have enough neighboring points in proximity to the ray of a pixel. To help model areas without points, we add a global voxel at the coarsest scale, thus unifying "classical" and point-based NeRF formulations. We validate our method on the NeRF Synthetic, ScanNet, and KITTI-360 datasets, outperforming the state of the art, with a significant gap compared to other NeRF-based methods, especially on more challenging scenes.
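
A minimal sketch of the validity-masked multi-scale averaging described above, assuming per-scale features have already been aggregated for a ray sample; the validity threshold and the always-valid global level are illustrative assumptions.

```python
import numpy as np

def fuse_multiscale(features, point_counts, min_points=3):
    """
    features:     (S, C) array, one aggregated feature per scale level for a ray sample.
    point_counts: (S,) number of neighboring points found at each scale.
    A scale contributes only if it has enough points; the coarsest (global) level
    is always kept so that regions without points still receive a feature.
    """
    valid = point_counts >= min_points
    valid[-1] = True  # global voxel at the coarsest scale is always valid
    return features[valid].mean(axis=0)

# toy usage: 4 scales, 8-dim features; the two finest scales are too sparse
feats = np.random.randn(4, 8)
counts = np.array([0, 1, 12, 10_000])
print(fuse_multiscale(feats, counts).shape)  # (8,)
```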


# 164
Strong Double Blind
Continuous SO(3) Equivariant Convolution for 3D Point Cloud Analysis

Jaein Kim · HEE BIN YOO · Dong-Sig Han · Yeon-Ji Song · Byoung-Tak Zhang

The inherent richness of geometric information in point clouds underscores the necessity of leveraging group equivariance, as preserving the topological structure of the point cloud up to the feature space provides an intuitive inductive bias for solving problems in 3D space. Since manifesting the symmetry by means of model architecture has an advantage over depending on augmentation, it has been a crucial research topic in the point cloud field. However, existing methods are limited by the non-continuity of the groups they handle or by complex architectures that cause computational inefficiency. In this paper, we propose CSEConv: a novel point convolution layer equivariant under continuous SO(3) actions. Its structure is founded on the framework of group theory, realizing the convolution module defined on a sphere. Implementing its filters as explicit, continuous, and rigorously equivariant functions defined upon the double coset space is the distinctive factor which makes our method more scalable than previous approaches. From the classification experiments on synthetic and real-world point cloud datasets, our method achieves, to the best of our knowledge, the best accuracy among point-based models equivariant to the continuous rotation group.


# 330
Strong Double Blind
UMERegRobust – Universal Manifold Embedding Compatible Features for Robust Point Cloud Registration

Yuval Haitman · Amit Efraim · Joseph Francos

Point cloud registration is a critical component in many vision-based applications, such as perception for autonomous systems. The registration of point cloud observations on a rigid object, or scene, amounts to estimating the rigid transformation relating them. However, in practical scenarios, these observations are often characterized by partial overlap as a result of being acquired from different viewpoints, as well as by different sampling patterns. In this paper, we adopt the Universal Manifold Embedding (UME) framework for the estimation of rigid transformations and extend it, so that it can accommodate scenarios involving partial overlap and differently sampled point clouds. UME is a methodology designed for mapping observations of the same object, related by rigid transformations, into a single low-dimensional linear subspace. This process yields a transformation-invariant representation of the observations, with its matrix form representation being covariant with the transformation. We extend the UME framework by introducing a UME-compatible feature extraction method augmented with a unique UME contrastive loss and a sampling equalizer. These components are integrated into a comprehensive and robust registration pipeline, named UMERegRobust. We propose the RotKITTI registration benchmark, specifically tailored to evaluate registration methods for scenarios involving large rotations. UMERegRobust achieves better than state-of-the-art performance on the KITTI benchmark, especially when a strict precision of (1 deg, 10 cm) is considered (with an average gain of +9%), and notably outperforms SOTA methods on the RotKITTI benchmark (with a +45% gain compared to the most recent SOTA method).


# 165
FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation

Chenliang Zhou · Fangcheng Zhong · Param Hanji · Zhilin Guo · Kyle Thomas Fogarty · Alejandro Sztrajman · Hongyun Gao · Cengiz Oztireli

We propose FrePolad: frequency-rectified point latent diffusion, a point cloud generation pipeline integrating a variational autoencoder (VAE) with a denoising diffusion probabilistic model (DDPM) modeling the latent distribution. FrePolad simultaneously achieves high quality, diversity, and flexibility in point cloud cardinality for generation tasks while maintaining high computational efficiency. The improvement in generation quality and diversity is achieved through (1) a novel frequency rectification module via spherical harmonics designed to retain high-frequency content while learning the point cloud distribution; and (2) a latent DDPM to learn the regularized yet complex latent distribution. In addition, FrePolad supports variable point cloud cardinality by formulating the sampling of points as conditional distributions over a latent shape distribution. Finally, the low-dimensional latent space encoded by the VAE contributes to FrePolad's fast and scalable sampling. Our quantitative and qualitative evaluations demonstrate the state-of-the-art performance of FrePolad in terms of quality, diversity, and computational efficiency.


# 139
Learning to Adapt SAM for Segmenting Cross-domain Point Clouds

Xidong Peng · Runnan Chen · Feng Qiao · Lingdong Kong · Youquan Liu · Yujing Sun · Tai Wang · Xinge Zhu · Yuexin Ma

Unsupervised domain adaptation (UDA) in 3D segmentation tasks presents a formidable challenge, primarily stemming from the sparse and unordered nature of point clouds. Especially for LiDAR point clouds, the domain discrepancy becomes obvious across varying capture scenes, fluctuating weather conditions, and the diverse array of LiDAR devices in use. Inspired by the remarkable generalization capabilities exhibited by the vision foundation model, SAM, in the realm of image segmentation, our approach leverages the wealth of general knowledge embedded within SAM to unify feature representations across diverse 3D domains and further solves the 3D domain adaptation problem. Specifically, we harness the corresponding images associated with point clouds to facilitate knowledge transfer and propose an innovative hybrid feature augmentation methodology, which enhances the alignment between the 3D feature space and SAM's feature space, operating at both the scene and instance levels. Our method is evaluated on many widely-recognized datasets and achieves state-of-the-art performance.


# 237
Osmosis: RGBD Diffusion Prior for Underwater Image Restoration

Opher Bar Nathan · Deborah Steinberger-Levy · Tali Treibitz · Dan Rosenbaum

Underwater image restoration is a challenging task because of strong water effects that increase dramatically with distance. This is worsened by lack of ground truth data of clean scenes without water. Diffusion priors have emerged as strong image restoration priors. However, they are often trained with a dataset of the desired restored output, which is not available in our case. To overcome this critical issue, we show how to leverage in-air images to train diffusion priors for underwater restoration. We also observe that only color data is insufficient, and augment the prior with a depth channel. We train an unconditional diffusion model prior on the joint space of color and depth, using standard RGBD datasets of natural outdoor scenes in air. Using this prior together with a novel guidance method based on the underwater image formation model, we generate posterior samples of clean images, removing the water effects. Even though our prior did not see any underwater images during training, our method outperforms state-of-the-art baselines for image restoration on very challenging scenes.
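
The guidance step relies on an underwater image formation model. The sketch below implements a commonly used simplified formation model with separate attenuation and backscatter terms; the exact parameterization and the water coefficients in the example are assumptions, not necessarily the paper's model.

```python
import numpy as np

def underwater_formation(J, depth, beta_attn, beta_scatter, B_inf):
    """
    Simplified underwater image formation:
        I = J * exp(-beta_attn * d) + B_inf * (1 - exp(-beta_scatter * d))
    J:     (H, W, 3) clean in-air image in [0, 1]
    depth: (H, W) scene range in meters
    beta_attn, beta_scatter, B_inf: per-channel (3,) water parameters (assumed values below).
    """
    d = depth[..., None]
    direct = J * np.exp(-beta_attn * d)
    backscatter = B_inf * (1.0 - np.exp(-beta_scatter * d))
    return direct + backscatter

# toy usage with assumed water parameters
J = np.random.rand(64, 64, 3)
depth = np.full((64, 64), 5.0)
I = underwater_formation(J, depth,
                         beta_attn=np.array([0.40, 0.12, 0.10]),
                         beta_scatter=np.array([0.35, 0.15, 0.12]),
                         B_inf=np.array([0.05, 0.25, 0.30]))
print(I.shape, I.min() >= 0)
```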


# 197
Strong Double Blind
Differentiable Product Quantization for Memory Efficient Camera Relocalization

Zakaria Laskar · Iaroslav Melekhov · Assia Benbihi · Shuzhe Wang · Juho Kannala

We revisit the problem of camera relocalization under a memory budget through a combination of product quantization and map compression. This achieves high compression rates but leads to a performance drop. To address the memory-performance tradeoff, we train a light-weight scene-specific auto-encoder network that performs quantization-dequantization in an end-to-end differentiable manner, updating both the product quantization centroids and the network parameters. Beyond the standard L2 reconstruction loss for training the auto-encoder network, we show that additional margin-based metric losses are key to achieving good performance. Results show that for a descriptor memory of 1 MB, we can achieve competitive performance on Aachen Day with only a 5% drop in performance.
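
A toy sketch of differentiable product quantization with a straight-through estimator, written in PyTorch. The paper's auto-encoder and margin-based metric losses are not reproduced; the VQ-VAE-style codebook loss used here is an assumption to keep the example self-contained and trainable.

```python
import torch
import torch.nn.functional as F

class DiffPQ(torch.nn.Module):
    """Toy differentiable product quantization (straight-through estimator).
    Splits a descriptor into M sub-vectors and snaps each to its nearest centroid.
    A simplified sketch, not the paper's exact auto-encoder or metric losses."""

    def __init__(self, dim=128, num_subvectors=8, num_centroids=256):
        super().__init__()
        assert dim % num_subvectors == 0
        self.m, self.d_sub = num_subvectors, dim // num_subvectors
        self.codebooks = torch.nn.Parameter(
            torch.randn(self.m, num_centroids, self.d_sub) * 0.1)

    def forward(self, x):                                        # x: (B, dim)
        B = x.shape[0]
        z = x.view(B, self.m, self.d_sub)                        # (B, M, d_sub)
        dists = torch.cdist(z.transpose(0, 1), self.codebooks)   # (M, B, K)
        idx = dists.argmin(dim=-1).transpose(0, 1)               # (B, M)
        q = torch.stack([self.codebooks[m][idx[:, m]] for m in range(self.m)], dim=1)
        # straight-through pass-through for the input, VQ-VAE-style loss for centroids
        q_st = z + (q - z).detach()
        vq_loss = F.mse_loss(q, z.detach()) + 0.25 * F.mse_loss(z, q.detach())
        return q_st.reshape(B, -1), idx, vq_loss

x = torch.randn(4, 128, requires_grad=True)
x_hat, codes, vq_loss = DiffPQ()(x)
(F.mse_loss(x_hat, x) + vq_loss).backward()
print(x.grad.shape, codes.shape)
```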


# 137
RING-NeRF: Rethinking Inductive Biases for Versatile and Efficient Neural Fields

Doriand Petit · Steve Bourgeois · Dumitru Pavel · Vincent Gay-Bellile · Florian Chabot · Loïc Barthe

Recent advances in Neural Fields mostly rely on developing task-specific supervision which often complicates the models. Rather than developing hard-to-combine and specific modules, another approach generally overlooked is to directly inject generic priors on the scene representation (also called inductive biases) into the NeRF architecture. Based on this idea, we propose the RING-NeRF architecture, which includes two inductive biases: a continuous multi-scale representation of the scene and an invariance of the decoder's latent space over spatial and scale domains. We also design a single reconstruction process that takes advantage of those inductive biases and experimentally demonstrate on-par quality with dedicated architectures on multiple tasks (anti-aliasing, few-view reconstruction, SDF reconstruction without scene-specific initialization) while being more efficient. Moreover, RING-NeRF has the distinctive ability to dynamically increase the resolution of the model, opening the way to adaptive reconstruction.


# 196
Strong Double Blind
Light-in-Flight for a World-in-Motion

Jongho Lee · Ryan J Suess · Mohit Gupta

Although time-of-flight (ToF) cameras are becoming the sensor-of-choice for numerous 3D imaging applications in robotics, augmented reality (AR) and human-computer interfaces (HCI), they do not explicitly consider scene or camera motion. Consequently, current ToF cameras do not provide 3D motion information, and the estimated depth and intensity often suffer from significant motion artifacts in dynamic scenes. In this paper, we propose a novel ToF imaging method for dynamic scenes, with the goal of simultaneously estimating 3D geometry, intensity, and 3D motion using a single indirect ToF (I-ToF) camera. Our key observation is that we can estimate 3D motion, as well as motion artifact-free depth and intensity, by designing optical-flow-like algorithms that operate on coded correlation images captured by an I-ToF camera. Through the integration of a multi-frequency I-ToF approach with burst imaging, we demonstrate high-quality all-in-one (3D geometry, intensity, 3D motion) imaging even in challenging low signal-to-noise ratio scenarios. We show the effectiveness of our approach through thorough simulations and real experiments conducted across a wide range of motion and imaging scenarios, including indoor and outdoor dynamic scenes.


# 194
Binomial Self-compensation for Motion Error in Dynamic 3D Scanning

Geyou Zhang · Ce Zhu · Kai Liu

Phase shifting profilometry (PSP) is favored in high-precision 3D scanning due to its high accuracy, robustness, and pixel-wise property. However, a fundamental assumption of PSP that the object should remain static is violated in dynamic measurement, making PSP susceptible to object motion, which results in ripple-like errors in the point clouds. We propose a pixel-wise and frame-wise loopable binomial self-compensation (BSC) algorithm to effectively and flexibly eliminate motion error in four-step PSP. Our mathematical model demonstrates that by summing successive motion-affected phase frames weighted by binomial coefficients, the motion error diminishes exponentially as the binomial order increases, enabling automatic error compensation through the motion-affected phase sequence, without the assistance of any intermediate variable. We demonstrate that our BSC outperforms existing methods in reducing motion error, while achieving a depth map frame rate equal to the camera's acquisition rate (90 fps), enabling high-accuracy 3D reconstruction with a quasi-single-shot frame rate.
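
A minimal numerical sketch of the binomial weighting idea: successive phase frames are averaged with binomial-coefficient weights so that motion error shrinks as the order grows. Phase wrapping and the paper's pixel-wise, frame-wise loopable formulation are omitted here for clarity.

```python
import numpy as np
from math import comb

def binomial_self_compensation(phase_frames, order=3):
    """
    phase_frames: (T, H, W) successive (unwrapped) phase maps from four-step PSP.
    Returns T - order compensated phase maps, each a binomial-weighted average of
    `order + 1` successive frames. Simplified: wrapping handling is omitted.
    """
    w = np.array([comb(order, i) for i in range(order + 1)], dtype=float)
    w /= w.sum()                      # weights sum to 1 (i.e., divide by 2**order)
    T = phase_frames.shape[0]
    out = np.zeros((T - order,) + phase_frames.shape[1:])
    for k in range(T - order):
        out[k] = np.tensordot(w, phase_frames[k:k + order + 1], axes=1)
    return out

phases = np.random.rand(10, 4, 4) * 2 * np.pi       # toy phase sequence
print(binomial_self_compensation(phases, order=3).shape)   # (7, 4, 4)
```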


# 188
Strong Double Blind
Non-Line-of-Sight Estimation of Fast Human Motion with Slow Scanning Imagers

Javier Grau Chopite · Patrick Hähn · Matthias B Hullin

Non-line-of-sight (NLoS) reconstruction, i.e., the task of imaging scenes beyond the camera's field of view, is often implemented using source-and-sensor systems that scan the visible region and analyze secondary reflections of light that has interacted with the hidden static scene. Estimating human activity around the corner will be a task of major interest for emerging NLoS applications, and some attempts have been reported in the recent literature. However, due to the long exposure times and comprehensive scans needed for NLoS sensing, the reconstruction of continuous movement remains prone to artifacts and is unreliable. In this paper, we analyze the interplay between dynamic scenes and scanning hardware to identify possible failure cases for filtering and data-driven approaches. Our studies indicate that existing reconstruction methods are prone to systematic error due to the space-time skew introduced by scanning setups. To alleviate this issue, we propose an image formation model for dynamic scenes that explicitly integrates motion skew. Using this model, we construct a baseline method for human pose estimation that achieves high accuracy, even at very slow scan rates.


# 192
Strong Double Blind
Synchronization of Projective Transformations

Rakshith Madhavan · Andrea Fusiello · Federica Arrigoni

Synchronization involves the task of inferring unknown vertex values (belonging to a group) in a graph, from edges labeled with vertex relations. While many matrix groups (e.g., rotations or permutations) have received extensive attention in Computer Vision, a complete solution for projectivities is lacking. Only the 3x3 case has been addressed so far, by mapping the problem onto the Special Linear Group, but the 4x4 projective case has remained unexplored and is the focus here. We propose novel strategies to address this task, and demonstrate their effectiveness in synthetic experiments, as well as on an application to projective Structure from Motion.
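
For intuition, the snippet below runs a generic spectral synchronization baseline on noise-free, fully connected 4x4 projectivity measurements; it is not the authors' solver, and it is exact only in this idealized setting.

```python
import numpy as np

def spectral_sync(n, rel, dim=4):
    """
    Generic spectral synchronization baseline (a sketch, not the paper's method).
    rel[(i, j)] holds the relative projectivity X_i @ inv(X_j); all pairs are assumed
    available and noise-free, in which case the block measurement matrix has rank
    `dim` and its column space equals the stacked node transforms (up to a gauge).
    """
    Z = np.zeros((n * dim, n * dim))
    for (i, j), Xij in rel.items():
        Z[i*dim:(i+1)*dim, j*dim:(j+1)*dim] = Xij
    U, _, _ = np.linalg.svd(Z)
    B = U[:, :dim]                                   # spans the stacked X_i up to a gauge
    blocks = [B[i*dim:(i+1)*dim] for i in range(n)]
    gauge = np.linalg.inv(blocks[0])                 # fix the gauge: node 0 -> identity
    return [Bi @ gauge for Bi in blocks]

# noise-free toy problem with random invertible 4x4 node transforms
rng = np.random.default_rng(0)
gt = [rng.standard_normal((4, 4)) for _ in range(5)]
rel = {(i, j): gt[i] @ np.linalg.inv(gt[j]) for i in range(5) for j in range(5)}
est = spectral_sync(5, rel)
print(np.allclose(est[2], gt[2] @ np.linalg.inv(gt[0])))    # True: recovered up to gauge
```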


# 187
Strong Double Blind
Semicalibrated Relative Pose from an Affine Correspondence and Monodepth

Petr Hrubý · Marc Pollefeys · Daniel Barath

We address the semi-calibrated relative pose estimation problem where we assume the principal point to be in the center of the image and estimate the focal lengths, relative rotation, and translation of two cameras. We introduce the first minimal solver that requires only a single affine correspondence in conjunction with predicted monocular depth. Recognizing its degeneracy when the correspondence stems from a fronto-parallel plane, we present an alternative solver adept at recovering the correct solution under such circumstances. By integrating these methods within the GC-RANSAC framework, we show they surpass standard approaches, delivering more accurate poses at comparable runtimes across large-scale, publicly available indoor and outdoor datasets. The code will be public.


# 175
Strong Double Blind
GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring

Emanuele Santellani · Martin Zach · Christian Sormann · Mattia Rossi · Andreas Kuhn · Friedrich Fraundorfer

The extraction of keypoints in images is at the basis of many computer vision applications, from localization to 3D reconstruction. Keypoints come with a score that permits ranking them according to their quality. While learned keypoints often exhibit better properties than handcrafted ones, their scores are not easily interpretable, making it virtually impossible to compare the quality of individual keypoints across methods. We propose a framework that can refine, and at the same time characterize with an interpretable score, the keypoints extracted by any method. Our approach leverages a modified robust Gaussian Mixture Model fit designed to both reject non-robust keypoints and refine the remaining ones. Our score comprises two components: one relates to the probability of extracting the same keypoint in an image captured from another viewpoint, the other relates to the localization accuracy of the keypoint. These two interpretable components permit a comparison of individual keypoints extracted across different methods. Through extensive experiments we demonstrate that, when applied to popular keypoint detectors, our framework consistently improves the repeatability of keypoints as well as their performance in homography and two/multiple-view pose recovery tasks.
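
An illustrative sketch of refining one keypoint from its cross-view observations with a Gaussian-plus-uniform-outlier EM fit, and turning the fit into a repeatability/localization score. The window area, iteration count, and score formulas are assumptions; the paper's modified robust GMM fit differs in the details.

```python
import numpy as np

def refine_and_score(obs, area=64.0, iters=20):
    """
    Toy refinement/scoring of one keypoint from its reprojected observations across
    views (obs: (N, 2) pixel coordinates). EM with one Gaussian plus a uniform
    outlier component over a window of `area` px^2. Illustrative only.
    """
    mu, cov, w_in = obs.mean(0), np.cov(obs.T) + 1e-6 * np.eye(2), 0.8
    for _ in range(iters):
        diff = obs - mu
        inv = np.linalg.inv(cov)
        g = np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff))
        g /= 2 * np.pi * np.sqrt(np.linalg.det(cov))
        resp = w_in * g / (w_in * g + (1 - w_in) / area)          # inlier responsibilities
        mu = (resp[:, None] * obs).sum(0) / resp.sum()
        diff = obs - mu
        cov = (resp[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(0) / resp.sum()
        cov += 1e-6 * np.eye(2)
        w_in = resp.mean()
    repeatability = w_in                                  # fraction of consistent detections
    localization = 1.0 / (1.0 + np.sqrt(np.trace(cov)))   # sharper cluster -> higher score
    return mu, repeatability * localization

obs = np.vstack([np.random.randn(20, 2) * 0.5 + [10, 10],   # consistent detections
                 np.random.rand(5, 2) * 8])                 # spurious ones
refined, score = refine_and_score(obs)
print(refined.round(2), round(score, 3))
```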


# 171
Strong Double Blind
LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System

Hongbeen Park · Minjeong Park · Giljoo Nam · Jinkyu Kim

Simultaneous Localization and Mapping (SLAM) has been crucial across various domains, including autonomous driving, mobile robotics, and mixed reality. Dense visual SLAM, leveraging RGB-D camera systems, offers advantages but faces challenges in achieving real-time performance, robustness, and scalability for large-scale scenes. Recent approaches utilizing neural implicit scene representations show promise but suffer from high computational costs and memory requirements. ESLAM introduced a plane-based tensor decomposition but still struggled with memory growth. Addressing these challenges, we propose a more efficient visual SLAM model, called LRSLAM, utilizing low-rank tensor decomposition methods. Our approach, leveraging the Six-axis and CP decompositions, achieves better convergence rates, memory efficiency, and reconstruction/localization quality than existing state-of-the-art approaches. Evaluation across diverse indoor RGB-D datasets demonstrates LRSLAM's superior performance in terms of parameter efficiency, processing time, and accuracy, retaining reconstruction and localization quality. Our code will be publicly available upon publication.
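
To make the low-rank idea concrete, here is a toy CP-decomposed 3D feature field queried by interpolating per-axis lines. It only illustrates the memory saving of rank decomposition; LRSLAM's Six-axis variant and SDF decoding are not reproduced.

```python
import numpy as np

class CPFeatureGrid:
    """Toy CP-decomposed 3D feature field: feature(x,y,z) = sum_r A_r(x)*B_r(y)*C_r(z) * f_r.
    Stores three sets of 1D lines plus per-rank feature vectors instead of a dense
    D**3 grid. A sketch of the low-rank idea only, not LRSLAM's exact decomposition."""

    def __init__(self, resolution=64, rank=16, feat_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.lines = rng.standard_normal((3, rank, resolution)) * 0.1   # A, B, C factors
        self.feats = rng.standard_normal((rank, feat_dim)) * 0.1
        self.res = resolution

    def query(self, pts):                       # pts: (N, 3) in [0, 1]
        grid = np.arange(self.res) / (self.res - 1)
        coeff = np.ones((pts.shape[0], self.lines.shape[1]))
        for axis in range(3):                   # linear interpolation along each axis
            for r in range(self.lines.shape[1]):
                coeff[:, r] *= np.interp(pts[:, axis], grid, self.lines[axis, r])
        return coeff @ self.feats               # (N, feat_dim)

field = CPFeatureGrid()
print(field.query(np.random.rand(5, 3)).shape)   # (5, 8)
# memory: 3*R*D + R*F values instead of D**3 * F for a dense feature volume
```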


# 189
Strong Double Blind
SRPose: Two-view Relative Pose Estimation with Sparse Keypoints

Rui Yin · Yulun Zhang · Zherong Pan · Jianjun Zhu · Cheng Wang · Biao Jia

Two-view pose estimation is essential for map-free visual relocalization and object pose tracking tasks. However, traditional matching methods suffer from time-consuming robust estimators, while deep learning-based pose regressors only cater to camera-to-world pose estimation, lacking generalizability to different image sizes and camera intrinsics. In this paper, we propose SRPose, a sparse keypoint-based framework for two-view relative pose estimation in camera-to-world and object-to-camera scenarios. SRPose consists of a sparse keypoint detector, an intrinsic-calibration position encoder, and promptable prior knowledge-guided attention layers. Given two RGB images of a fixed scene or a moving object, SRPose estimates the relative camera or 6D object pose transformation. Extensive experiments demonstrate that SRPose achieves competitive or superior performance compared to state-of-the-art methods in terms of accuracy and speed, showing generalizability to both scenarios. It is robust to varying image sizes and camera intrinsics, and can be deployed with low computing resources.


# 180
Strong Double Blind
Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences

Shishir Reddy Vutukur · Junwen Huang · Rasmus Laurvig Haugaard · Benjamin Busam · Tolga Birdal

Object pose distribution estimation is crucial in robotics for better path planning and handling of symmetric objects. Recent distribution estimation approaches employ contrastive learning by maximizing the likelihood of a single pose estimate in the absence of a CAD model. We propose a pose distribution estimation method leveraging symmetry-respecting correspondence distributions and shape information obtained using a CAD model. Contrastive learning-based approaches require an exhaustive amount of training images from different viewpoints to learn the distribution properly, which is not possible in realistic scenarios. Instead, we propose a pipeline that leverages correspondence distributions and shape information from the CAD model, which are later used to learn pose distributions. Moreover, having access to a correspondence-based pose distribution before learning image-conditioned pose distributions helps formulate the loss between distributions. This distributional prior also helps the network focus on obtaining sharper modes. With the CAD prior, our approach converges much faster and learns the distribution better by concentrating on sharper density near all valid modes, unlike contrastive approaches, which focus on a single mode at a time. We achieve benchmark results on the SYMSOL-I and T-Less datasets.


# 177
Strong Double Blind
U-COPE: Taking a Further Step to Universal 9D Category-level Object Pose Estimation

li zhang · Weiqing Meng · Yan Zhong · Bin Kong · Mingliang Xu · Jianming Du · Xue Wang · Rujing Wang · Liu Liu

Rigid and articulated objects are common in our daily lives. Pose estimation tasks for both types of objects have been extensively studied within their respective domains. However, a universal framework capable of estimating the pose of both rigid and articulated objects has yet to be reported. In this paper, we introduce a Universal 9D Category-level Object Pose Estimation (U-COPE) framework, designed to address this gap. Our approach offers a novel perspective on rigid and articulated objects, redefining their pose estimation problems to unify them into a common task. Leveraging either 3D point cloud or RGB-D image inputs, we extract Point Pair Features (PPF) independently from each object part for end-to-end learning. Moreover, instead of direct prediction as seen in prior art, we employ a universal voting strategy to derive decisive parameters crucial for object pose estimation. Our network is trained end-to-end to optimize three key objectives: Joint Information, Part Segmentation, and 9D pose estimation through parameter voting. Extensive experiments validate the robustness of our method in estimating poses for both rigid and articulated objects, demonstrating generalizability to unseen object instances as well. Notably, our approach achieves state-of-the-art performance on synthetic and real-world datasets. Our code will be publicly available soon.


# 178
EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

Chenhongyi Yang · Anastasia Tkach · Shreyas Hampali · Linguang Zhang · Elliot J Crowley · Cem Keskin

We present EgoPoseFormer, a simple yet effective transformer-based model for stereo egocentric human pose estimation. The main challenge in egocentric pose estimation is overcoming joint invisibility, which is caused by self-occlusion or the limited field of view (FOV) of head-mounted cameras. Our approach overcomes this challenge by incorporating a two-stage pose estimation paradigm: in the first stage, our model leverages global information to estimate each joint's coarse location; in the second stage, it employs a DETR-style transformer to refine the coarse locations by exploiting fine-grained stereo visual features. In addition, we present a deformable stereo attention operation to enable our transformer to effectively process multi-view features, which enables it to accurately localize each joint in the 3D world. We evaluate our method on the stereo UnrealEgo dataset and show it significantly outperforms previous approaches while being computationally efficient: it improves MPJPE by 27.4mm (45% improvement) with only 7.9% of the model parameters and 13.1% of the FLOPs compared to the state-of-the-art. Surprisingly, with proper training techniques, we find that even our first-stage pose proposal network can achieve superior performance compared to previous art. We also show that our method can be seamlessly extended to monocular settings, achieving state-of-the-art performance on the SceneEgo dataset and improving MPJPE by 25.5mm (21% improvement) compared to the best existing method with only 60.7% of the model parameters and 36.4% of the FLOPs. Code will be publicly available.
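
The reported gains are in MPJPE; for reference, the standard definition of that metric can be computed as below (units follow the inputs, typically millimeters).

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between predicted
    and ground-truth joints. pred, gt: (..., J, 3) arrays of 3D joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.random.rand(2, 16, 3) * 1000           # toy batch of 16-joint poses, in mm
gt = pred + np.random.randn(2, 16, 3) * 20
print(f"MPJPE: {mpjpe(pred, gt):.1f} mm")
```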


# 130
Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Fabien Baradel · Thomas Lucas · Matthieu Armando · Salma Galaaoui · Romain Brégier · Philippe Weinzaepfel · Grégory Rogez

We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating this dataset into training further enhances predictions, particularly for hands, enabling us to achieve competitive performance. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously. We train models with various backbone sizes and input resolutions. In particular, using a ViT-S backbone and 448x448 input images already yields a fast and competitive model, while larger models and higher resolutions further improve performance.
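
A generic sketch of how people can be read off a coarse 2D heatmap (local maxima above a threshold, mapped back to pixel coordinates). The threshold and the ViT patch size used to convert token coordinates are assumptions, not Multi-HMR's exact post-processing.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def heatmap_to_detections(heatmap, thresh=0.3, patch=16):
    """Keep cells that are a 3x3 local maximum above a confidence threshold and map
    token-grid coordinates back to pixel centers. Threshold and patch size assumed."""
    is_peak = (heatmap == maximum_filter(heatmap, size=3)) & (heatmap > thresh)
    rows, cols = np.nonzero(is_peak)
    scores = heatmap[rows, cols]
    centers = np.stack([(cols + 0.5) * patch, (rows + 0.5) * patch], axis=1)
    return centers, scores

hm = np.zeros((14, 14))
hm[3, 4], hm[10, 11] = 0.9, 0.7           # two synthetic "people"
centers, scores = heatmap_to_detections(hm)
print(centers, scores)
```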


# 190
Strong Double Blind
Cut out the Middleman: Revisiting Pose-based Gait Recognition

YANG FU · Saihui Hou · Shibei Meng · Xuecai Hu · Chunshui Cao · Xu Liu · Yongzhen Huang

Recent pose-based gait recognition methods, which utilize human skeletons as the model input, have demonstrated significant potential in handling variations in clothing and occlusions. However, methods relying on such skeletons to encode pose are mainly constrained by two problems: (1) poor performance caused by the loss of shape information, and (2) lack of generalizability. Addressing these limitations, we revisit pose-based gait recognition and develop GaitHeat, a heatmap-based framework that largely enhances performance and robustness by utilizing a new modality to encode pose rather than keypoint coordinates. We focus on two aspects: the pipeline and the extraction of multi-channel heatmap features. Specifically, the process of resizing and centering is performed in the RGB space to largely preserve the integrity of heatmap information. To further boost generalization across various datasets, we propose a pose-guided heatmap alignment module to eliminate the influence of gait-irrelevant covariates. Furthermore, a global-local network incorporating an efficient fusion branch is designed to improve the extraction of semantic information. Compared to skeleton-based methods, GaitHeat exhibits superior performance in learning gait features and demonstrates effective generalization across different datasets. Experiments on three datasets reveal that our proposed method achieves state-of-the-art results for pose-based gait recognition, comparable to those of silhouette-based approaches. The code will be made available upon acceptance.


# 122
Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?

Rosario Leonardi · Antonino Furnari · Francesco Ragusa · Giovanni Maria Farinella

In this study, we investigate the effectiveness of synthetic data in enhancing egocentric hand-object interaction detection. Via extensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, our findings reveal how to exploit synthetic data for the HOI detection task when real labeled data are scarce or unavailable. Specifically, by leveraging only 10% of real labeled data, we achieve improvements in Overall AP compared to baselines trained exclusively on real data of: +5.67% on EPIC-KITCHENS VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Our analysis is supported by a novel data generation pipeline and the newly introduced HOI-Synth benchmark which augments existing datasets with synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. Data, code, and data generation tools to support future research are released at: https://fpv-iplab.github.io/HOI-Synth/.


# 195
EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Jiaxi Jiang · Paul Streli · Manuel Meier · Christian Holz

Full-body ego-pose estimation from head and hand poses alone has become an active area of research to power articulate avatar representation on headset-based platforms. However, existing methods over-rely on the confines of the motion-capture spaces in which datasets were recorded, while simultaneously assuming continuous capture of joint motions and uniform body dimensions. In this paper, we propose EgoPoser, which overcomes these limitations by 1) rethinking the input representation for headset-based ego-pose estimation and introducing a novel motion decomposition method that predicts full-body pose independent of global positions, 2) robustly modeling body pose from intermittent hand position and orientation tracking only when inside a headset's field of view, and 3) generalizing across various body sizes for different users. Our experiments show that EgoPoser outperforms state-of-the-art methods both qualitatively and quantitatively, while maintaining a high inference speed of over 600 fps. EgoPoser establishes a robust baseline for future work, where full-body pose estimation needs no longer rely on outside-in capture and can scale to large-scene environments.


# 184
Strong Double Blind
3D Hand Sequence Recovery from Real Blurry Images and Event Stream

Joonkyu Park · Gyeongsik Moon · Weipeng Xu · Evan Kaseman · Takaaki Shiratori · Kyoung Mu Lee

Although hands frequently exhibit motion blur due to their dynamic nature, existing approaches for 3D hand recovery often disregard the impact of motion blur in hand images. Blurry hand images contain hands from multiple time steps, lacking precise hand location at a specific time step and introducing temporal ambiguity, leading to multiple possible hand trajectories. To address this issue and in the absence of datasets with real blur, we introduce the EBH dataset, which provides 1) hand images with real motion blur and 2) event data for authentic representation of fast hand movements. In conjunction with our new dataset, we present EBHNet, a novel network capable of recovering 3D hands from diverse input combinations, including blurry hand images, events, or both. Here, the event stream enhances motion understanding in blurry hands, addressing temporal ambiguity. Recognizing that blurry hand images include not only single 3D hands at a time step but also multiple hands along their motion trajectories, we design EBHNet to generate 3D hand sequences in motion. Moreover, to enable our EBHNet to predict 3D hands at novel, unsupervised time steps using a single shared module, we employ a Transformer-based module, temporal splitter, into EBHNet. Our experiments show the superior performance of EBH and EBHNet, especially in handling blurry hand images, making them valuable in real-world applications. The code and dataset will be released.


# 174
Strong Double Blind
Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics

Woojin Cho · Jihyun Lee · Minjae Yi · Minje Kim · Taeyun Woo · Donghwan Kim · Taewook Ha · Hyokeun Lee · Je-Hwan Ryu · Woontack Woo · Tae-Kyun Kim

Existing datasets for 3D hand-object interaction are limited either in data cardinality, data variations in interaction scenarios, or the quality of annotations. In this work, we present a comprehensive new training dataset for hand-object interaction called HOGraspNet. It is the only real dataset that captures full grasp taxonomies, providing grasp annotation and wide intraclass variations. Using grasp taxonomies as atomic actions, their spatial and temporal combinations can represent complex hand activities around objects. We select 22 rigid objects from the YCB dataset and 8 other compound objects using shape and size taxonomies, ensuring coverage of all hand grasp configurations. The dataset includes diverse hand shapes from 99 participants aged 10 to 74, continuous video frames, and 1.5M sparsely sampled RGB-Depth frames with annotations. It offers labels for 3D hand and object meshes, 3D keypoints, contact maps, and grasp labels. Accurate hand and object 3D meshes are obtained by fitting the hand parametric model (MANO) and the hand implicit function (HALO) to multi-view RGBD frames, with the MoCap system used only for objects. Note that HALO fitting does not require any parameter tuning, enabling scalability to the dataset's size with comparable accuracy to MANO. We evaluate HOGraspNet on relevant tasks: grasp classification and 3D hand pose estimation. The results show performance variations based on grasp type and object class, indicating the potential importance of the interaction space captured by our dataset. The provided data aims at learning universal shape priors or foundation models for 3D hand-object interaction. Our dataset and code are available at https://hograspnet2024.github.io/.


# 179
Learning Cross-hand Policies of High-DOF Reaching and Grasping

Qijin She · Shishun Zhang · Yunfan Ye · Ruizhen Hu · Kai Xu

Reaching-and-grasping is a fundamental skill for robotic manipulation, but existing methods usually train models on a specific gripper and cannot be reused on another gripper without retraining. In this paper, we propose a novel method that can learn a unified policy model that can be easily transferred to different dexterous grippers. Our method consists of two stages: a gripper-agnostic policy model that predicts the displacements of predefined key points on the gripper, and a gripper-specific adaptation model that translates these displacements into adjustments for controlling the grippers' joints. The gripper state and interactions with objects are captured at the finger level using robust geometric representations integrated with a transformer-based network to address variations in gripper morphology and geometry. We evaluate our method on several dexterous grippers and objects of diverse shapes, and the results show our method significantly outperforms the baseline methods. Our method pioneers the transfer of grasp policies across different dexterous grippers and demonstrates the potential of learning generalizable and transferable manipulation skills for various robotic hands.


# 202
Strong Double Blind
Free-Viewpoint Video of Outdoor Sports Using a Drone

Zhengdong Hong

We propose a novel drone application under real-world scenarios: free-viewpoint rendering of outdoor sports scenes, including the dynamic athlete and the 360° background. Outdoor sports involve long-range human motions and large-scale scene structures, which make the task rather challenging. Existing methods either rely on dense camera arrays, which are costly, or on a handheld moving camera, which struggles to handle real sports scenes. We build a novel drone-based system using an RGB camera to reconstruct the 4D dynamic human along with the 3D unbounded scene, rendering free-viewpoint videos at any time. We also propose submodules for calibration and human motion capture, as a system-level design for improved robustness and efficiency. We collect a dataset, AerialRecon, and conduct extensive experiments on real-world scenarios. Compared with existing SOTA systems, our system demonstrates superior performance and applicability to real-world outdoor sports scenes.


# 135
Strong Double Blind
Unsupervised Exposure Correction

Ruodai Cui · Li Niu · Guosheng Hu

Current exposure correction methods encounter three challenges: labor-intensive paired data annotation, limited generalizability, and performance degradation in low-level computer vision tasks. In this work, we introduce an innovative Unsupervised Exposure Correction (UEC) method that eliminates the need for manual annotations, offers improved generalizability, and enhances performance in low-level downstream tasks. Our model is trained using freely available paired data from an emulated Image Signal Processing (ISP) pipeline. This approach does not need expensive manual annotations, thereby minimizing individual style biases from the annotation process and consequently improving its generalizability. Furthermore, we present a large-scale Radiometry Correction Dataset, specifically designed to emphasize exposure variations, to facilitate unsupervised learning. In addition, we develop a transformation function that preserves image details and outperforms state-of-the-art supervised methods, while utilizing only 0.01% of their parameter count. Our work further investigates the broader impact of exposure correction on downstream tasks, including edge detection, demonstrating its effectiveness in mitigating the adverse effects of poor exposure on low-level features. The source code and dataset will be made publicly available.


# 136
Strong Double Blind
Improving Domain Generalization in Self-Supervised Monocular Depth Estimation via Stabilized Adversarial Training

Yuanqi Yao · Gang Wu · Kui Jiang · Siao Liu · Jian Kuai · Xianming Liu · Junjun Jiang

Learning a self-supervised Monocular Depth Estimation (MDE) model with great generalization remains significantly challenging. Despite the success of adversarial augmentation in the supervised learning generalization, naively incorporating it into self-supervised MDE methods potentially causes over-regularization, suffering from severe performance degradation. In this paper, we conduct qualitative analysis and illuminate the main causes: (i) inherent sensitivity in the UNet-alike depth network and (ii) dual optimization conflict caused by over-regularization. To tackle these issues, we propose a general adversarial training framework, named Stabilized Conflict-optimization Adversarial Training (SCAT), integrating adversarial data augmentation into self-supervised MDE methods to achieve a balance between stability and generalization. Specifically, we devise an effective Scaling Depth Network that tunes the coefficients of long skip connection in DepthNet and effectively stabilizes the training process. Then, we propose a Conflict Gradient Surgery strategy, which progressively integrates the adversarial gradient and optimizes the model toward a conflict-free direction. Extensive experiments on five benchmarks demonstrate that SCAT can achieve state-of-the-art performance and significantly improve the generalization capability of existing self-supervised MDE methods.
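
The conflict-free direction idea can be illustrated with a PCGrad-style projection on flattened gradients, as sketched below; the paper's progressive integration schedule and exact surgery rule are not reproduced.

```python
import torch

def surgery(g_clean, g_adv):
    """PCGrad-style gradient surgery on flattened gradients: if the adversarial
    gradient conflicts with the clean one (negative inner product), remove its
    component along the clean gradient before combining. A sketch of the general
    idea only, not the paper's exact Conflict Gradient Surgery strategy."""
    dot = torch.dot(g_adv, g_clean)
    if dot < 0:                                    # conflicting directions
        g_adv = g_adv - dot / g_clean.norm().pow(2) * g_clean
    return g_clean + g_adv

# toy usage on a single flattened parameter gradient
g_clean = torch.tensor([1.0, 0.0])
g_adv = torch.tensor([-0.5, 1.0])                  # conflicts with g_clean
print(surgery(g_clean, g_adv))                     # tensor([1., 1.])
```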


# 204
Strong Double Blind
Deep Cost Ray Fusion for Sparse Depth Video Completion

Jungeon Kim · Soongjin Kim · Jaesik Park · Seungyong Lee

In this paper, we present a learning-based framework for improved sparse depth video completion. Given a sparse depth map and a color image, our approach constructs a cost volume for a given viewpoint on depth hypothesis planes. To effectively handle sequential cost volumes of multiple viewpoints, we introduce a learning-based cost volume fusion framework, namely RayFusion, that effectively leverages the attention mechanism for each pair of overlapped rays in cost volumes. By leveraging feature statistics accumulated over time, our proposed framework consistently outperforms or rivals state-of-the-art approaches on diverse indoor and outdoor datasets, including the KITTI Depth Completion benchmark, the VOID Depth Completion benchmark, and the ScanNetV2 dataset, using 94.5% fewer network parameters than the state-of-the-art approach, LRRU.
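
For readers unfamiliar with the construct, the sketch below builds a minimal plane-sweep cost volume on depth hypothesis planes with nearest-neighbour sampling; it illustrates the general idea only and is not the paper's ray-fusion network.

```python
import numpy as np

def plane_sweep_cost_volume(ref_feat, src_feat, K, R, t, depths):
    """
    Minimal plane-sweep cost volume: for each depth hypothesis, back-project reference
    pixels, reproject them into the source view, and store the absolute feature
    difference (nearest-neighbour sampling, no occlusion handling).
    ref_feat, src_feat: (H, W, C); K: (3, 3) intrinsics; R, t: source-from-reference pose.
    """
    H, W, C = ref_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T   # (3, H*W)
    rays = np.linalg.inv(K) @ pix
    ref_flat = ref_feat.reshape(-1, C)
    cost = np.empty((len(depths), H, W))
    for i, d in enumerate(depths):
        p = K @ (R @ (rays * d) + t[:, None])              # project into the source view
        u = np.round(p[0] / p[2]).astype(int)
        v = np.round(p[1] / p[2]).astype(int)
        valid = (p[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        ci = np.full(H * W, np.inf)                        # invalid reprojections get inf
        ci[valid] = np.abs(src_feat[v[valid], u[valid]] - ref_flat[valid]).mean(-1)
        cost[i] = ci.reshape(H, W)
    return cost                    # (D, H, W); argmin over D gives the best hypothesis

# toy usage: identical views, so every hypothesis yields zero cost
K = np.array([[50.0, 0, 16], [0, 50.0, 16], [0, 0, 1]])
feat = np.random.rand(32, 32, 8)
cv = plane_sweep_cost_volume(feat, feat, K, np.eye(3), np.zeros(3), [1.0, 2.0, 4.0])
print(cv.shape, np.isfinite(cv).all())
```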


# 160
PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

Zhenyu Li · Shariq Farooq Bhat · Peter Wonka

This paper introduces PatchRefiner, an advanced framework for metric single image depth estimation aimed at high-resolution real-domain inputs. While depth estimation is crucial for applications such as autonomous driving, 3D generative modeling, and 3D reconstruction, achieving accurate high-resolution depth in real-world scenarios is challenging due to the constraints of existing architectures and the scarcity of detailed real-world depth data. PatchRefiner adopts a tile-based methodology, reconceptualizing high-resolution depth estimation as a refinement process, which results in notable performance enhancements. Utilizing a pseudo-labeling strategy that leverages synthetic data, PatchRefiner incorporates a Detail and Scale Disentangling (DSD) loss to enhance detail capture while maintaining scale accuracy, thus facilitating the effective transfer of knowledge from synthetic to real-world data. Our extensive evaluations demonstrate PatchRefiner's superior performance, significantly outperforming existing benchmarks on the Unreal4KStereo dataset by 18.1% in terms of the root mean squared error (RMSE) and showing marked improvements in detail accuracy and consistent scale estimation on diverse real-world datasets like CityScape, ScanNet++, and ETH3D.


# 166
Strong Double Blind
Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor

Andrea Conti · Matteo Poggi · Valerio CAMBARERI · Stefano Mattoccia

High frame rate and accurate depth estimation play an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction by significantly reducing the streaming requirements on the depth sensor. We present extended evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.


# 168
Strong Double Blind
UniCal: Unified Neural Sensor Calibration

Ze Yang · George G Chen · Haowei Zhang · Kevin Ta · Ioan Andrei Bârsan · Daniel Murphy · Sivabalan Manivasagam · Raquel Urtasun

Self-driving vehicles (SDVs) require accurate calibration of LiDARs and cameras to fuse sensor data accurately for autonomy. Traditional calibration methods typically leverage fiducials captured in a controlled and structured scene and compute correspondences to optimize over. These approaches are costly and require substantial infrastructure and operations, making it challenging to scale for vehicle fleets. In this work, we propose UniCal, a unified framework for effortlessly calibrating SDVs equipped with multiple LiDARs and cameras. Our approach is built upon a differentiable scene representation capable of rendering multi-view geometrically and photometrically consistent sensor observations. We jointly learn the sensor calibration and the underlying scene representation through differentiable volume rendering, utilizing outdoor sensor data without the need for specific calibration fiducials. This "drive-and-calibrate" approach significantly reduces costs and operational overhead compared to existing calibration systems, enabling efficient calibration for large SDV fleets at scale. To ensure geometric consistency across observations from different sensors, we introduce a novel surface alignment loss that combines feature-based registration with neural rendering, as well as a coarse-to-fine sampling approach to optimize regions of interest for sensor alignment. Comprehensive evaluations on multiple datasets demonstrate that UniCal outperforms or matches the accuracy of existing calibration approaches while being more efficient, demonstrating the value of UniCal for scalable calibration.


# 127
Strong Double Blind
Multi-modal Crowd Counting via a Broker Modality

Haoliang Meng · Xiaopeng Hong · Chenhao Wang · Miao Shang · Wangmeng Zuo

Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models. Additionally, we identify and address the ghosting effect caused by direct cross-modal image fusion in multi-modal crowd counting. Through extensive experimental evaluations on popular multi-modal crowd counting datasets, we demonstrate the effectiveness of our method, which introduces only 4 million additional parameters, yet achieves promising results. We will release the source code upon the acceptance of the paper.


# 134
Strong Double Blind
OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

Jinghua Hou · Tong Wang · Xiaoqing Ye · Zhe Liu · Shi Gong · Xiao Tan · Errui Ding · Jingdong Wang · Xiang Bai

Accurate depth information is crucial for enhancing the performance of multi-view 3D object detection. Despite the success of some existing multi-view 3D detectors that utilize pixel-wise depth supervision, they overlook two significant phenomena: 1) the depth supervision obtained from LiDAR points is usually distributed on the surface of the object, which is problematic for existing DETR-based 3D detectors because it lacks the depth of the 3D object center; 2) for distant objects, fine-grained depth estimation of the whole object is more challenging. Therefore, we argue that the object-wise depth (or 3D center of the object) is essential for accurate detection. In this paper, we propose a new multi-view 3D object detector named OPEN, whose main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding. Specifically, we first employ an object-wise depth encoder, which takes the pixel-wise depth map as a prior, to accurately estimate the object-wise depth. Then, we utilize the proposed object-wise position embedding to encode the object-wise depth information into the transformer decoder, thereby producing 3D object-aware features for final detection. Extensive experiments verify the effectiveness of our proposed method. Furthermore, OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark. Code will be available.
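For intuition only, here is a hypothetical sketch of what an object-wise position embedding could look like: a small MLP encodes each query's 2D reference point together with its estimated object-wise depth and adds the result to the decoder queries. Module names, tensor names, and dimensions are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical illustration (names and dims are assumptions, not the paper's code):
# encode (2D reference point, object-wise depth) with an MLP and add it to the
# transformer decoder queries so later attention layers become depth-aware.
import torch
import torch.nn as nn

class ObjectWisePosEmbed(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, queries, ref_points_2d, object_depth):
        # queries: (B, N, C), ref_points_2d: (B, N, 2), object_depth: (B, N, 1)
        pos = torch.cat([ref_points_2d, object_depth], dim=-1)
        return queries + self.mlp(pos)

queries = torch.randn(2, 900, 256)          # decoder object queries
refs = torch.rand(2, 900, 2)                # normalized 2D reference points
depth = torch.rand(2, 900, 1) * 60.0        # estimated object-wise depth in meters
print(ObjectWisePosEmbed()(queries, refs, depth).shape)   # torch.Size([2, 900, 256])
```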


# 132
Strong Double Blind
FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

Zheng Jiang · Jinqing Zhang · Yanan Zhang · Qingjie Liu · Zhenghui HU · Baohui Wang · Yunhong Wang

Although multi-view 3D object detection based on the Bird's-Eye-View (BEV) paradigm has garnered widespread attention as an economical and deployment-friendly perception solution for autonomous driving, there is still a performance gap compared to LiDAR-based methods. In recent years, several cross-modal distillation methods have been proposed to transfer beneficial information from teacher models to student models, with the aim of enhancing performance. However, these methods face challenges due to discrepancies in feature distribution originating from different data modalities and network structures, making knowledge transfer exceptionally challenging. In this paper, we propose a Foreground Self-Distillation (FSD) scheme that effectively avoids the issue of distribution discrepancies, maintaining remarkable distillation effects without the need for pre-trained teacher models or cumbersome distillation strategies. Additionally, we design two Point Cloud Intensification (PCI) strategies to compensate for the sparsity of point clouds by frame combination and pseudo point assignment. Finally, we develop a Multi-Scale Foreground Enhancement (MSFE) module to extract and fuse multi-scale foreground features using predicted elliptical Gaussian heatmaps, further improving the model's performance. We integrate all the above innovations into a unified framework named FSD-BEV. Extensive experiments on the nuScenes dataset show that FSD-BEV achieves state-of-the-art performance, highlighting its effectiveness.
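As an illustration of the heatmap ingredient mentioned above, the sketch below renders an elliptical Gaussian on a BEV grid from a box center, extent, and yaw; the grid size and sigma scaling are assumptions, not the paper's exact recipe.

```python
# Illustrative sketch: render an elliptical Gaussian heatmap on a BEV grid from
# a box center, extent, and yaw. Grid size and sigma scaling are assumptions.
import numpy as np

def elliptical_gaussian_heatmap(grid_hw, center, extent, yaw, sigma_scale=0.5):
    H, W = grid_hw
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    dx, dy = xs - center[0], ys - center[1]
    c, s = np.cos(yaw), np.sin(yaw)
    u = c * dx + s * dy                      # offsets rotated into the box frame
    v = -s * dx + c * dy
    sx, sy = sigma_scale * extent[0], sigma_scale * extent[1]
    return np.exp(-0.5 * ((u / sx) ** 2 + (v / sy) ** 2))

heat = elliptical_gaussian_heatmap((200, 200), center=(120, 80),
                                   extent=(18, 8), yaw=np.deg2rad(30))
print(heat.shape, round(float(heat.max()), 3))   # (200, 200) 1.0
```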


# 69
Strong Double Blind
MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain

Timothy Chase · Karthik Dantu

The visual detection and tracking of surface terrain is required for spacecraft to safely land on or navigate within close proximity to celestial objects. Current approaches rely on template matching with pre-gathered patch-based features, which are expensive to obtain and a limiting factor in perceptual capability. While recent literature has focused on in-situ detection methods to enhance navigation and operational autonomy, robust description is still needed. In this work, we explore metric learning as a lightweight feature description mechanism and find that current solutions fail to address inter-class similarity and multi-view observational geometry. We attribute this to the view-unaware attention mechanism and introduce Multi-view Attention Regularizations (MARs) to constrain the channel and spatial attention across multiple feature views, regularizing the what and where of attention focus. We thoroughly analyze many modern metric learning losses with and without MARs and demonstrate improved terrain-feature recognition performance by upwards of 85%. We additionally introduce the Luna-1 dataset, consisting of Moon crater landmarks and reference navigation frames from NASA mission data to support future research in this difficult task. Luna-1 and source code are publicly available at https://droneslab.github.io/mars/.


# 129
SparseRadNet: Sparse Perception Neural Network on Subsampled Radar Data

Jialong Wu · Mirko Meuter · Markus Schoeler · Matthias Rottmann

Radar-based perception has gained increasing attention in autonomous driving, yet the inherent sparsity of radars poses challenges. Radar raw data often contains excessive noise, whereas radar point clouds retain only limited information. In this work, we holistically treat the sparse nature of radar data by introducing an adaptive subsampling method together with a tailored network architecture that exploits the sparsity patterns to discover global and local dependencies in the radar signal. Our subsampling module selects a subset of pixels from range-Doppler (RD) spectra that contribute most to the downstream perception tasks. To improve the feature extraction on sparse subsampled data, we propose a new way of applying graph neural networks on radar data and design a novel two-branch backbone to capture both global and local neighbor information. An attentive fusion module is applied to combine features from both branches. Experiments on the RADIal dataset show that our SparseRadNet exceeds state-of-the-art (SOTA) performance in object detection and achieves close to SOTA accuracy in freespace segmentation, while using sparse subsampled input data.
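The learned subsampler itself is not shown here, so the following is a crude energy-based stand-in that merely illustrates the data format: keeping the top-k range-Doppler bins of a synthetic spectrum as the sparse input passed downstream.

```python
# Crude stand-in for the learned subsampler: keep the top-k highest-energy
# range-Doppler bins of a synthetic spectrum as the sparse downstream input.
import numpy as np

rng = np.random.default_rng(0)
rd = rng.normal(size=(256, 512)) + 1j * rng.normal(size=(256, 512))
rd[100, 300] += 40                            # one strong synthetic target return

energy = np.abs(rd) ** 2
k = int(0.01 * energy.size)                   # keep 1% of the bins
idx = np.argpartition(energy.ravel(), -k)[-k:]
rows, cols = np.unravel_index(idx, energy.shape)

sparse_points = np.stack([rows, cols, energy[rows, cols]], axis=1)
print(sparse_points.shape)                    # (1310, 3): range bin, Doppler bin, energy
```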


# 143
UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Jian Zou · Tianyu Huang · Guanglei Yang · Zhenhua Guo · Tao Luo · Chun-Mei Feng · Wangmeng Zuo

Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it's commonplace to deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful representations, MAE methods face a noticeable challenge in addressing this integration. This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming to pioneer a more efficient fusion of two distinct modalities. To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM^2AE. This model stands as a potent yet straightforward multi-modal self-supervised pre-training framework, mainly consisting of two designs. First, it projects the features from both modalities into a cohesive 3D volume space, ingeniously expanded from the bird's eye view (BEV) to include the height dimension. The extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, the Multi-modal 3D Interactive Module (MMIM) is invoked to facilitate efficient inter-modal interaction. Extensive experiments conducted on the nuScenes Dataset attest to the efficacy of UniM^2AE, indicating enhancements in 3D object detection and BEV map segmentation by 1.2% NDS and 6.5% mIoU, respectively.


# 67
DeTra: A Unified Model for Object Detection and Trajectory Forecasting

Sergio Casas · Ben T Agro · Jiageng Mao · Thomas Gilles · ALEXANDER Y CUI · Enxu Li · Raquel Urtasun

The tasks of object detection and trajectory forecasting play a crucial role in understanding the scene for autonomous driving. These tasks are typically executed in a cascading manner, making them prone to compounding errors. Furthermore, there is usually a very thin interface between the two tasks, creating a lossy information bottleneck. To address these challenges, our approach formulates the union of the two tasks as a trajectory refinement problem, where the first pose is the detection (current time), and the subsequent poses are the waypoints of the multiple forecasts (future time). To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects directly from LiDAR point clouds and high-definition maps. We call this model DeTra, short for object DEtection and TRAjectory forecasting. In our experiments, we observe that DeTra outperforms the state-of-the-art on Argoverse 2 Sensor and Waymo Open Dataset by a large margin, across a broad range of metrics. Last but not least, we perform extensive ablation studies that show the value of refinement for this task, that every proposed component contributes positively to its performance, and that the key design choices are well founded.


# 68
RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception

Xiaosu Zhu · Hualian Sheng · Sijia Cai · Bing Deng · Shaopeng Yang · Qiao Liang · Ken Chen · Lianli Gao · Jingkuan Song · Jieping Ye

We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird's Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include significantly large perception area, full scene coverage and crowded traffic. More specifically, our dataset provides a remarkable 21.13M 3D annotations within 64,000 m². To relieve the expensive costs of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. After that, we organize a comprehensive study for current BEV methods on RoScenes in terms of effectiveness and efficiency. Tested methods suffer from the vast perception area and variation of sensor layout across scenes, resulting in performance levels falling below expectations. To this end, we propose RoBEV that incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state of the art by a large margin on the validation set without extra computational overhead. Our dataset and devkit are at https://roscenes.github.io.


# 161
Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting

Yunzhi Yan · Haotong Lin · Chenxu Zhou · Weijie Wang · Haiyang Sun · Kun Zhan · Xianpeng Lang · Xiaowei Zhou · Sida Peng

This paper aims to tackle the problem of modeling dynamic urban streets for autonomous driving scenes. Recent methods extend NeRF by incorporating tracked vehicle poses to animate vehicles, enabling photo-realistic view synthesis of dynamic urban street scenes. However, a significant limitation of these methods is their slow training and rendering speed. We introduce Street Gaussians, a new explicit scene representation that tackles this limitation. Specifically, the dynamic urban scene is represented as a set of point clouds equipped with semantic logits and 3D Gaussians, each associated with either a foreground vehicle or the background. To model the dynamics of foreground object vehicles, each object point cloud is optimized together with its tracked poses, along with a 4D spherical harmonics model for the dynamic appearance. The explicit representation allows easy composition of object vehicles and background, which in turn allows for scene editing operations and rendering at 135 FPS (1066 × 1600 resolution) within half an hour of training. The proposed method is evaluated on multiple challenging benchmarks. Experiments show that Street Gaussians consistently outperforms state-of-the-art methods across all datasets. The code will be released to ensure reproducibility.


# 28
Strong Double Blind
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines

Zidong Wang · Zeyu Lu · Di Huang · Tong He · Xihui Liu · Wanli Ouyang · Lei Bai

In this paper, we introduce PredBench, a benchmark tailored for the holistic evaluation of spatio-temporal prediction networks. Despite significant progress in this field, there remains a lack of a standardized framework for a detailed and comparative analysis of various prediction network architectures. PredBench addresses this gap by conducting large-scale experiments, upholding standardized and appropriate experimental settings, and implementing multi-dimensional evaluations. This benchmark integrates 12 widely adopted methods with 15 diverse datasets across multiple application domains, offering extensive evaluation of contemporary spatio-temporal prediction networks. Through meticulous calibration of prediction settings across various applications, PredBench ensures evaluations relevant to their intended use and enables fair comparisons. Moreover, its multi-dimensional evaluation framework broadens the analysis with a comprehensive set of metrics, providing deep insights into the capabilities of models. The findings from our research offer strategic directions for future developments in the field. Further, we will release our extensive codebase to encourage and support continued advancements in spatio-temporal prediction research.


# 140
Strong Double Blind
Sparse Refinement for Efficient High-Resolution Semantic Segmentation

Zhijian Liu · Zhuoyang Zhang · Samir Khaki · Shang Yang · Haotian Tang · Chenfeng Xu · Kurt Keutzer · Song Han

Semantic segmentation empowers numerous real-world applications, such as autonomous driving and augmented/mixed reality. These applications often operate on high-resolution images (e.g., 8 megapixels) to capture the fine details. However, this comes at the cost of considerable computational complexity, hindering the deployment in latency-sensitive scenarios. In this paper, we introduce SparseRefine, a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. Based on coarse low-resolution outputs, SparseRefine first uses an entropy selector to identify a sparse set of pixels with high entropy. It then employs a sparse feature extractor to efficiently generate the refinements for those pixels of interest. Finally, it leverages a gated ensembler to apply these sparse refinements to the initial coarse predictions. SparseRefine can be seamlessly integrated into any existing semantic segmentation model, whether CNN- or ViT-based. SparseRefine achieves significant speedup: 1.5 to 3.9 times when applied to HRNet-W48, SegFormer-B5, Mask2Former-T/L and SegNeXt-L on Cityscapes, with negligible to no loss of accuracy. We will release the code to reproduce our results. Our "dense+sparse" paradigm paves the way for efficient high-resolution visual computing.
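A minimal sketch of the entropy-selection step, under assumed shapes and a hypothetical keep ratio (the released implementation may differ): compute per-pixel prediction entropy from the upsampled coarse logits and keep only the most uncertain pixels for refinement.

```python
# Minimal sketch of entropy-based pixel selection (assumed shapes and keep ratio):
# compute per-pixel prediction entropy from upsampled coarse logits and keep
# only the most uncertain pixels for high-resolution refinement.
import torch
import torch.nn.functional as F

coarse_logits = torch.randn(1, 19, 128, 256)          # e.g. 19 Cityscapes classes, low res
logits_up = F.interpolate(coarse_logits, scale_factor=4,
                          mode="bilinear", align_corners=False)
probs = logits_up.softmax(dim=1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)   # (1, H, W)

keep_ratio = 0.05                                      # refine ~5% of the pixels
thresh = torch.quantile(entropy.flatten(), 1 - keep_ratio)
refine_mask = entropy > thresh                         # sparse set of high-entropy pixels
print(round(float(refine_mask.float().mean()), 3))     # ~0.05
```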


# 259
InsMapper: Exploring Inner-instance Information for Vectorized HD Mapping

Zhenhua Xu · Kwan-Yee K. Wong · Hengshuang ZHAO

Vectorized high-definition (HD) maps contain detailed information about surrounding road elements, which are crucial for various downstream tasks in modern autonomous vehicles, such as motion planning and vehicle control. Recent works attempt to directly detect the vectorized HD map as a point set prediction task, achieving notable detection performance improvements. However, these methods usually overlook and fail to analyze the important inner-instance correlations between predicted points, impeding further advancements. To address this issue, we investigate the utilization of inner-instance information for vectorized high-definition mapping through transformers, and propose a powerful system named InsMapper, which effectively harnesses inner-instance information through three carefully crafted designs: hybrid query generation, inner-instance query fusion, and inner-instance feature aggregation. The first two modules can better initialize queries for line detection, while the last one refines predicted line instances. InsMapper is highly adaptable and can be seamlessly modified to align with the most recent HD map detection frameworks. Extensive experimental evaluations are conducted on the challenging nuScenes and Argoverse 2 datasets, where InsMapper surpasses the previous state-of-the-art method, demonstrating its effectiveness and generality.


# 285
PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors

Tianyuan Yuan · Mao Yucheng · Jiawei Yang · Yicheng LIU · Yue Wang · Hang Zhao

Autonomous vehicles rely extensively on perception systems to navigate and interpret their surroundings. Despite significant recent advancements in these systems, challenges persist under conditions like occlusion, extreme lighting, or in unfamiliar urban areas. Unlike these systems, humans do not solely depend on immediate observations to perceive the environment. In navigating new cities, humans gradually develop a preliminary mental map to supplement real-time perception during subsequent visits. Inspired by this human approach, we introduce a novel framework, PreSight, that leverages past traversals to construct static prior memories, enhancing online perception in later navigations. Our method involves optimizing a city-scale neural radiance field with data from previous journeys to generate neural priors. These priors, rich in semantic and geometric details, are derived without manual annotations and can seamlessly augment various state-of-the-art perception models, improving their efficacy with minimal additional computational cost. Experimental results on the nuScenes dataset demonstrate the framework's high compatibility with diverse online perception models. Specifically, it shows remarkable improvements in HD-map construction and occupancy prediction tasks, highlighting its potential as a new perception framework for autonomous driving systems.


# 300
Strong Double Blind
Unified Local-Cloud Decision-Making via Reinforcement Learning

Kathakoli Sengupta · Zhongkai Shangguan · Sandesh Bharadwaj · Sanjay Arora · Eshed Ohn-Bar · Renato Mancuso

Embodied vision-based real-world systems, such as mobile robots, require careful balancing between energy consumption, compute latency, and safety constraints to optimize operation across dynamic tasks and contexts. As local computation tends to be restricted, offloading the computation, i.e., to a remote server, can save local resources while providing access to high-quality predictions from powerful and large models. Yet, the resulting communication and latency overhead has led to limited usability of cloud models in dynamic, safety-critical, real-time settings. Towards effectively addressing this trade-off, in this work, we introduce UniLCD, a novel hybrid inference framework for enabling flexible local-cloud collaboration. By efficiently optimizing a flexible routing module via reinforcement learning and a suitable multi-task objective, UniLCD is specifically designed to support multiple constraints of safety-critical end-to-end mobile systems. We validate the proposed approach using a challenging crowded navigation task requiring frequent and timely switching between local and cloud operations. UniLCD demonstrates both improved overall performance and efficiency, by over 17% compared to state-of-the-art baselines based on various split computing strategies.
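To make the multi-task trade-off concrete, here is a hedged sketch of the kind of reward a local-cloud routing policy could be trained with; the terms and weights (progress, energy, latency, safety) are illustrative assumptions, not the paper's exact objective.

```python
# Hedged sketch of a multi-term routing reward (terms and weights are
# illustrative assumptions): task progress is rewarded, while energy use,
# latency, and safety violations are penalized.
def routing_reward(progress, used_cloud, latency_s, collision,
                   w_progress=1.0, w_energy=0.05, w_latency=0.5, w_safety=10.0):
    energy_cost = 1.0 if used_cloud else 0.2   # crude per-step energy proxy; an assumption
    return (w_progress * progress
            - w_energy * energy_cost
            - w_latency * latency_s
            - w_safety * float(collision))

# Toy comparison of a fast local step vs. a slower but more accurate cloud step.
print(routing_reward(progress=0.8, used_cloud=False, latency_s=0.02, collision=False))
print(routing_reward(progress=1.0, used_cloud=True, latency_s=0.15, collision=False))
```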


# 297
Generative End-to-End Autonomous Driving

Wenzhao Zheng · Ruiqi Song · Xianda Guo · Chenming Zhang · Long Chen

Directly producing planning results from raw sensors has been a long-desired solution for autonomous driving and has attracted increasing attention recently. Most existing end-to-end autonomous driving methods factorize this problem into perception, motion prediction, and planning. However, we argue that the conventional progressive pipeline still cannot comprehensively model the entire traffic evolution process, e.g., the future interaction between the ego car and other traffic participants and the structural trajectory prior. In this paper, we explore a new paradigm for end-to-end autonomous driving, where the key is to predict how the ego car and the surroundings evolve given past scenes. We propose GenAD, a generative framework that casts autonomous driving into a generative modeling problem. We propose an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens. We then employ a variational autoencoder to learn the future trajectory distribution in a structural latent space for trajectory prior modeling. We further adopt a temporal model to capture the agent and ego movements in the latent space to generate more effective future trajectories. Finally, GenAD performs motion prediction and planning simultaneously by sampling from the learned structural latent space conditioned on the instance tokens and using the learned temporal model to generate futures. Extensive experiments on the widely used nuScenes benchmark show that the proposed GenAD achieves state-of-the-art performance on vision-centric end-to-end autonomous driving with high efficiency.


# 280
Strong Double Blind
MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction

Seongju Lee · Junseok Lee · Yeonguk Yu · Taeri Kim · KYOOBIN LEE

Multi-agent trajectory prediction is crucial to autonomous driving and understanding the surrounding environment. In recent years, learning-based approaches for multi-agent trajectory prediction, primarily relying on graph neural networks, graph transformers, and hypergraph neural networks, have demonstrated outstanding performance on real-world datasets. However, hypergraph transformer-based methods for trajectory prediction have yet to be explored. Therefore, we present a MultiscAle Relational Transformer (MART) network for multi-agent trajectory prediction. MART is a hypergraph transformer architecture that considers both individual and group behaviors within the transformer machinery. The core module of MART is the encoder, comprising a Pair-wise Relational Transformer (PRT) and a Hyper Relational Transformer (HRT). The encoder extends the capabilities of a relational transformer by introducing HRT, which integrates hyperedge features into the transformer mechanism, promoting the attention weights to focus on group-wise relations. In addition, we propose an Adaptive Group Estimator (AGE) designed to infer complex group relations in real-world environments. Extensive experiments on three real-world datasets (NBA, SDD, and ETH-UCY) demonstrate that our method achieves state-of-the-art performance, enhancing ADE/FDE by 3.9%/11.8% on the NBA dataset.


# 288
Strong Double Blind
Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Zhenghao Peng · Wenjie Luo · Yiren Lu · Tianyi Shen · Cole Gulino · Ari Seff · Justin Fu

A major challenge in autonomous vehicle research is agent behavior modeling, which has critical applications including constructing realistic and reliable simulations for off-board evaluation and motion forecasting of traffic agents for onboard planning. Motion prediction models trained via supervised learning have recently proven effective at modeling agents across many domains. However, these models are subject to distribution shift when deployed at test-time. In this work, we improve the reliability of agent behaviors by closed-loop fine-tuning of behavior models with reinforcement learning. We demonstrate that we can improve overall performance, as well as targeted metrics such as collision rate, on the Waymo Open Sim Agents challenge benchmark. We also introduce a novel policy evaluation ranking benchmark for directly evaluating the ability of sim agents to measure planning quality and demonstrate the effectiveness of our approach on this new benchmark.


# 133
Strong Double Blind
LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow

Hongyu Wen · Erich Liang · Jia Deng

Achieving 3D understanding of non-Lambertian objects is an important task with many useful applications, but most existing algorithms struggle to deal with such objects. One major obstacle towards progress in this field is the lack of holistic non-Lambertian benchmarks: most benchmarks have low scene and object diversity, and none provide multi-layer 3D annotations for objects occluded by transparent surfaces. In this paper, we introduce LayeredFlow, a real-world benchmark containing multi-layer ground truth annotation for optical flow of non-Lambertian objects. Compared to previous benchmarks, our benchmark exhibits greater scene and object diversity, with 15k high-quality optical flow and stereo pairs taken over 185 indoor and outdoor scenes and 360 unique objects. Using LayeredFlow as evaluation data, we propose a new task called multi-layer optical flow. To provide training data for this task, we introduce a large-scale densely-annotated synthetic dataset containing 60k images within 30 scenes tailored for non-Lambertian objects. Training on our synthetic dataset enables models to predict multi-layer optical flow, while fine-tuning existing optical flow methods on the dataset notably boosts their performance on non-Lambertian objects without compromising the performance on diffuse objects.


# 238
Strong Double Blind
Decomposition Betters Tracking Everything Everywhere

Rui Li · Dong Liu

Recent studies on motion estimation have advocated an optimized motion representation that is globally consistent across the entire video, preferably for every pixel. This is challenging as a uniform representation may not account for the complex and diverse motion and appearance of natural videos. We address this problem and propose a new test-time optimization method, named DecoMotion, for estimating per-pixel and long-range motion. DecoMotion explicitly decomposes video content into static scenes and dynamic objects, each of which is represented by a quasi-3D canonical volume. DecoMotion separately coordinates the transformations between local and canonical spaces, facilitating an affine transformation for the static scene that corresponds to camera motion. For the dynamic volume, DecoMotion leverages discriminative and temporally consistent features to rectify the non-rigid transformation. The two volumes are finally fused to fully represent motion and appearance. This divide-and-conquer strategy leads to more robust tracking through occlusions and deformations while also recovering decomposed appearances. We conduct evaluations on the TAP-Vid benchmark. The results demonstrate our method boosts the point-tracking accuracy by a large margin and performs on par with some state-of-the-art dedicated point-tracking solutions. Code and models will be available upon acceptance of the paper.


# 247
Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching

Junpeng Jing · Ye Mao · Krystian Mikolajczyk

Dynamic stereo matching is the task of estimating consistent disparities from stereo videos with dynamic objects. Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies. Existing video methods apply per-frame matching and window-based cost aggregation across the time dimension, leading to low-frequency oscillations at the scale of the window size. To address this challenge, we develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation. We further propose a novel framework, BiDAStereo, that achieves consistent dynamic stereo matching. Unlike the existing methods, we model this task as local matching and global aggregation. Locally, we consider correlation in a triple-frame manner to pool information from adjacent frames and improve the temporal consistency. Globally, to exploit the entire sequence's consistency and extract dynamic scene cues for aggregation, we develop a motion-propagation recurrent unit. Extensive experiments demonstrate the performance of our method, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks. The code is available in the supplemental material.


# 241
Strong Double Blind
Efficient Learning of Event-based Dense Representation using Hierarchical Memories with Adaptive Update

Uday Kamal · Saibal Mukhopadhyay

Leveraging the high temporal resolution of an event-based camera requires highly efficient event-by-event processing. However, dense prediction tasks require explicit pixel-level association, which is challenging for event-based processing frameworks. Existing works aggregate the events into a static frame-like representation at the cost of a much slower processing rate and high compute cost. To address this challenge, this work introduces an event-based spatiotemporal representation learning framework for efficiently solving dense prediction tasks. We uniquely handle the sparse, asynchronous events using an unstructured, set-based approach and project them into a hierarchically organized multi-level latent memory space that preserves the pixel-level structure. Low-level event streams are dynamically encoded into these latent structures through an explicit attention-based spatial association. Unlike existing works that update these memory stacks at a fixed rate, we introduce a data-adaptive update rate that recurrently keeps track of the past memory states and learns to update the corresponding memory stacks only when it has substantial new information, thereby improving the overall compute latency. Our method consistently achieves competitive performance across different event-based dense prediction tasks while ensuring much lower latency compared to the existing methods.
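A simplified, assumed sketch of a data-adaptive memory update: a gate scores how much new information an incoming event-feature packet carries, and the memory is rewritten only when that score passes a threshold. The module names and the hard threshold are illustrative choices, not the paper's design.

```python
# Simplified, assumed sketch of a data-adaptive memory update: a gate scores
# how much new information the incoming event features carry, and the memory
# is rewritten only when that score exceeds a threshold.
import torch
import torch.nn as nn

class AdaptiveMemory(nn.Module):
    def __init__(self, dim=64, threshold=0.5):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.update = nn.GRUCell(dim, dim)
        self.threshold = threshold

    def forward(self, memory, event_feat):
        # memory, event_feat: (B, dim)
        novelty = self.gate(torch.cat([memory, event_feat], dim=-1))
        candidate = self.update(event_feat, memory)
        write = (novelty > self.threshold).float()      # hard gate, for illustration only
        return write * candidate + (1 - write) * memory

model = AdaptiveMemory()
memory = torch.zeros(8, 64)
for _ in range(10):                                     # a stream of event-feature packets
    memory = model(memory, torch.randn(8, 64))
print(memory.shape)                                     # torch.Size([8, 64])
```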


# 236
Strong Double Blind
Towards Real-world Event-guided Low-light Video Enhancement and Deblurring

Taewoo Kim · Jaeseok Jeong · Hoonhee Cho · Yuhwan Jeong · Kuk-Jin Yoon

In low-light conditions, capturing videos with frame-based cameras often requires long exposure times, resulting in motion blur and reduced visibility. While frame-based motion deblurring and low-light enhancement have been studied, they still pose significant challenges. Event cameras have emerged as a promising solution for improving image quality in low-light environments and addressing motion blur. They provide two key advantages: capturing scene details well even in low light due to their high dynamic range, and effectively capturing motion information during long exposures due to their high temporal resolution. Despite efforts to tackle low-light enhancement and motion deblurring using event cameras separately, previous work has not addressed both simultaneously. To explore the joint task, we first establish real-world datasets for event-guided low-light enhancement and deblurring using a hybrid camera system based on beam splitters. Subsequently, we introduce an end-to-end framework to effectively handle these tasks. Our framework incorporates a module to efficiently leverage temporal information from events and frames. Furthermore, we propose a module to utilize cross-modal feature information to employ a low-pass filter for noise suppression while enhancing the main structural information. Our proposed method significantly outperforms existing approaches in addressing the joint task. Our project pages are available at https://github.com/intelpro/ELEDNet.


# 260
Understanding Physical Dynamics with Counterfactual World Modeling

Rahul Mysore Venkatesh · Honglin Chen · Kevin Feigelis · Daniel M Bear · Khaled Jedoui · Klemen Kotar · Felix J Binder · Wanhee Lee · Sherry Liu · Kevin Smith · Judith E. Fan · Daniel Yamins

The ability to understand physical dynamics is critical for agents to act in the world. Here, we use Counterfactual World Modeling (CWM) to extract vision structures for dynamics understanding. CWM uses a temporally-factored masking policy for masked prediction of video data without annotations. This policy enables highly effective "counterfactual prompting" of the predictor, allowing a spectrum of visual structures to be extracted from a single pre-trained predictor in a zero-shot manner. We demonstrate that these structures are useful for physical dynamics understanding, allowing CWM to achieve state-of-the-art performance on the Physion benchmark.


# 191
Strong Double Blind
Prompting Future Driven Diffusion Model for Hand Motion Prediction

Bowen Tang · Kaihao Zhang · Wenhan Luo · Wei Liu · HONGDONG LI

Hand motion prediction from both first- and third-person perspectives is vital for enhancing user experience in AR/VR and ensuring safe remote robotic arm control. Previous works typically focus on predicting hand motion trajectories or human body motion, with direct hand motion prediction remaining largely unexplored, despite the additional challenges posed by compact skeleton size. To address this, we propose a prompt-based Future Driven Diffusion Model (PromptFDDM) for predicting hand motion with guidance and prompts. Specifically, we develop a Spatial-Temporal Extractor Network (STEN) to predict hand motion with guidance, together with a Ground Truth Extractor Network (GTEN) and a Reference Data Generator Network (RDGN), which respectively extract ground-truth future data and substitute it with generated reference data to guide STEN. Additionally, interactive prompts generated from observed motions further enhance model performance. Experimental results on the FPHA and HO3D datasets demonstrate that the proposed PromptFDDM achieves state-of-the-art performance in both first- and third-person perspectives.


# 198
Nymeria: A Massive Collection of Egocentric Multi-modal Human Motion in the Wild

Lingni Ma · Yuting Ye · Rowan Postyeni · Alexander J Gamino · Vijay Baiyya · Luis Pesqueira · Kevin M Bailey · David Soriano Fosas · Fangzhou Hong · Vladimir Guzov · Yifeng Jiang · Hyo Jin Kim · Jakob Engel · Karen Liu · Ziwei Liu · Renzo De Nardi · Richard Newcombe

We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multi-modal egocentric devices. The dataset comes with a) full-body motion ground truth; b) egocentric multimodal recordings from Project Aria devices, including color, grayscale and eye-tracking cameras, IMUs, magnetometer, barometer, and multi-channel microphones; and c) an additional "observer" device providing a third-person viewpoint. We compute world-aligned 6DoF transformations for all sensors, across devices and capture sessions. The dataset also provides 3D scene point clouds and calibrated eye gaze. We derive a protocol to annotate hierarchical language descriptions of in-context human motion, from fine-grain dense body pose narrations, to simplified atomic actions and coarse activity summarization. To the best of our knowledge, the Nymeria dataset is the world's largest collection of human motion in the wild with natural and diverse activities; the first of its kind to provide synchronized and localized multi-device multimodal egocentric data; and also the world's largest dataset of motion with language descriptions. It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations. The accumulated trajectory from participants is 399.2 km for the head and 1053.3 km for both wrists. To facilitate research, we define multiple research tasks in egocentric body tracking, motion synthesis, and action recognition. The performance of several state-of-the-art algorithms is reported. We will open-source the data and code to empower future exploration of the research community.


# 231
Motion Mamba: Efficient and Long Sequence Motion Generation

Zeyu Zhang · Akide Liu · Ian Reid · Richard Hartley · Bohan Zhuang · Hao Tang

Human motion generation stands as a significant pursuit in generative computer vision, while achieving long-sequence and efficient motion generation remains challenging. Recent advancements in state space models (SSMs), notably Mamba, have showcased considerable promise in long sequence modeling with an efficient hardware-aware design, which makes them a promising foundation for motion generation models. Nevertheless, adapting SSMs to motion generation faces hurdles due to the lack of a specialized architecture for modeling motion sequences. To address these multifaceted challenges, we introduce three key contributions. Firstly, we propose Motion Mamba, an innovative yet straightforward approach that presents the first motion generation model built upon SSMs. Secondly, we design a Hierarchical Temporal Mamba (HTM) block to process temporal data by traversing through a symmetric architecture aimed at preserving motion consistency between frames. We also design a Bidirectional Spatial Mamba (BSM) block to bidirectionally process latent poses, in order to enhance accurate motion generation within a temporal frame. Lastly, the proposed method outperforms other well-established methods on the HumanML3D and KIT-ML datasets, which demonstrates strong capabilities in high-quality long-sequence motion modeling and real-time human motion generation.


# 273
TLControl: Trajectory and Language Control for Human Motion Synthesis

WEILIN WAN · Zhiyang Dou · Taku Komura · Wenping Wang · Dinesh Jayaraman · Lingjie Liu

Controllable human motion synthesis is essential for applications in AR/VR, gaming and embodied AI. Existing methods often focus solely on either language or full trajectory control, lacking precision in synthesizing motions aligned with user-specified trajectories, especially for multi-joint control. To address these issues, we present TLControl, a novel method for realistic human motion synthesis, incorporating both low-level Trajectory and high-level Language semantics controls, through an integration of neural-based and optimization-based techniques. Specifically, we begin with training a VQ-VAE for a compact and well-structured latent motion space organized by body parts. We then propose a Masked Trajectories Transformer (MTT) for predicting a motion distribution conditioned on language and trajectory. Once trained, we use MTT to sample initial motion predictions given user-specified partial trajectories and text descriptions as conditioning. Finally, we introduce a test-time optimization to refine these coarse predictions for precise trajectory control, which offers flexibility by allowing users to specify various optimization goals, and ensures high runtime efficiency. Comprehensive experiments show that TLControl significantly outperforms the state-of-the-art in trajectory accuracy and time efficiency, making it practical for interactive and high-quality animation generation.


# 278
ParCo: Part-Coordinating Text-to-Motion Synthesis

Qiran Zou · Shangyuan Yuan · Shian Du · YU WANG · Chang Liu · Yi Xu · Jie Chen · Xiangyang Ji

We study a challenging task: text-to-motion synthesis, aiming to generate motions that align with textual descriptions and exhibit coordinated movements. Currently, the part-based methods introduce part partition into the motion synthesis process to achieve finer-grained generation. However, these methods encounter challenges such as the lack of coordination between different part motions and difficulties for networks to understand part concepts. Moreover, introducing finer-grained part concepts poses computational complexity challenges. In this paper, we propose Part-Coordinating Text-to-Motion Synthesis (ParCo), endowed with enhanced capabilities for understanding part motions and communication among different part motion generators, ensuring coordinated and fine-grained motion synthesis. Specifically, we discretize whole-body motion into multiple part motions to establish the prior concept of different parts. Afterward, we employ multiple lightweight generators designed to synthesize different part motions and coordinate them through our part coordination module. Our approach demonstrates superior performance on common benchmarks, including HumanML3D and KIT-ML, with economical computation, providing substantial evidence of its effectiveness. The code is available in the supplementary material.


# 277
BAMM: Bidirectional Autoregressive Motion Model

Ekkasit Pinyoanuntapong · Muhammad Usama Saleem · Pu Wang · Minwoo Lee · Srijan Das · Chen Chen

Generating human motion from text has been dominated by denoising motion models either through diffusion or a generative masking process. However, these models face great limitations in usability by requiring prior knowledge of the motion length. Conversely, autoregressive motion models address this limitation by adaptively predicting motion endpoints, at the cost of degraded generation quality and editing capabilities. To address these challenges, we propose Bidirectional Autoregressive Motion Model (BAMM), a novel text-to-motion generation framework. BAMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and (2) a masked self-attention transformer that autoregressively predicts randomly masked tokens via a hybrid attention masking strategy. By unifying generative masked modeling and autoregressive modeling, BAMM captures rich and bidirectional dependencies among motion tokens, while learning the probabilistic mapping from textual inputs to motion outputs with dynamically-adjusted motion sequence length. This feature enables BAMM to simultaneously achieve high-quality motion generation with enhanced usability and built-in motion editability. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that BAMM surpasses current state-of-the-art methods in both qualitative and quantitative measures.
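Purely for intuition, the toy sketch below builds the two attention-mask regimes that any hybrid masking strategy has to reconcile (bidirectional attention over visible tokens for masked prediction, and a causal mask for autoregressive prediction) and combines them in one naive, assumed way; the paper's actual strategy may differ.

```python
# Toy illustration (assumed, simplified): the two attention-mask regimes a
# hybrid masking strategy has to reconcile, plus one naive way to combine them.
# True means "query position may attend to key position".
import torch

def bidirectional_mask(visible):            # visible: (T,) bool; False = masked target token
    T = visible.shape[0]
    attend = visible.unsqueeze(0).expand(T, T).clone()   # everyone sees the visible tokens
    attend[torch.arange(T), torch.arange(T)] = True      # and itself
    return attend

def causal_mask(T):
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

T = 8
visible = torch.rand(T) > 0.4               # randomly mask ~40% of the motion tokens
hybrid = bidirectional_mask(visible) & causal_mask(T)    # one naive blend; the paper's may differ
print(hybrid.int())
```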


# 263
Strong Double Blind
Pose Guided Fine-Grained Sign Language Video Generation

Tongkai Shi · Lianyu Hu · Fanhua Shang · Jichao Feng · liu peidong · Wei Feng

Sign language videos are an important medium for spreading and learning sign language. However, most existing human image synthesis methods produce sign language images with details that are distorted, blurred, or structurally incorrect. They also produce sign language video frames with poor temporal consistency, with anomalies such as flickering and abrupt detail changes between the previous and next frames. To address these limitations, we propose a novel Pose-Guided Motion Model (PGMM) for generating fine-grained and motion-consistent sign language videos. Firstly, we propose a new Coarse Motion Module (CMM), which completes the deformation of features by optical flow warping, thus transferring the motion of coarse-grained structures without changing the appearance. Secondly, we propose a new Pose Fusion Module (PFM), which guides the modal fusion of RGB and pose features, thus completing the fine-grained generation. Finally, we design a new metric, Temporal Consistency Difference (TCD), to quantitatively assess the degree of temporal consistency of a video by comparing the difference between the frames of the reconstructed video and the previous and next frames of the target video. Extensive qualitative and quantitative experiments show that our method outperforms state-of-the-art methods in most benchmark tests, with visible improvements in details and temporal consistency.
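One plausible reading of a temporal-consistency metric of this kind is sketched below: compare the frame-to-frame changes of the reconstructed video with those of the target video and average the discrepancy. This is an interpretation for illustration, not the paper's exact TCD definition.

```python
# One plausible reading of a temporal-consistency metric (illustration only,
# not the paper's exact TCD): compare frame-to-frame changes of the
# reconstructed video against those of the target video.
import numpy as np

def temporal_consistency_difference(reconstructed, target):
    # both: (T, H, W, C) float arrays in [0, 1]
    rec_diff = np.abs(np.diff(reconstructed, axis=0))   # change between consecutive frames
    tgt_diff = np.abs(np.diff(target, axis=0))
    return float(np.abs(rec_diff - tgt_diff).mean())

rng = np.random.default_rng(0)
target = rng.random((16, 64, 64, 3))
flickery = np.clip(target + rng.normal(scale=0.1, size=target.shape), 0, 1)
print(temporal_consistency_difference(target, target))              # 0.0
print(round(temporal_consistency_difference(flickery, target), 4))  # > 0
```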


# 249
Strong Double Blind
DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

Liao Shen · Tianqi Liu · Huiqiang Sun · Xinyi Ye · Baopu Li · Jianming Zhang · Zhiguo Cao

We study the problem of generating intermediate images from image pairs with large motion while maintaining semantic consistency. Due to the large motion, the intermediate semantic information may be absent in input images. Existing methods either limit to small motion or focus on topologically similar objects, leading to artifacts and inconsistency in the interpolation results. To overcome this challenge, we delve into pre-trained image diffusion models for their capabilities in semantic cognition and representations, ensuring consistent expression of the absent intermediate semantic representations with the input. To this end, we propose DreamMover, a novel image interpolation framework with three main components: 1) A natural flow estimator based on the diffusion model that can implicitly reason about the semantic correspondence between two images. 2) To avoid the loss of detailed information during fusion, our key insight is to fuse information in two parts, high-level space and low-level space. 3) To enhance the consistency between the generated images and input, we propose the self-attention concatenation and replacement approach. Lastly, we present a challenging benchmark dataset called InterpBench to evaluate the semantic consistency of generated results. Extensive experiments demonstrate the effectiveness of our method. Code will be released soon.


# 276
Animate Your Motion: Turning Still Images into Dynamic Videos

Mingxiao Li · Bo Wan · Marie-Francine Moens · Tinne Tuytelaars

In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs to more accurately reflect user intentions. Traditional efforts predominantly focus on employing either semantic cues, like images or depth maps, or motion-based conditions, like moving sketches or object bounding boxes. Semantic inputs offer a rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation. To this end, we introduce the Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates a recognized motion conditioning module (GLIGEN; Li et al., 2023) and investigates various approaches to integrate scene conditions, promoting synergy between different modalities. For model training, we separate the conditions for the two modalities, introducing a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence.


# 282
Strong Double Blind
V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation

Pooja Guhan · Tsung-Wei Huang · Guan-Ming Su · Subhadra Gopalakrishnan · Dinesh Manocha

We introduce V-Trans4Style, an innovative algorithm tailored for dynamic video content editing needs. It is designed to adapt videos to different production styles like documentaries, dramas, feature films, or a specific YouTube channel's video-making technique. Our algorithm recommends optimal visual transitions to help achieve this flexibility using a more bottom-up approach. We first employ a transformer-based encoder-decoder network to learn recommending temporally consistent and visually seamless sequences of visual transitions using only the input videos. We then introduce a style conditioning module that leverages this model to iteratively adjust the visual transitions obtained from the decoder through activation maximization. We demonstrate the efficacy of our method through experiments conducted on our newly introduced AutoTransition++ dataset. It is a 6k video version of AutoTransition Dataset that additionally categorizes its videos into different production style categories. Our encoder-decoder model outperforms the state-of-the-art transition recommendation method, achieving improvements of 10% to 80% in Recall@K and mean rank values over baseline. Our style conditioning module results in visual transitions that improve the capture of the desired video production style characteristics by an average of around 12% in comparison to other methods when measured with similarity metrics. We hope that our work serves as a foundation for exploring and understanding video production styles further.


# 275
DragVideo: Interactive Drag-style Video Editing

Yufan Deng · Ruida Wang · Yuhao ZHANG · Yu-Wing Tai · Chi-Keung Tang

Video generation models have shown their superior ability to generate photo-realistic video. However, how to accurately control (or edit) the video remains a formidable challenge. The main issues are: 1) how to perform direct and accurate user control in editing; 2) how to execute edits such as changing shape, expression, and layout without unsightly distortion and artifacts in the edited content; and 3) how to maintain spatio-temporal consistency of video after editing. To address the above issues, we propose DragVideo, a general drag-style video editing framework. Inspired by DragGAN, DragVideo addresses issues 1) and 2) by proposing the drag-style video latent optimization method, which gives the desired control by updating the noisy video latent according to drag instructions through a video-level drag objective function. We amend issue 3) by integrating the video diffusion model with sample-specific LoRA and Mutual Self-Attention in DragVideo to ensure the edited result is spatio-temporally consistent. We also present a series of testing examples for drag-style video editing and conduct extensive experiments across a wide array of challenging editing tasks, such as motion and skeleton editing, underscoring that DragVideo can edit video intuitively and faithfully to the user's intention, with nearly unnoticeable distortion and artifacts, while maintaining spatio-temporal consistency. Whereas traditional prompt-based video editing fails at the former two and directly applying image drag editing fails at the last, DragVideo's versatility and generality are emphasized. Codes will be released.


# 274
StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion

Ming Tao · BINGKUN BAO · Hao Tang · Yaowei Wang · Changsheng Xu

Story visualization aims to generate a series of realistic and coherent images based on a storyline. Current models adopt a frame-by-frame architecture by transforming the pre-trained text-to-image model into an auto-regressive generator. Although these models have shown notable progress, there are still three flaws. 1) The unidirectional generation of the auto-regressive manner restricts the usability in many scenarios. 2) The additionally introduced story history encoders bring an extremely high computational cost. 3) The story visualization and continuation models are trained and inferred independently, which is not user-friendly. To these ends, we propose a bidirectional, unified, and efficient framework, namely StoryImager. The StoryImager enhances the storyboard generative ability inherited from the pre-trained text-to-image model for a bidirectional generation. Specifically, we introduce a Target Frame Masking Strategy to extend and unify different story image generation tasks. Furthermore, we propose a Frame-Story Cross Attention Module that decomposes the cross attention for local fidelity and global coherence. Moreover, we design a Contextual Feature Extractor to extract contextual information from the whole storyline. The extensive experimental results demonstrate the excellent performance of our StoryImager.


# 279
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

Haoyu Zhao · Tianyi Lu · Jiaxi Gu · Xing Zhang · Qingping Zheng · Zuxuan Wu · Hang Xu · Yu-Gang Jiang

The diffusion model is widely leveraged for either video generation or video editing. As each field has its task-specific problems, it is difficult to merely develop a single diffusion for completing both tasks simultaneously. Video diffusion solely relying on the text prompt can be adapted to unify the two tasks. However, it lacks a high capability of aligning heterogeneous modalities between text and image, leading to various misalignment problems. In this work, we are the first to propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both tasks of high-fidelity video generation and editing. The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment. Particularly, the subject-driven alignment is put forward to trade off the image and text prompts, serving as a unified foundation generative model for both tasks. The adaptive prompts alignment is introduced to emphasize different strengths of homogeneous and heterogeneous alignments by assigning different values of weights to the image and the text prompts. The high-fidelity alignment is developed to further enhance the fidelity of both video generation and editing by taking the subject image as an additional model input. Experimental results on four benchmarks suggest that our method outperforms previous methods on each task.


# 309
Strong Double Blind
FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing

Gwanhyeong Koo · Sunjae Yoon · Ji Woo Hong · Chang Yoo

Current image editing methods primarily utilize DDIM Inversion, employing a two-branch diffusion approach to preserve the attributes and layout of the original image. However, these methods encounter challenges with non-rigid edits, which involve altering the image's layout or structure. Our comprehensive analysis reveals that the high-frequency components of DDIM latent, crucial for retaining the original image's key features and layout, significantly contribute to these limitations. Addressing this, we introduce FlexiEdit, which enhances fidelity to input text prompts by refining DDIM latent, specifically by reducing high-frequency components in targeted editing areas. FlexiEdit comprises two key components: (1) Latent Refinement, which modifies DDIM latent to better accommodate layout adjustments, and (2) Edit Fidelity Enhancement via Re-inversion, aimed at ensuring the edits more accurately reflect the input text prompts. Our approach represents notable progress in image editing, particularly in performing complex non-rigid edits, showcasing its enhanced capability through comparative experiments.
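As a rough approximation of the latent-refinement idea (my assumption of the mechanics, not the released code), the sketch below damps the high-frequency content of a latent inside a target edit region with a simple FFT low-pass filter.

```python
# Rough approximation (my assumption, not the released code): damp the
# high-frequency content of a latent inside a target edit region using a
# simple FFT low-pass filter.
import torch

def lowpass_in_region(latent, region_mask, cutoff=0.125, strength=0.7):
    # latent: (C, H, W); region_mask: (H, W) in {0, 1}; cutoff in cycles/sample
    C, H, W = latent.shape
    fy = torch.fft.fftfreq(H).abs().view(H, 1)
    fx = torch.fft.fftfreq(W).abs().view(1, W)
    keep = ((fy <= cutoff) & (fx <= cutoff)).float()        # low-frequency band
    filtered = torch.fft.ifft2(torch.fft.fft2(latent) * keep).real
    blended = (1 - strength) * latent + strength * filtered # partially remove high freqs
    return latent * (1 - region_mask) + blended * region_mask

latent = torch.randn(4, 64, 64)                             # e.g. a DDIM latent
mask = torch.zeros(64, 64)
mask[16:48, 16:48] = 1.0                                    # targeted editing area
print(lowpass_in_region(latent, mask).shape)                # torch.Size([4, 64, 64])
```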


# 311
Lazy Diffusion Transformer for Interactive Image Editing

Yotam Nitzan · Zongze Wu · Richard Zhang · Eli Shechtman · Danny Cohen-Or · Taesung Park · Michaël Gharbi

We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently, targeting interactive image editing applications. Starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using a binary mask and a text prompt. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this global context, a diffusion-based decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime and computation cost scale with the mask size, which is typically small for interactive edits. Since the diffusion process dominates the runtime and cost, our encoder introduces negligible overhead. Our approach amortizes the generation cost over several user interactions, making the editing experience more interactive, especially at high resolution. We train LazyDiffusion on a large-scale text-to-image dataset at 1024x1024 resolution. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10× speedup for typical user interactions, where the editing mask represents 10% of the image.
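A toy sketch of the "lazy" decoding idea under assumed shapes: only the tokens inside the user mask, plus a compact global context vector, are passed through the heavy decoder, so compute scales with the mask rather than the canvas. The encoder, diffusion steps, and patch bookkeeping are all stubbed out.

```python
# Toy sketch of "lazy" decoding under assumed shapes: only tokens inside the
# user mask, plus a compact global context, go through the heavy decoder, so
# compute scales with the mask size rather than the canvas size.
import torch
import torch.nn as nn

patch = 16
mask = torch.zeros(1024 // patch, 1024 // patch, dtype=torch.bool)
mask[10:20, 30:45] = True                               # small user-selected edit region

tokens = torch.randn(mask.numel(), 256)                 # per-patch canvas tokens (stub)
global_ctx = torch.randn(1, 256)                        # compact context from the encoder (stub)

decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)

masked_tokens = tokens[mask.flatten()]                  # only 150 of 4096 tokens
out = decoder(torch.cat([global_ctx, masked_tokens]).unsqueeze(0))[0, 1:]

tokens_out = tokens.clone()
tokens_out[mask.flatten()] = out                        # scatter refined tokens back
print(f"{masked_tokens.shape[0]} of {tokens.shape[0]} tokens decoded")
```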


# 306
Strong Double Blind
WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians

Dmytro Kotovenko · Olga Grebenkova · Nikolaos Sarafianos · Avinash Paliwal · Pingchuan Ma · Omid Poursaeed · Sreyas Mohan · Yuchen Fan · Yilei Li · Rakesh Ranjan · Bjorn Ommer

While style transfer techniques have been well-developed for 2D image stylization, the extension of these methods to 3D scenes remains relatively unexplored. Existing approaches demonstrate proficiency in transferring colors and textures but often struggle with replicating the geometry of the scenes. In our work, we leverage an explicit Gaussian Splatting (GS) representation and directly match the distributions of Gaussians between style and content scenes using the Earth Mover's Distance (EMD). By employing the entropy-regularized Wasserstein-2 distance, we ensure that the transformation maintains spatial smoothness. Additionally, we decompose the scene stylization problem into smaller chunks to enhance efficiency. This paradigm shift reframes stylization from a pure generative process driven by latent space losses to an explicit matching of distributions between two Gaussian representations. Our method achieves high-resolution 3D stylization by faithfully transferring details from 3D style scenes onto the content scene. Furthermore, WaSt-3D consistently delivers results across diverse content and style scenes without necessitating any training, as it relies solely on optimization-based techniques.
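To illustrate the core matching step, the sketch below computes an entropy-regularized optimal transport plan between two small random point sets (standing in for content and style Gaussian parameters) with plain Sinkhorn iterations, and reads off a Wasserstein-2 cost and barycentric targets; the regularization strength and set sizes are arbitrary toy choices, not the paper's pipeline.

```python
# Toy illustration of the matching step: entropy-regularized optimal transport
# between two random point sets (standing in for content and style Gaussian
# parameters) via plain Sinkhorn iterations. Sizes and epsilon are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
content = rng.normal(size=(200, 3))             # e.g. content-scene Gaussian centers
style = rng.normal(loc=1.0, size=(200, 3))      # e.g. style-scene Gaussian centers

C = ((content[:, None, :] - style[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
eps = 0.05 * C.mean()                            # entropic regularization strength
K = np.exp(-C / eps)
n = content.shape[0]
a = b = np.full(n, 1.0 / n)
u = np.ones(n)
for _ in range(200):                             # Sinkhorn fixed-point iterations
    v = b / (K.T @ u)
    u = a / (K @ v)
P = u[:, None] * K * v[None, :]                  # regularized transport plan
w2_squared = (P * C).sum()                       # (regularized) Wasserstein-2 cost
targets = (P / P.sum(1, keepdims=True)) @ style  # barycentric target per content Gaussian
print(round(float(w2_squared), 3), targets.shape)
```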


# 316
Strong Double Blind
Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis

Zipeng Qi · Guoxi Huang · Chenyang Liu · Fei Ye

This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We first introduce vision guidance as a foundational spatial cue within the perturbed distribution. This significantly refines the search space in a zero-shot paradigm to focus on the image sampling process adhering to the spatial layout conditions. To precisely control the spatial layouts of multiple visual concepts with the employment of vision guidance, we propose a universal framework, Layered Rendering Diffusion (LRDiff), which constructs an image-rendering process with multiple layers, each of which applies the vision guidance to instructively estimate the denoising direction for a single object. Such a layered rendering strategy effectively prevents issues like unintended conceptual blending or mismatches while allowing for more coherent and contextually accurate image synthesis. The proposed method offers a more efficient and accurate means of synthesising images that align with specific layout and contextual requirements. Through experiments, we demonstrate that our method outperforms existing techniques, both quantitatively and qualitatively, in two specific layout-to-image tasks: bounding box-to-image and instance mask-to-image. Furthermore, we extend the proposed framework to enable spatially controllable editing.


# 305
Strong Double Blind
Commonly Interesting Images

Fitim Abdullahu · Helmut Grabner

Images tell stories, trigger emotions, and let us recall memories -- they make us think. Thus, they have the ability to attract and hold one's attention, which is the definition of being ``interesting''. Yet, the appeal of an image is highly subjective. Looking at the image of my boy taking his first steps will always bring me back to this emotional moment, while it is just a blurry, quickly taken snapshot to most others. Preferences vary widely: some adore cats, others are dog enthusiasts, and a third group may not be fond of either. We argue that every image can be interesting to a particular observer under certain circumstances. This work particularly emphasizes subjective preferences. However, our analysis of 2.5k image collections from diverse users of the photo-sharing platform Flickr reveals that specific image characteristics make them commonly more interesting. For instance, images such as professionally taken landscapes appeal broadly due to their aesthetic qualities. In contrast, subjectively interesting images, such as those depicting personal or niche community events, resonate on a more individual level, often evoking personal memories and emotions.


# 310
InstructGIE: Towards Generalizable Image Editing

Zichong Meng · Changdi Yang · Jun Liu · Hao Tang · Pu Zhao · Yanzhi Wang

Recent advances in image editing have been driven by the development of denoising diffusion models, marking a significant leap forward in this field. Despite these advances, the generalization capabilities of recent image editing approaches remain constrained. In response to this challenge, our study introduces a novel image editing framework with enhanced generalization robustness by boosting in-context learning capability and unifying language instruction. This framework incorporates a module specifically optimized for image editing tasks, leveraging the VMamba Block and an editing-shift matching strategy to augment in-context learning. Furthermore, we unveil a selective area-matching technique specifically engineered to address and rectify corrupted details in generated images, such as human facial features, to further improve the quality. Another key innovation of our approach is the integration of a language unification technique, which aligns language embeddings with editing semantics to elevate the quality of image editing. Moreover, we compile the first dataset for image editing with visual prompts and editing instructions that could be used to enhance in-context capability. Trained on this dataset, our methodology not only achieves superior synthesis quality for trained tasks, but also demonstrates robust generalization capability across unseen vision tasks through tailored prompts.


# 314
The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization

Jiafeng Mao · Xueting Wang · Kiyoharu Aizawa

Text-to-image diffusion models allow users control over the content of generated images. Still, text-to-image generation occasionally leads to generation failure requiring users to generate dozens of images under the same text prompt before they obtain a satisfying result. We formulate the lottery ticket hypothesis in denoising: randomly initialized Gaussian noise images contain special pixel blocks (winning tickets) that naturally tend to be denoised into specific content independently. The generation failure in standard text-to-image synthesis is caused by the gap between optimal and actual spatial distribution of winning tickets in initial noisy images. To this end, we implement semantic-driven initial image construction creating initial noise from known winning tickets for each concept mentioned in the prompt. We conduct a series of experiments that verify the properties of winning tickets and demonstrate that the winning tickets have generalizability across images and prompts. Our results show that aggregated winning tickets effectively induce the model to spontaneously generate the object at the corresponding location.
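
A minimal way to picture the semantic-driven initialization is to paste stored winning-ticket noise blocks at the locations the prompt calls for and then re-standardize the result. The sketch below is a hypothetical illustration only; the ticket shape, placement, and renormalization are assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of semantic-driven initial noise construction: place
# pre-collected "winning ticket" noise blocks at prompt-specified locations,
# then renormalize so the result stays approximately standard Gaussian.
import torch


def build_initial_noise(shape, tickets):
    """shape: (C, H, W); tickets: list of (noise_block, (top, left)) pairs."""
    noise = torch.randn(shape)
    for block, (top, left) in tickets:
        c, h, w = block.shape
        noise[:, top:top + h, left:left + w] = block   # place each winning ticket
    # Re-standardize so the diffusion model still sees roughly unit-variance noise.
    return (noise - noise.mean()) / noise.std()


# Usage: a stored 32x32 ticket that tends to denoise into a given object, placed on the right.
object_ticket = torch.randn(4, 32, 32)                 # stand-in for a stored ticket
init = build_initial_noise((4, 64, 64), [(object_ticket, (16, 32))])
```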


# 319
CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models

Nick Stracke · Stefan Andreas Baumann · Joshua Susskind · Miguel Angel Bautista · Bjorn Ommer

Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to take into account detailed forms of conditioning reflecting style and/or structure information remains an open problem. In this paper, we present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient and powerful approach to condition text-to-image diffusion models, which enables fine-grained control conditioning during generation and outperforms recent state-of-the-art approaches.
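
One way to read the conditional LoRA block is as a standard low-rank update whose inner activations are gated by an embedding of the conditioning signal (style or structure features). The sketch below is a minimal assumption-laden illustration of that idea; the dimensions, zero-initialization, and modulation form are not taken from the paper.

```python
# Minimal sketch of a conditional LoRA layer: the low-rank update is scaled
# rank-wise by an embedding of the conditioning signal. Sizes are assumptions.
import torch
import torch.nn as nn


class ConditionalLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, cond_dim: int = 768):
        super().__init__()
        self.base = base                                  # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                    # start as an identity mapping
        self.to_scale = nn.Linear(cond_dim, rank)         # condition -> rank-wise gates

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gates = self.to_scale(cond)                       # (batch, rank)
        delta = self.up(self.down(x) * gates.unsqueeze(1))
        return self.base(x) + delta


layer = ConditionalLoRALinear(nn.Linear(320, 320))
out = layer(torch.randn(2, 77, 320), cond=torch.randn(2, 768))
```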


# 326
Strong Double Blind
Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

Soyeong Kwon · TAEGYEONG LEE · Taehwan Kim

Text-based image editing and generation methods have diverse real-world applications. However, text-guided infinite image synthesis faces several challenges. First, there is a lack of text-image paired datasets with high resolution and contextual diversity. Second, expanding images based on text requires global coherence and rich local context understanding. Previous studies have mainly focused on limited categories, such as natural landscapes, and also required training on high-resolution images with paired text. To address these challenges, we propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding, without any high-resolution text-image paired training dataset. We train the diffusion model to expand an image conditioned on global and local captions generated by the LLM and on visual features. At the inference stage, given an image and a global caption, we use the LLM to generate the next local caption to expand the input image. Then, we expand the image using the global caption, the generated local caption, and the visual features to account for global consistency and spatial local context. In experiments, our model outperforms the baselines both quantitatively and qualitatively. Furthermore, our model demonstrates the capability of text-guided arbitrary-sized image generation in a zero-shot manner with LLM guidance.


# 317
Strong Double Blind
Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Yifu Chen · Jingwen Chen · Yingwei Pan · Yehao Li · Ting Yao · Zhineng Chen · Tao Mei

Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial for two reasons: 1) solely relying on a single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion over state-of-the-art methods.


# 329
Strong Double Blind
Customized Generation Reimagined: Fidelity and Editability Harmonized

Jian Jin · Yang Shen · Zhenyong Fu · Jian Yang

Customized generation aims to incorporate a novel concept into a pre-trained text-to-image model, enabling new generations of the concept in novel contexts guided by textual prompts. However, customized generation suffers from an inherent trade-off between concept fidelity and editability, i.e., between precisely modeling the concept and faithfully adhering to the prompts. Previous methods reluctantly seek a compromise and struggle to achieve both high concept fidelity and ideal prompt alignment simultaneously. In this paper, we propose a "Divide, Conquer, then Integrate" (DCI) framework, which performs a surgical adjustment in the early stage of denoising to liberate the fine-tuned model from the fidelity-editability trade-off at inference. The two conflicting components of the trade-off are decoupled and individually conquered by two collaborative branches, which are then selectively integrated to preserve high concept fidelity while achieving faithful prompt adherence. To obtain a better fine-tuned model, we introduce an Image-specific Context Optimization (ICO) strategy for model customization. ICO replaces manual prompt templates with learnable image-specific contexts, providing an adaptive and precise fine-tuning direction that promotes overall performance. Extensive experiments demonstrate the effectiveness of our method in reconciling the fidelity-editability trade-off, particularly for generations with weak model priors. Code has been made anonymously available at https://anonymous.4open.science/r/DCI_ICO-90C2


# 331
Strong Double Blind
ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement

Muhammad Atif Butt · Kai Wang · Javier Vazquez-Corral · Joost Van de Weijer

Text-to-Image (T2I) generation has made significant advancements with the advent of diffusion models. These models exhibit a remarkable ability to produce images based on textual prompts. Current T2I models allow users to specify object colors using linguistic color names. However, these labels encompass broad color ranges, making it difficult to achieve precise color matching. To tackle this challenging task, termed color prompt learning, we propose to learn specific color prompts tailored to user-selected colors. These prompts are then employed to generate objects with the exact desired colors. Observing that existing T2I adaptation approaches cannot achieve satisfactory performance, we propose to generate basic geometric objects in the target color. Leveraging color and shape disentanglement, our method, denoted as ColorPeel, successfully assists T2I models in peeling off the novel color prompts from these colored shapes. In the experiments, we demonstrate the efficacy of ColorPeel in achieving precise color generation with T2I models and generalize ColorPeel to effectively learn abstract attribute concepts, including textures, materials, etc. Our findings provide a valuable step towards improving the precision and versatility of T2I models, offering new opportunities for creative applications and design tasks.


# 335
Strong Double Blind
ViPer: Visual Personalization of Generative Models via Individual Preference Learning

Sogand Salehi · Mahdi Shafiei · Roman Bachmann · Teresa Yeo · Amir Zamir

Personalized image generation involves creating images aligned with an individual’s visual preference. Current generative models are, however, tuned to produce outputs that appeal to a broad audience, and personalization to individual users' visual preferences relies on iterative and manual prompt engineering by the user, which is neither time-efficient nor scalable. We propose to personalize the image generation process by first inviting users to comment on a small selection of images, explaining why they like or dislike each. Based on these comments, we infer a user’s liked and disliked visual attributes, i.e., their visual preference, using a large language model. These attributes are used to guide a text-to-image model toward producing images that are personalized towards the individual user's visual preference. Through a series of user tests and large language model guided evaluations, we demonstrate that our proposed method results in generations that are well aligned with individual users' visual preferences.


# 320
MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

Yang Zhao · Zhisheng Xiao · Yanwu Xu · Haolin Jia · Tingbo Hou

The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and high latency. In this paper, we present \textbf{MobileDiffusion}, an ultra-efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to minimize model size and FLOPs, while preserving image generation quality. Additionally, we revisit the advanced sampling technique of diffusion-GAN, and make one-step sampling compatible with downstream applications trained on the base model. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. With them, MobileDiffusion achieves instant text-to-image generation on mobile devices, establishing a new state of the art.


# 312
MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation

Yuxiang WEI · Zhilong Ji · Jinfeng Bai · Hongzhi Zhang · Yabin Zhang · Wangmeng Zuo

Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images with human identities indicated by reference images. Although promising identity fidelity has been achieved by several tuning-free methods, they usually suffer from overfitting issues. The learned identity tends to entangle with irrelevant information, resulting in unsatisfactory text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both faithful identity fidelity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers image generation through additionally introduced cross-attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of our MasterWeaver with those of the original T2I model. Additionally, a face-augmented dataset is constructed to facilitate disentangled identity learning and further improve editability. Extensive experiments demonstrate that our MasterWeaver not only generates personalized images with faithful identity but also exhibits superior text controllability. Our code will be made publicly available.


# 313
Strong Double Blind
Towards Reliable Advertising Image Generation Using Human Feedback

Zhenbang Du · Wei Feng · Haohan Wang · Yaoyu Li · Jingsen Wang · Jian Li · Zheng Zhang · Jingjing Lv · Xin Zhu · Junsheng Jin · Junjie Shen · Zhangang Lin · Jingping Shao

In the e-commerce realm, compelling advertising images are pivotal for attracting customer attention. While generative models automate image generation, they often produce substandard images that may mislead customers and require significant labor costs to inspect. This paper delves into increasing the rate of available generated images. We first introduce a multi-modal Reliable Feedback Network (RFNet) to automatically inspect the generated images. Combining the RFNet into a recurrent process, Recurrent Generation, results in a higher number of available advertising images. To further enhance production efficiency, we fine-tune diffusion models with an innovative Consistent Condition regularization utilizing the feedback from RFNet (RFFT). This results in a remarkable increase in the available rate of generated images, reducing the number of attempts in Recurrent Generation and providing a highly efficient production process without sacrificing visual appeal. We also construct a Reliable Feedback 1 Million (RF1M) dataset comprising over one million human-annotated generated advertising images, which helps train RFNet to accurately assess the availability of generated images and faithfully reflect human feedback. Generally speaking, our approach offers a reliable solution for advertising image generation. The dataset and code will be released after acceptance.


# 328
IMMA: Immunizing text-to-image Models against Malicious Adaptation

Amber Yijia Zheng · Raymond Yeh

Advancements in open-sourced text-to-image models and fine-tuning methods have led to an increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful or unauthorized content. Recent works, e.g., Glaze or MIST, have developed data-poisoning techniques which protect the data against adaptation methods. In this work, we consider an alternative paradigm for protection. We propose to ``immunize'' the model by learning model parameters that are difficult for adaptation methods to fine-tune on malicious content; in short, IMMA. IMMA should be applied before the release of the model weights to mitigate these risks. Empirical results show IMMA's effectiveness against malicious adaptations, including mimicking of artistic styles and learning of inappropriate or unauthorized content, across three adaptation methods: LoRA, Textual-Inversion, and DreamBooth.


# 315
Strong Double Blind
PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control

Rishubh Parihar · Sachidanand VS · Sabariswaran Mani · Tejan Karmali · Venkatesh Babu Radhakrishnan

Recently, we have seen a surge of personalization methods for text-to-image (T2I) diffusion models to learn a concept using a few images. Existing approaches, when used for face personalization, struggle to achieve convincing inversion with identity preservation and rely on semantic text-based editing of the generated face. However, more fine-grained control is desired for facial attribute editing, which is challenging to achieve solely with text prompts. In contrast, StyleGAN models learn a rich face prior and enable smooth control towards fine-grained attribute editing by latent manipulation. This work uses the disentangled $\mathcal{W+}$ space of StyleGANs to condition the T2I model. This approach allows us to precisely manipulate facial attributes, such as smoothly introducing a smile, while preserving the existing coarse text-based control inherent in T2I models. To enable conditioning of the T2I model on the $\mathcal{W+}$ space, we train a latent mapper to translate latent codes from $\mathcal{W+}$ to the token embedding space of the T2I model. The proposed approach excels in the precise inversion of face images with attribute preservation and facilitates continuous control for fine-grained attribute editing. Furthermore, our approach can be readily extended to generate compositions involving multiple individuals. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.
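
The bridge between the two spaces is the latent mapper from $\mathcal{W+}$ codes to token embeddings. The sketch below is a hypothetical, minimal version of such a mapper; the number of $\mathcal{W+}$ layers, pseudo-token count, and embedding width are assumptions, not the paper's actual design.

```python
# Minimal sketch of a latent mapper: StyleGAN W+ code -> pseudo-word token
# embeddings for a text-to-image model. All sizes are illustrative assumptions.
import torch
import torch.nn as nn


class WPlusMapper(nn.Module):
    def __init__(self, n_wplus: int = 18, w_dim: int = 512,
                 n_tokens: int = 2, token_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_wplus * w_dim, 1024), nn.GELU(),
            nn.Linear(1024, n_tokens * token_dim),
        )
        self.n_tokens, self.token_dim = n_tokens, token_dim

    def forward(self, w_plus: torch.Tensor) -> torch.Tensor:
        """w_plus: (B, 18, 512) -> (B, n_tokens, 768) pseudo-word embeddings."""
        tokens = self.net(w_plus.flatten(1))
        return tokens.view(-1, self.n_tokens, self.token_dim)


mapper = WPlusMapper()
face_tokens = mapper(torch.randn(1, 18, 512))   # inject into the T2I prompt embedding
```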


# 298
Strong Double Blind
AddMe: Zero-shot Group-photo Synthesis by Inserting People into Scenes

Dongxu Yue · Maomao Li · Yunfei Liu · AILING ZENG · Tianyu Yang · Qin Guo · Yu Li

While large text-to-image diffusion models have made significant progress in high-quality image generation, challenges persist when users insert their portraits into existing photos, especially group photos. Concretely, existing customization methods struggle to insert facial identities at desired locations in existing images, and it is difficult for existing local image editing methods to deal with facial details. To address these limitations, we propose AddMe, a powerful diffusion-based portrait generator that can insert a given portrait into a desired location in an existing scene image in a zero-shot manner. Specifically, we propose a novel identity adapter to learn a facial representation decoupled from existing characters in the scene. Meanwhile, to ensure that the generated portrait can interact properly with others in the existing scene, we design an enhanced portrait attention module to capture contextual information during the generation process. Our method is compatible with both text and various spatial conditions, enabling precise control over the generated portraits. Extensive experiments demonstrate significant improvements in both performance and efficiency.


# 299
Strong Double Blind
UniProcessor: A Text-induced Unified Low-level Image Processor

Huiyu Duan · Xiongkuo Min · Sijing Wu · Wei Shen · Guangtao Zhai

Image processing, including image restoration, image enhancement, etc., involves generating a high-quality clean image from a degraded input. Deep learning-based methods have shown superior performance for various image processing tasks under single-task conditions. However, they require training separate models for different degradation types and levels, which limits the generalization ability of these models and restricts their applications in the real world. In this paper, we propose a text-induced Unified image Processor for low-level vision tasks, termed UniProcessor, which can effectively process various degradation types and levels and supports multimodal control. Specifically, our UniProcessor encodes degradation-specific information with the subject prompt and processes degradations with the manipulation prompt. These context control features are injected into the UniProcessor backbone via cross-attention to control the processing procedure. For automatic subject-prompt generation, we further build a vision-language model for general-purpose low-level degradation perception via instruction tuning. Our UniProcessor covers 30 degradation types, and extensive experiments demonstrate that it can handle these degradations well without additional training or tuning and outperforms other competing methods. Moreover, with the help of degradation-aware context control, our UniProcessor is the first to show the ability to individually handle a single distortion in an image with multiple degradations.


# 35
Strong Double Blind
Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models

Xiao Liu · Xiaoliu Guan · Yu Wu · Jiaxu Miao

Diffusion models, known for their tremendous ability to generate novel and high-quality samples, have recently raised concerns due to their data memorization behavior, which poses privacy risks. Recent approaches for memorization mitigation either focus only on the text modality in cross-modal generation tasks or rely on data augmentation strategies. In this paper, we propose a novel training framework for diffusion models from the perspective of the visual modality, which is more generic and fundamental for mitigating memorization. To facilitate ``forgetting'' of stored information in diffusion model parameters, we propose an iterative ensemble training strategy that splits the data into multiple shards for training multiple models and intermittently aggregates their parameters. Moreover, a practical analysis of the losses shows that the training loss for easily memorized images tends to be noticeably lower. We therefore propose an anti-gradient control method that excludes samples with lower loss values from the current mini-batch to avoid memorization. Extensive experiments and analysis on three datasets illustrate the effectiveness of our method, and the results show that it successfully reduces memorization while even slightly improving performance. Moreover, to save computing cost, we successfully apply our method to fine-tune well-trained diffusion models within a limited number of epochs, demonstrating its applicability.
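
The anti-gradient control step reduces to dropping the lowest-loss (most memorization-prone) samples before backpropagation. The sketch below is a minimal illustration under assumed inputs (a vector of per-sample denoising losses and a drop ratio), not the paper's exact implementation.

```python
# Hypothetical sketch of anti-gradient control: exclude the lowest-loss samples
# from the mini-batch so easily memorized images contribute no gradient.
import torch


def anti_gradient_loss(per_sample_loss: torch.Tensor, drop_ratio: float = 0.1) -> torch.Tensor:
    """per_sample_loss: (batch,) denoising losses; returns the loss over kept samples."""
    k = int(per_sample_loss.numel() * drop_ratio)
    if k == 0:
        return per_sample_loss.mean()
    threshold = per_sample_loss.kthvalue(k).values       # k-th smallest loss
    keep = per_sample_loss > threshold                   # drop the k easiest samples
    return per_sample_loss[keep].mean()


loss = anti_gradient_loss(torch.rand(32, requires_grad=True))
loss.backward()
```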


# 301
Strong Double Blind
EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models

Lee Eungbean · Somi Jeong · Kwanghoon Sohn

Exemplar-guided image translation, which synthesizes photo-realistic images that conform to both structural controls and style exemplars, is attracting attention due to its ability to enhance user control over style manipulation. Previous methodologies have predominantly depended on establishing dense correspondences across cross-domain inputs. Despite these efforts, they incur quadratic memory and computational costs for establishing dense correspondence, resulting in limited versatility and performance degradation. In this paper, we propose a novel approach termed Exemplar-guided Image Translation with Brownian-Bridge Diffusion Models (EBDM). Our method formulates the task as a stochastic Brownian bridge process, a diffusion process with a fixed initial point given by the structure control, that translates it into the corresponding photo-realistic image while being conditioned solely on the given exemplar image. To efficiently guide the diffusion process toward the style of the exemplar, we delineate three pivotal components: the Global Encoder, the Exemplar Network, and the Exemplar Attention Module, which incorporate global and detailed texture information from exemplar images. Leveraging bridge diffusion, the network is able to translate images from the structure control while being exclusively conditioned on the exemplar style, leading to more robust training and inference processes. We illustrate the superiority of our method over competing approaches through comprehensive benchmark evaluations and visual results.


# 32
Strong Double Blind
Assessing Sample Quality via the Latent Space of Generative Models

Jingyi Xu · Hieu Le · Dimitris Samaras

Advances in generative models increase the need for sample quality assessment. To do so, previous methods rely on a pre-trained feature extractor to embed the generated samples and real samples into a common space for comparison. However, different feature extractors might lead to inconsistent assessment outcomes. Moreover, these methods are not applicable for domains where a robust, universal feature extractor does not yet exist, such as medical images or 3D assets. In this paper, we propose to directly examine the latent space of the trained generative model to infer generated sample quality. This is feasible because the quality of a generated sample directly relates to the amount of training data resembling it, and we can infer this information by examining the density of the latent space. Accordingly, we use a latent density score function to quantify sample quality. We show that the proposed score correlates highly with the sample quality for various generative models including VAEs, GANs and Latent Diffusion Models. Compared with previous quality assessment methods, our method has the following advantages: 1) pre-generation quality estimation with reduced computational cost, 2) generalizability to various domains and modalities, and 3) applicability to latent-based image editing and generation methods. Extensive experiments demonstrate that our proposed method can benefit downstream tasks such as few-shot image classification and latent face image editing.
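
The core intuition (denser regions of the latent space correspond to better-supported, higher-quality samples) can be mimicked with a simple density estimator. The sketch below uses a Gaussian KDE as a stand-in; the bandwidth, dimensionality, and choice of estimator are assumptions, not the paper's latent density score.

```python
# Minimal sketch: score a latent code by its density under a KDE fit on
# latents of real/training data. Purely illustrative, not the paper's estimator.
import numpy as np
from scipy.stats import gaussian_kde

train_latents = np.random.randn(5000, 16)          # stand-in for encoded real data
kde = gaussian_kde(train_latents.T)                # density model over the latent space


def latent_density_score(z: np.ndarray) -> float:
    """Higher score = latent lies in a denser (better supported) region."""
    return float(kde.logpdf(z.reshape(-1, 1))[0])


print(latent_density_score(np.zeros(16)))          # near the mode: high score
print(latent_density_score(np.full(16, 6.0)))      # far in the tail: low score
```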


# 36
Strong Double Blind
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection

Alireza Ganjdanesh · Yan Kang · Yuchen Liu · Richard Zhang · Zhe Lin · Heng Huang

Diffusion probabilistic models can generate high-quality samples. Yet, their sampling process requires numerous denoising steps, making it slow and computationally intensive. We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts. First, we study the similarities between pairs of denoising timesteps, observing a natural clustering, even across different datasets. This suggests that rather than having a single model for all time steps, separate models can serve as ``experts'' for their respective time intervals. As such, we separately fine-tune the pretrained model on each interval, with elastic dimensions in depth and width, to obtain experts specialized in their corresponding denoising interval. To optimize the resource usage between experts, we introduce our Expert Routing Agent, which learns to select a set of proper network configurations. By doing so, our method can allocate the computing budget between the experts in an end-to-end manner without requiring manual heuristics. Finally, with a selected configuration, we fine-tune our pruned experts to obtain our mixture of efficient experts. We demonstrate the effectiveness of our method, DiffPruning, across several datasets, LSUN-Church, LSUN-Bedroom, FFHQ, and ImageNet, on the Latent Diffusion Model architecture.


# 23
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Nanye Ma · Mark Goldstein · Michael Albergo · Nicholas M Boffi · Eric Vanden-Eijnden · Saining Xie

We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: learning in discrete or continuous time, the objective function, the interpolant that connects the distributions, and deterministic or stochastic sampling. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet $256 \times 256$ and $512 \times 512$ benchmark using the exact same model structure, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06 and 2.62, respectively.
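
For readers unfamiliar with the interpolant framework, the sketch below writes out one concrete linear interpolant, assuming $\alpha(t)=1-t$ and $\sigma(t)=t$, together with the velocity target used for training; this is a generic illustration of the framework, not SiT's specific configuration.

```python
# Minimal sketch of a linear stochastic interpolant connecting data (t=0) to
# noise (t=1), with its velocity target. Schedule choice is an assumption.
import torch


def interpolant(x0: torch.Tensor, eps: torch.Tensor, t: torch.Tensor):
    """x0: data batch, eps: Gaussian noise, t: (batch,) times in [0, 1]."""
    t = t.view(-1, *([1] * (x0.dim() - 1)))            # broadcast over image dims
    x_t = (1 - t) * x0 + t * eps                        # x_t = alpha(t) x0 + sigma(t) eps
    velocity = eps - x0                                 # d x_t / dt for this schedule
    return x_t, velocity


x0, eps = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
x_t, v_target = interpolant(x0, eps, torch.rand(8))
# Training then minimizes ||model(x_t, t) - v_target||^2 over random t.
```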


# 242
Strong Double Blind
Efficient Training with Denoised Neural Weights

Yifan Gong · Zheng Zhan · Yanyu Li · Yerlan Idelbayev · Andrey Zharkov · Kfir Aberman · Sergey Tulyakov · Yanzhi Wang · Jian Ren

Good weight initialization serves as an effective measure to reduce the training cost of a deep neural network (DNN) model. The choice of how to initialize parameters is challenging and may require manual tuning, which can be time-consuming and prone to human error. To overcome such limitations, this work takes a novel step towards building a weight generator to synthesize the neural weights for initialization. We use the image-to-image translation task with generative adversarial networks (GANs) as an example due to the ease of collecting model weights spanning a wide range. Specifically, we first collect a dataset with various image editing concepts and their corresponding trained weights, which are later used for the training of the weight generator. To address the different characteristics among layers and the substantial number of weights to be predicted, we divide the weights into equal-sized blocks and assign each block an index. Subsequently, a diffusion model is trained with such a dataset using both text conditions of the concept and the block indexes. By initializing the image translation model with the denoised weights predicted by our diffusion model, the training requires only 43.3 seconds. Compared to training from scratch (i.e., Pix2pix), we achieve a 15X training time acceleration for a new concept while obtaining even better image generation quality.


# 295
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

Linjiang Huang · Rongyao Fang · Aiping Zhang · Guanglu Song · Si Liu · Yu Liu · Hongsheng LI

In this study, we delve into the generation of high-resolution images from pre-trained diffusion models, addressing persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their trained resolutions. To address this issue, we introduce FouriScale, an innovative, training-free approach based on frequency-domain analysis. We replace the original convolutional layers in pre-trained diffusion models by incorporating a dilation technique along with a low-pass operation, intending to achieve structural consistency and scale consistency across resolutions, respectively. Further enhanced by a padding-then-crop strategy, our method can flexibly handle text-to-image generation of various aspect ratios. Using FouriScale as guidance, our method successfully balances the structural integrity and fidelity of generated images, achieving arbitrary-size, high-resolution, and high-quality generation without bells and whistles. With its simplicity and compatibility, our approach can provide valuable insights for future explorations into the synthesis of ultra-high-resolution images.
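
The low-pass half of the recipe can be illustrated directly in the frequency domain of a feature map. The sketch below is a hypothetical, simplified filter (an ideal box low-pass on FFT coefficients); the cutoff ratio and filter shape are assumptions, and the dilation side of the method is not shown.

```python
# Hypothetical sketch of a frequency-domain low-pass on diffusion feature maps,
# in the spirit of pairing dilation with a low-pass operation for scale consistency.
import torch


def lowpass_features(feat: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """feat: (B, C, H, W); keeps only the central (low) frequencies."""
    B, C, H, W = feat.shape
    freq = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    mask = torch.zeros(H, W)
    h, w = int(H * keep_ratio / 2), int(W * keep_ratio / 2)
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1.0   # central low-freq box
    return torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real


smoothed = lowpass_features(torch.randn(1, 64, 128, 128))
```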


# 327
A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

Junhao Zhuang · Yanhong Zeng · WENRAN LIU · Chun Yuan · Kai Chen

Achieving high-quality versatile image inpainting, wherein user-specified regions are seamlessly filled with plausible content based on user intent, presents a significant challenge. Existing methodologies typically tackle this challenge by training separate models for distinct repair tasks, such as context-aware image inpainting and text-guided object inpainting, due to the need for different optimal training strategies. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels in both tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies to guide the model's focus on different inpainting targets explicitly. This enables PowerPaint to accomplish various inpainting tasks by utilizing different task prompts, resulting in state-of-the-art performance. Second, we demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal. Moreover, we leverage prompt interpolation techniques to enable controllable shape-guided object inpainting, enhancing the model's applicability in shape-guided applications. Finally, we extensively evaluate PowerPaint on various inpainting benchmarks to demonstrate its superior performance for versatile image inpainting. We will release the codes and models publicly, facilitating further research in the field.


# 253
Strong Double Blind
Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing

Zizheng Yang · Hu Yu · Bing Li · Jinghao Zhang · Jie Huang · Feng Zhao

Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by the retraining of diffusion models, coupled with the extensive sampling steps during the inference, limit the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI$^2$D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and degradation characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI$^2$D avoids re-training diffusion models and iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods.


# 243
Strong Double Blind
DSMix: Distortion-Induced Saliency Map Based Pre-training for No-Reference Image Quality Assessment

Jinsong Shi · Xiaojiang Peng · Jie Qin

Image quality assessment (IQA) has long been a fundamental challenge in image understanding. In recent years, deep learning-based IQA methods have shown promising performance. However, the lack of large amounts of labeled data in the IQA field has hindered further advancements in these methods. This paper introduces DSMix, a novel data augmentation technique specifically designed for IQA tasks, aiming to overcome this limitation. DSMix leverages the distortion-induced sensitivity map (DSM) of an image as prior knowledge. It applies cut and mix operations to diverse categories of synthetic distorted images, assigning confidence scores to class labels based on the aforementioned prior knowledge. In the pre-training phase using DSMix-augmented data, knowledge distillation is employed to enhance the model's ability to extract semantic features. Experimental results on both synthetic and authentic IQA datasets demonstrate the significant predictive and generalization performance achieved by DSMix, without requiring fine-tuning of the full model. The code will be made publicly available.


# 244
DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior

Xinqi Lin · Jingwen He · Ziyan Chen · Zhaoyang Lyu · Bo Dai · Fanghua Yu · Yu Qiao · Wanli Ouyang · Chao Dong

We present DiffBIR, a two-stage restoration pipeline that handles blind image restoration tasks in a unified framework. In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results. For the second stage, we propose IRControlNet that leverages the generative ability of latent diffusion models to generate realistic details. Specifically, IRControlNet is trained based on specially produced condition images without distracting noisy content for stable generation performance. Moreover, we design a region-adaptive restoration guidance that can modify the denoising process during inference without model re-training, allowing users to balance realness and fidelity through a tunable guidance scale. Extensive experiments have demonstrated DiffBIR's superiority over state-of-the-art approaches for blind image super-resolution, blind face restoration and blind image denoising tasks on both synthetic and real-world datasets.


# 235
Strong Double Blind
Restoring Images in Adverse Weather Conditions via Histogram Transformer

Shangquan Sun · Wenqi Ren · Xinwei Gao · Rui Wang · Xiaochun Cao

Transformer-based image restoration methods for adverse weather have achieved significant progress. Most of them use self-attention along the channel dimension or within spatially fixed-range blocks to reduce the computational load. However, such a compromise limits the ability to capture long-range spatial features. Inspired by the observation that weather-induced degradation factors mainly cause similar patterns of occlusion and brightness change, in this work we propose an efficient Histogram Transformer (Histoformer) for restoring images affected by adverse weather. It is powered by a new mechanism dubbed histogram self-attention, which sorts and segments spatial features into intensity-based bins. Self-attention is then applied across bins or within each bin to selectively focus on spatial features over a dynamic range and to process similarly degraded pixels over long distances together. To boost histogram self-attention, we present a dynamic-range convolution that enables conventional convolution to operate over similar pixels rather than neighboring pixels. We also observe that common pixel-wise losses neglect the linear association and correlation between the output and the ground truth. Thus, we propose to leverage the Pearson correlation coefficient as a loss function to enforce that the recovered pixels follow the same ordering as the ground truth. Extensive experiments demonstrate the efficacy and superiority of our proposed method.
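
The correlation term is straightforward to express as a loss. The sketch below computes a Pearson-correlation loss between flattened restored and ground-truth images, assuming it is simply added to the usual pixel-wise term; the weighting and exact normalization are assumptions.

```python
# Minimal sketch of a Pearson-correlation loss between restored and
# ground-truth images (1 minus the mean per-sample correlation).
import torch


def pearson_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """pred, target: (B, C, H, W); returns 1 - mean Pearson correlation."""
    p = pred.flatten(1) - pred.flatten(1).mean(dim=1, keepdim=True)
    t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)
    corr = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    return (1 - corr).mean()


loss = pearson_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```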


# 246
You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation

Mehdi Noroozi · Isma Hadji · Brais Martinez · Adrian Bulat · Georgios Tzimiropoulos

In this paper, we introduce YONOS-SR, a novel Stable Diffusion-based approach for image super-resolution that yields state-of-the-art results using only a single DDIM step. Specifically, we propose a novel scale distillation approach to train our SR model. Instead of directly training our SR model on the scale factor of interest, we start by training a teacher model on a smaller magnification scale, thereby making the SR problem simpler for the teacher. We then train a student model for a higher magnification scale, using the predictions of the teacher as targets during training. This process is repeated iteratively until we reach the target scale factor of the final model. The rationale behind our scale distillation is that the teacher aids the student diffusion model's training by i) providing a target adapted to the current noise level rather than using the same ground-truth target for all noise levels, and ii) providing an accurate target, as the teacher has a simpler task to solve. We empirically show that the distilled model significantly outperforms a model trained directly for high scales, especially with few inference steps. Having a strong diffusion model that requires only one step allows us to freeze the U-Net and fine-tune the decoder on top of it. We show that the combination of the spatially distilled U-Net and fine-tuned decoder outperforms state-of-the-art methods that require 200 steps using only a single step.


# 240
Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution

Junxiong Lin · Yan Wang · Zeng Tao · Boyang Wang · Qing Zhao · Haoran Wang · Xuan Tong · Xinji Mai · Yuxuan Lin · Wei Song · Jiawen Yu · Shaoqi Yan · Wenqiang Zhang

Pre-trained diffusion models used for image generation encapsulate a substantial reservoir of prior knowledge about intricate textures. Harnessing this prior knowledge for image super-resolution presents a compelling avenue. Nonetheless, prevailing diffusion-based methodologies presently overlook the constraints imposed by degradation information on the diffusion process. Furthermore, these methods fail to consider the spatial variability inherent in the estimated blur kernel, stemming from factors such as motion jitter and out-of-focus elements in open-environment scenarios. This oversight results in a considerable deviation of the super-resolved image from the underlying reality. To address these concerns, we introduce a framework known as Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution (SSR). Within the SSR framework, we propose a Spatially Variant Kernel Refinement (SVKR) module. SVKR estimates a Depth-Informed Kernel, which takes depth information into account and is spatially variant. Additionally, SVKR enhances the accuracy of the depth information acquired from LR images, allowing for mutual enhancement between the depth map and blur kernel estimates. Finally, we introduce the Adaptive Multi-Modal Fusion Module (AMF) to align the information from three modalities: low-resolution images, depth maps, and blur kernels. This alignment constrains the diffusion model to generate more authentic SR results. Quantitative and qualitative experiments affirm the superiority of our approach, while ablation experiments corroborate the effectiveness of the proposed modules.


# 245
Strong Double Blind
Efficient Cascaded Multiscale Adaptive Network for Image Restoration

Yichen Zhou · Pan Zhou · Teck Khim Ng

Image restoration, encompassing tasks such as deblurring, denoising, and super-resolution, remains a pivotal area in computer vision. However, efficiently addressing the spatially varying artifacts of various low-quality images with local adaptiveness and handling their degradations at different scales poses significant challenges. To efficiently tackle these issues, we propose the novel \textit{Efficient Cascaded Multiscale Adaptive} (ECMA) Network. ECMA employs Local Adaptive Module, LAM, which dynamically adjusts convolution kernels across local image regions to efficiently handle varying artifacts. Thus, LAM addresses the local adaptiveness challenge more efficiently than costlier mechanisms like self-attention, due to its less computationally intensive convolutions. To construct a basic ECMA block, three cascading LAMs with convolution kernels from large to small sizes are employed to capture features at different scales. This cascaded multiscale learning effectively handles degradations at different scales, critical for diverse image restoration tasks. Finally, ECMA blocks are stacked in a U-Net architecture to build ECMA networks, which efficiently achieve both local adaptiveness and multiscale processing. Experiments show ECMA's high performance and efficiency, achieving comparable or superior restoration performance to state-of-the-art methods while reducing computational costs by 1.2$\times$ to 9.7$\times$ across various image restoration tasks, e.g., image deblurring, denoising and super-resolution. Our code and models will be released.


# 257
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Kihong Kim · Haneol Lee · Jihye Park · Seyeon Kim · Kwang Hee Lee · Seungryong Kim · Jaejun Yoo

Generating high-quality videos that synthesize desired realistic content is a challenging task due to the intricate high dimensionality and complexity of videos. Several recent diffusion-based methods have shown comparable performance by compressing videos to a lower-dimensional latent space, using traditional video autoencoder architectures. However, such methods, which employ standard frame-wise 2D and 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which can capture spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder which extracts a disentangled representation of the video including: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving the video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality, supporting a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).


# 252
Strong Double Blind
Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Wei Shang · Dongwei Ren · Wanying Zhang · Yuming Fang · Wangmeng Zuo · Kede Ma

Arbitrary-scale video super-resolution is a challenging task that involves generating high-resolution videos of arbitrary sizes from low-resolution videos while preserving fine details and ensuring temporal consistency between consecutive frames. In this study, we present ArbVSR, an efficient and effective framework for arbitrary-scale video super-resolution. Specifically, ArbVSR builds upon a flow-guided recurrent unit to capture temporal dependencies and utilizes local window aggregation to exploit future frames. To better leverage scale information, we generate spatially varying maps from all stages of pre-trained deep neural networks as structural and textural priors, which can identify regions with a high probability of containing texture. These priors effectively guide the super-resolution process to generate more visually pleasing and accurate results across different scale factors. In the upsampling phase, we propose a scale-sensitive and data-independent hypernetwork to generate continuous upsampling weights for arbitrary-scale video super-resolution, which can be computed during pre-processing to improve efficiency. Extensive experiments demonstrate the significant advantages of our method in terms of both performance and efficiency. The source code and trained models will be publicly available.


# 284
Taming Lookup Tables for Efficient Image Retouching

Sidi Yang · Binxiao Huang · Mingdeng Cao · Yatai Ji · Hanzhong Guo · Ngai Wong · Yujiu Yang

The widespread use of high-definition screens in edge devices, such as end-user cameras, smartphones, and televisions, is spurring a significant demand for image enhancement. Existing enhancement models often optimize for high performance while falling short of reducing hardware inference time and power consumption, especially on edge devices with constrained computing and storage resources. To this end, we propose Image Color Enhancement LookUp Table (ICELUT), which adopts LUTs for extremely efficient edge inference, without any convolutional neural network (CNN). During training, we leverage pointwise (1×1) convolution to extract color information, alongside a split fully connected layer to incorporate global information. Both components are then seamlessly converted into LUTs for hardware-agnostic deployment. ICELUT achieves near-state-of-the-art performance and remarkably low power consumption. We observe that the pointwise network structure exhibits robust scalability, maintaining performance even with a heavily downsampled 32×32 input image. These enable ICELUT, the first-ever purely LUT-based image enhancer, to reach an unprecedented speed of 0.5ms on GPU and 7ms on CPU, at least one order of magnitude faster than any CNN solution.
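
The general idea of converting a pointwise network into a LUT can be shown in a few lines: enumerate all quantized color inputs once, record the network's outputs, and replace inference with table indexing. The sketch below is a toy illustration under assumed 32-level quantization and an untrained stand-in network; it is not ICELUT's actual architecture or conversion pipeline.

```python
# Hypothetical sketch of baking a pointwise (1x1) color mapping into a 3D LUT
# over quantized RGB inputs, so inference needs only indexing, no convolution.
import torch
import torch.nn as nn

# Toy stand-in for a trained pointwise color network (untrained here).
pointwise = nn.Sequential(nn.Conv2d(3, 16, 1), nn.ReLU(), nn.Conv2d(16, 3, 1))

# Enumerate every quantized (r, g, b) combination once and record the outputs.
levels = torch.linspace(0, 1, 32)
r, g, b = torch.meshgrid(levels, levels, levels, indexing="ij")
grid = torch.stack([r, g, b], dim=0).reshape(1, 3, 32, 32 * 32)
with torch.no_grad():
    lut = pointwise(grid).reshape(3, 32, 32, 32)       # baked 3D LUT


def apply_lut(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) in [0, 1]; pure table lookup at inference time."""
    idx = (image * 31).round().long().clamp(0, 31)
    return lut[:, idx[0], idx[1], idx[2]]


out = apply_lut(torch.rand(3, 256, 256))
```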


# 251
Strong Double Blind
Quanta Video Restoration

PRATEEK CHENNURI · Yiheng Chi · Enze Jiang · GM Dilshan Godaliyadda · Abhiram Gnanasambandam · Hamid R Sheikh · Istvan Gyongy · Stanley Chan

The proliferation of single-photon image sensors has opened the door to a plethora of high-speed and low-light imaging applications. However, data collected by these sensors are often 1-bit or few-bit, and corrupted by noise and strong motion. Conventional video restoration methods are not designed to handle this situation, while specialized quanta burst algorithms have limited performance when the number of input frames is low. In this paper, we introduce Quanta Video Restoration (QUIVER), an end-to-end trainable network built on the core ideas of classical quanta restoration methods, i.e., pre-filtering, flow estimation, fusion, and refinement. We also collect and publish I2-2000FPS, a high-speed video dataset with the highest temporal resolution of 2000 frames-per-second, for training and testing. On simulated and real data, QUIVER outperforms existing quanta restoration methods by a significant margin. Code and dataset available at https://github.com/chennuriprateek/QuantaVideoRestoration-QUIVER-


# 258
Strong Double Blind
Two-Stage Video Shadow Detection via Temporal-Spatial Adaption

Xin Duan · Yu Cao · Lei Zhu · Gang Fu · Xin WANG · Renjie ZHANG · Ping Li

Video Shadow Detection (VSD) is an important computer vision task focusing on detecting and segmenting shadows throughout the entire video sequence. Despite their remarkable performance, existing VSD methods and datasets mainly focus on the dominant and isolated shadows. Consequently, VSD under complex scenes is still an unexplored challenge. To address this issue, we built a new dataset, Complex Video Shadow Dataset (CVSD), which contains 196 video clips including 19,757 frames with complex shadow patterns, to enhance the practical applicability of VSD. We propose a two-stage training paradigm and a novel network to handle complex dynamic shadow scenarios. Regarding the complex video shadow detection as conditioned feature adaption, we propose temporal- and spatial-adaption blocks for incorporating temporal information and attaining high-quality shadow detection, respectively. To the best of our knowledge, we are the first to construct the dataset and model tailored for the complex VSD task. Experimental results show the superiority of our model over state-of-the-art VSD methods. Our project will be publicly available at: https://hizuka590.github.io/CVSD.


# 239
Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework

Jingjing Zheng · Wanglong Lu · Wenzhe Wang · Yankai Cao · Xiaoqin Zhang · Xianta Jiang

Recently, numerous tensor singular value decomposition (t-SVD)-based tensor recovery methods have shown promise in processing visual data, such as color images and videos. However, these methods often suffer from severe performance degradation when confronted with tensor data exhibiting non-smooth changes, a situation commonly observed in real-world scenarios but ignored by traditional t-SVD-based methods. In this work, we introduce a novel tensor recovery model with a learnable tensor nuclear norm to address this challenge. We develop a new optimization algorithm named the Alternating Proximal Multiplier Method (APMM) to iteratively solve the proposed tensor completion model. Theoretical analysis demonstrates the convergence of the proposed APMM to the Karush–Kuhn–Tucker (KKT) point of the optimization problem. In addition, we propose a multi-objective tensor recovery framework based on APMM to efficiently explore the correlations of tensor data across its various dimensions, providing a new perspective on extending t-SVD-based methods to higher-order tensor cases. Numerical experiments demonstrate the effectiveness of the proposed method in tensor completion.


# 62
Strong Double Blind
Identity-Consistent Diffusion Network for Grading Knee Osteoarthritis Progression in Radiographic Imaging

Wenhua Wu · Kun Hu · Wenxi Yue · Wei Li · Milena Simic · Changyang Li · Wei Xiang · Zhiyong Wang

Knee osteoarthritis (KOA), a common form of arthritis that causes physical disability, has become increasingly prevalent in society, especially among the elderly. Employing computer-aided techniques to automatically assess the severity and progression of KOA can be greatly beneficial for KOA treatment and disease management. In particular, the advancement of X-ray technology and its application to KOA demonstrate its potential for this purpose. Yet, existing X-ray prognosis research generally yields a singular progression severity grade, overlooking the visual changes that could help understand and explain the progression outcome. Therefore, in this study, a novel deep generative model is proposed, namely the Identity-Consistent Radiographic Diffusion Network (IC-RDN), for multifaceted KOA prognosis encompassing a predicted future knee X-ray scan conditioned on the baseline scan and a future KOA severity grade. Specifically, an identity prior module for the diffusion and a downstream generative-guided progression prediction module are introduced. Compared to a conventional image-to-image generative model, identity priors regularize and guide the diffusion to focus more on the clinical nuances related to the prognosis, based on a contrastive learning strategy. The progression prediction module utilizes both the forecasted and baseline knee scans, enabling a more comprehensive formulation of KOA severity progression grading. Extensive experiments on a widely used public dataset, OAI, demonstrate the effectiveness of the proposed method.


# 71
NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration

Lin Tian · Thomas H Greer · Raul San Jose Estepar · Roni Sengupta · Niethammer Marc

This work proposes NePhi, a generalizable neural deformation model that yields approximately diffeomorphic transformations. In contrast to the predominant voxel-based transformation fields used in learning-based registration approaches, NePhi represents deformations functionally, leading to great flexibility within the design space of memory consumption during training and inference, inference time, registration accuracy, and transformation regularity. Specifically, NePhi 1) requires less memory compared to voxel-based learning approaches, 2) improves inference speed by predicting latent codes, compared to existing neural-deformation-based registration approaches that rely solely on optimization, 3) improves accuracy via instance optimization, and 4) shows excellent deformation regularity, which is highly desirable for medical image registration. We demonstrate the performance of NePhi on a 2D synthetic dataset as well as on real 3D lung registration. Our results show that NePhi can match the accuracy of voxel-based representations in a single-resolution registration setting. For multi-resolution registration, our method matches the accuracy of current SOTA learning-based registration approaches with instance optimization while reducing memory requirements by a factor of five.


# 232
Strong Double Blind
Neural Metamorphosis

Xingyi Yang · Xinchao Wang

This paper introduces a novel paradigm termed Neural Metamorphosis (NeuMeta), which aims to represent a continuous family of networks within a single versatile model. Unlike traditional methods that rely on separate models for different network tasks or sizes, NeuMeta enables an expansive continuum of neural networks that readily morph to fit various needs. The core mechanism is to train a neural implicit function that takes the desired network size and parameter coordinates as inputs, and generates exact corresponding weight values without requiring separate models for different configurations. Specifically, to achieve weight smoothness in a single model, we address the Shortest Hamiltonian Path problem within each neural clique graph. We maintain cross-model consistency by incorporating input noise during training. As such, NeuMeta may dynamically create arbitrary network parameters during the inference stage by sampling on the weight manifold. NeuMeta shows promising results in synthesizing parameters for unseen network configurations. Our extensive tests in image classification, semantic segmentation, and image generation reveal that NeuMeta sustains full-size performance even at a 75% compression rate.
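
A hedged sketch of the core idea under a toy interface (the names WeightINR and materialize_layer are hypothetical, not the authors' code): an implicit function maps a normalized parameter coordinate plus the desired width to a single weight value, so weight matrices for arbitrary sizes can be queried at inference time.

```python
import torch
import torch.nn as nn

class WeightINR(nn.Module):
    """Tiny implicit function: (row_coord, col_coord, target_width) -> weight value."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords).squeeze(-1)

def materialize_layer(inr: WeightINR, out_dim: int, in_dim: int, width_frac: float) -> torch.Tensor:
    """Query the INR on a coordinate grid to build an (out_dim x in_dim) weight matrix."""
    rows = torch.linspace(0, 1, out_dim)
    cols = torch.linspace(0, 1, in_dim)
    r, c = torch.meshgrid(rows, cols, indexing="ij")
    w = torch.full_like(r, width_frac)                    # desired network size as a condition
    coords = torch.stack([r, c, w], dim=-1).reshape(-1, 3)
    return inr(coords).reshape(out_dim, in_dim)

inr = WeightINR()
W_small = materialize_layer(inr, 32, 64, width_frac=0.25)   # a compressed layer
W_full = materialize_layer(inr, 128, 256, width_frac=1.0)   # the full-size layer
```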


# 254
Online Video Quality Enhancement with Spatial-Temporal Look-up Tables

Zefan Qu · Xinyang Jiang · Yifan Yang · Dongsheng Li · cairong zhao

Low latency is crucial for online video-based applications, such as video conferencing and cloud gaming, which makes improving video quality in online scenarios increasingly important. However, existing quality enhancement methods are limited by slow inference speed and the requirement for temporal information contained in future frames, making it challenging to deploy them directly in online tasks. In this paper, we propose a novel method, STLVQE, specifically designed to address the rarely studied online video quality enhancement (Online-VQE) problem. STLVQE designs a new VQE framework that contains a Module-Agnostic Feature Extractor, which greatly reduces redundant computation, and redesigns the propagation, alignment, and enhancement modules of the network. A Spatial-Temporal Look-up Tables (STL) module is proposed, which extracts spatial-temporal information in videos while saving substantial inference time. To the best of our knowledge, we are the first to exploit the LUT structure to extract temporal information in video tasks. Extensive experiments on the MFQE 2.0 dataset demonstrate that our STLVQE achieves a satisfactory performance-speed trade-off.


# 120
EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural Networks

Ziming Wang · Ziling Wang · Huaning Li · Lang Qin · Runhao Jiang · De Ma · Huajin Tang

Event cameras, with their high dynamic range and temporal resolution, are ideally suited for object detection, especially under scenarios with motion blur and challenging lighting conditions. However, while most existing approaches prioritize optimizing spatiotemporal representations with advanced detection backbones and early aggregation functions, the crucial issue of adaptive event sampling remains largely unaddressed. Spiking Neural Networks (SNNs), which operate on an event-driven paradigm through sparse spike communication, emerge as a natural fit for addressing this challenge. In this study, we discover that the neural dynamics of spiking neurons align closely with the behavior of an ideal temporal event sampler. Motivated by this insight, we propose a novel adaptive sampling module that leverages recurrent convolutional SNNs enhanced with temporal memory, facilitating a fully end-to-end learnable framework for event-based detection. Additionally, we introduce Residual Potential Dropout (RPD) and Spike-Aware Training (SAT) to regulate potential distribution and address performance degradation encountered in spike-based sampling modules. Through rigorous testing on neuromorphic datasets for event-based detection, our approach demonstrably surpasses existing state-of-the-art spike-based methods, achieving superior performance with significantly fewer parameters and time steps. For instance, our method achieves a 4.4% mAP improvement on the Gen1 dataset, while requiring 38% fewer parameters and three time steps. Moreover, the applicability and effectiveness of our adaptive sampling methodology extend beyond SNNs, as demonstrated through further validation on conventional non-spiking detection models.
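
To illustrate why spiking dynamics behave like a temporal event sampler, here is a toy leaky integrate-and-fire loop (the decay, threshold, and function name are assumptions for illustration, not the EAS-SNN module): the membrane potential integrates incoming event counts, and a "sample" is emitted whenever it crosses a threshold, so denser event streams are sampled more often.

```python
import numpy as np

def lif_sampler(event_counts: np.ndarray, tau: float = 0.9, threshold: float = 5.0):
    """Return the time bins at which the LIF neuron fires (i.e., a sample is taken)."""
    v = 0.0
    fire_times = []
    for t, x in enumerate(event_counts):
        v = tau * v + x          # leaky integration of the event stream
        if v >= threshold:       # spike: take a sample and reset the potential
            fire_times.append(t)
            v = 0.0
    return fire_times

# Sparse early activity, then a burst of motion: samples cluster around the burst.
counts = np.array([0, 1, 0, 0, 2, 8, 9, 7, 1, 0, 0, 3])
print(lif_sampler(counts))
```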


# 1
Strong Double Blind
LaWa: Using Latent Space for In-Generation Image Watermarking

Ahmad Rezaei · Mohammad Akbari · Saeed Ranjbar Alvar · Arezou Fatemi · Yong Zhang

With generative models producing high-quality images that are indistinguishable from real ones, there is growing concern regarding the malicious usage of AI-generated images. Imperceptible image watermarking is one viable solution to such concerns. Prior watermarking methods map the image to a latent space for adding the watermark. Moreover, Latent Diffusion Models (LDM) generate the image in the latent space of a pre-trained autoencoder. We argue that this latent space can be used to integrate watermarking into the generation process. To this end, we present LaWa, an in-generation image watermarking method designed for LDMs. By using coarse-to-fine multi-scale watermark embedding modules, LaWa modifies the latent space of pre-trained autoencoders and achieves high robustness against a wide range of image transformations while preserving the perceptual quality of the image. We show that LaWa can also be used as a general image watermarking method. Through extensive experiments, we demonstrate that LaWa outperforms previous works in perceptual quality, robustness against attacks, and computational complexity, while having a very low false positive rate. Code is available as supplementary materials.


# 337
PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments

rixin zhou · Ding Xia · YI ZHANG · honglin pang · Xi Yang · chuntao li

In this paper, we propose a learning-based image fragment pair-searching and -matching approach to solve the challenging restoration problem. Existing works use rule-based methods to match similar contour shapes or textures, whose hyperparameters are difficult to tune for extensive data and which are computationally time-consuming. Therefore, we propose a neural network that can effectively utilize neighboring textures together with contour shape information to fundamentally improve performance. First, we employ a graph-based network to extract the local contour and texture features of fragments. Then, for the pair-searching task, we adopt a linear transformer-based module to integrate these local features and use a contrastive loss to encode the global features of each fragment. For the pair-matching task, we design a weighted fusion module to dynamically fuse the extracted local contour and texture features, and formulate a similarity matrix for each pair of fragments to calculate the matching score and infer the adjacent segments of contours. To faithfully evaluate our proposed network, we collect a real dataset and generate a simulated image fragment dataset through an algorithm we designed that tears complete images into irregular fragments. The experimental results show that our proposed network achieves excellent pair-searching accuracy, reduces matching errors, and significantly reduces computational time. Details, source code, and data are available in our supplementary material.


# 2
Strong Double Blind
Delving into Adversarial Robustness on Document Tampering Localization

Huiru Shao · Zhuang Qian · Kaizhu Huang · Wei Wang · Xiaowei Huang · Qiufeng Wang

Recent advances in document forgery techniques produce malicious yet nearly visually untraceable alterations, posing a significant challenge for document tampering localization (DTL). Despite significant recent progress, there has been surprisingly limited exploration of adversarial robustness in DTL. This paper presents the first effort to uncover the vulnerability of most existing DTL models to adversarial attacks, highlighting the need for greater attention within the DTL community. In pursuit of robust DTL, we demonstrate that adversarial training can promote the model's robustness and effectively protect against adversarial attacks. As a notable advancement, we further introduce a latent manifold adversarial training approach that enhances adversarial robustness in DTL by incorporating perturbations on the latent manifold of adversarial examples, rather than exclusively relying on label-guided information. Extensive experiments on DTL benchmark datasets show the necessity of adversarial training, and our proposed manifold-based method significantly improves adversarial robustness against both white-box and black-box attacks.


# 3
Strong Double Blind
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Lorenzo Baraldi · Federico Cocchi · Marcella Cornia · Lorenzo Baraldi · Alessandro Nicolosi · Rita Cucchiara

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.
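
A hedged sketch of a contrastive objective with an added local-similarity term, in the spirit of the global-local training described above; the loss form, tensor shapes, and temperature are illustrative assumptions rather than the released CoDE code.

```python
import torch
import torch.nn.functional as F

def global_local_contrastive(g_a, g_b, l_a, l_b, temperature: float = 0.07):
    """
    g_a, g_b: (B, D) global embeddings of two views of the same images.
    l_a, l_b: (B, P, D) patch-level embeddings of the same two views.
    """
    g_a, g_b = F.normalize(g_a, dim=-1), F.normalize(g_b, dim=-1)
    logits = g_a @ g_b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(g_a.size(0), device=g_a.device)
    global_loss = F.cross_entropy(logits, targets)       # InfoNCE on global features

    l_a, l_b = F.normalize(l_a, dim=-1), F.normalize(l_b, dim=-1)
    local_sim = (l_a * l_b).sum(-1).mean(-1)             # mean patch-wise cosine similarity
    local_loss = (1.0 - local_sim).mean()                # encourage local agreement

    return global_loss + local_loss

g1, g2 = torch.randn(8, 256), torch.randn(8, 256)
p1, p2 = torch.randn(8, 49, 256), torch.randn(8, 49, 256)
loss = global_local_contrastive(g1, g2, p1, p2)
```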


# 344
Strong Double Blind
Forbes: Face Obfuscation Rendering via Backpropagation Refinement Scheme

Jintae Kim · Seungwon Yang · Seong-Gyun Jeong · Chang-Su Kim

A novel algorithm for face obfuscation, called Forbes, which aims to obfuscate facial appearance recognizable by humans while preserving the identity and attributes decipherable by machines, is proposed in this paper. Forbes first applies multiple obfuscating transformations with random parameters to an image to remove the identity information distinguishable by humans. Then, it optimizes the parameters to make the transformed image decipherable by machines based on the backpropagation refinement scheme. Finally, it renders an obfuscated image by applying the transformations with the optimized parameters. Experimental results on various datasets demonstrate that Forbes achieves excellent human indecipherability and machine decipherability. The source codes will be made publicly available.


# 7
Strong Double Blind
Prediction Exposes Your Face: Black-box Model Inversion via Prediction Alignment

Yufan Liu · Wanqian Zhang · Dayan Wu · Zheng Lin · jingzi Gu · Weiping Wang

Model inversion (MI) attacks reconstruct the private training data of a target model given its output, posing a significant threat to deep learning models and data privacy. On the one hand, most existing MI methods focus on searching for latent codes to represent the target identity, yet this iterative optimization-based scheme consumes a huge number of queries to the target model, making it unrealistic, especially in the black-box scenario. On the other hand, some training-based methods launch an attack through a single forward inference, but fail to directly learn high-level mappings from prediction vectors to images. Addressing these limitations, we propose a novel Prediction-to-Image (P2I) method for black-box MI attacks. Specifically, we introduce the Prediction Alignment Encoder to map the target model's output prediction into the latent code of StyleGAN. In this way, the prediction vector space can be well aligned with the more disentangled latent space, thus establishing a connection between prediction vectors and semantic facial features. During the attack phase, we further design the Aligned Ensemble Attack scheme to integrate complementary facial attributes of the target identity for better reconstruction. Experimental results show that our method outperforms other SOTAs; e.g., compared with RLB-MI, our method improves attack accuracy by 8.5% and reduces the number of queries by 99% on the CelebA dataset.


# 5
Strong Double Blind
Generalizable Facial Expression Recognition

Yuhang Zhang · Xiuqi Zheng · Chenyi Liang · Jiani Hu · Weihong Deng

SOTA facial expression recognition (FER) methods fail on test sets that have domain gaps with the train set. Recent domain adaptation FER methods need to acquire labeled or unlabeled samples of target domains to fine-tune the FER model, which might be infeasible in real-world deployment. In this paper, we aim to improve the zero-shot generalization ability of FER methods on different unseen test sets using only one train set. Inspired by how humans first detect faces and then select expression features, we propose a novel FER pipeline to extract expression-related features from any given face images. Our method is based on the generalizable face features extracted by large models like CLIP. However, it is non-trivial to adapt the general features of CLIP for specific tasks like FER. To preserve the generalization ability of CLIP and the high precision of the FER model, we design a novel approach that learns sigmoid masks based on the fixed CLIP face features to extract expression features. To further improve the generalization ability on unseen test sets, we separate the channels of the learned masked features according to the expression classes to directly generate logits and avoid using the FC layer to reduce overfitting. We also introduce a channel-diverse loss to make the learned masks separated. Extensive experiments on five different FER datasets verify that our method outperforms SOTA FER methods by large margins. Code is available in the Supp. material.
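
A minimal sketch of the sigmoid-mask-plus-channel-grouping idea, with illustrative dimensions (the feature size, class count, and module name here are assumptions, not the authors' configuration): a learned sigmoid mask gates frozen face features, and the masked channels are split into equal per-class groups whose means serve directly as logits, avoiding an FC classifier.

```python
import torch
import torch.nn as nn

class MaskedFER(nn.Module):
    def __init__(self, feat_dim: int = 700, num_classes: int = 7):
        super().__init__()
        assert feat_dim % num_classes == 0
        self.mask_logits = nn.Parameter(torch.zeros(feat_dim))  # learned sigmoid mask
        self.num_classes = num_classes

    def forward(self, frozen_feats: torch.Tensor) -> torch.Tensor:
        masked = frozen_feats * torch.sigmoid(self.mask_logits)     # (B, D) gated features
        groups = masked.view(masked.size(0), self.num_classes, -1)  # (B, C, D/C) per-class channels
        return groups.mean(dim=-1)                                  # (B, C) logits, no FC layer

model = MaskedFER()
feats = torch.randn(4, 700)   # stand-in for frozen face features (dimension chosen for divisibility)
logits = model(feats)         # (4, 7)
```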


# 281
Strong Double Blind
Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding

Minh Tran · Yelin Kim · Che-Chun Su · Min Sun · Cheng-Hao Kuo · Mohammad Soleymani

Self-supervised learning methods have demonstrated impressive performance across visual understanding tasks, including human behavior understanding. However, there has been limited work on self-supervised learning for egocentric social videos. Visual processing in such contexts faces several challenges, including noisy input, limited availability of egocentric social data, and the absence of pretrained models tailored to egocentric contexts. We propose Ex2Eg, a novel framework that leverages novel-view face synthesis for dynamic-perspective data augmentation from abundant exocentric videos and enhances the self-supervised learning process of VideoMAE via: 1) reconstructing exocentric videos from masked dynamic-perspective videos; and 2) predicting feature representations of a teacher model based on the corresponding exocentric frames. Experimental results demonstrate that Ex2Eg consistently excels across diverse social role understanding tasks. It achieves state-of-the-art results in Ego4D's Talk-to-me challenge (+0.7% mAP, +3.2% Accuracy). For the Look-at-me challenge, it achieves competitive performance with the state of the art (-0.7% mAP, +1.5% Accuracy) without supervised training on external data. On the EasyCom dataset, our method surpasses both supervised Active Speaker Detection approaches and state-of-the-art video encoders (+1.2% mAP, +1.9% Accuracy compared to MARLIN).


# 211
MinD-3D: Reconstruct High-quality 3D objects in Human Brain

Jianxiong Gao · Yuqian Fu · Yun Wang · Xuelin Qian · Jianfeng Feng · Yanwei Fu

In this paper, we introduce Recon3DMind, an innovative task aimed at reconstructing 3D visuals from Functional Magnetic Resonance Imaging (fMRI) signals, marking a significant advancement in the fields of cognitive neuroscience and computer vision. To support this pioneering task, we present the fMRI-Shape dataset, which includes data from 14 participants and features 360-degree videos of 3D objects to enable comprehensive fMRI signal capture across various settings, thereby laying a foundation for future research. Furthermore, we propose MinD-3D, a novel and effective three-stage framework specifically designed to decode the brain’s 3D visual information from fMRI signals, demonstrating the feasibility of this challenging task. The framework begins by extracting and aggregating features from fMRI frames through a neuro-fusion encoder, subsequently employs a feature bridge diffusion model to generate visual features, and ultimately recovers the 3D object via a generative transformer decoder. We assess the performance of MinD-3D using a suite of semantic and structural metrics and analyze the correlation between the features extracted by our model and the visual regions of interest (ROIs) in fMRI signals. Our findings indicate that MinD-3D not only reconstructs 3D objects with high semantic relevance and spatial similarity but also significantly enhances our understanding of the human brain’s capabilities in processing 3D visual information. Project page at: https://jianxgao.github.io/MinD-3D.


# 228
Strong Double Blind
Pathformer3D: A 3D Scanpath Transformer for 360° Images

Rong Quan · yantao Lai · Mengyu Qiu · Dong Liang

Scanpath prediction in 360° images can help realize rapid rendering and better user interaction in Virtual/Augmented Reality applications. However, existing scanpath prediction models for 360° images execute scanpath prediction on the 2D equirectangular projection plane, which often results in large computation errors owing to the 2D plane's distortion and coordinate discontinuity. In this work, we perform scanpath prediction for 360° images in a 3D spherical coordinate system and propose a novel 3D scanpath Transformer named Pathformer3D. Specifically, a 3D Transformer encoder is first used to extract a 3D contextual feature representation for the 360° image. Then, the contextual feature representation and historical fixation information are input into a Transformer decoder to output the current time step's fixation embedding, where the self-attention module is used to imitate the visual working memory mechanism of the human visual system and directly model the time dependencies among fixations. Finally, a 3D Gaussian distribution is learned from each fixation embedding, from which the fixation position can be sampled. Evaluation on four panoramic eye-tracking datasets demonstrates that Pathformer3D outperforms the current state-of-the-art methods.


# 248
Eliminating Warping Shakes for Unsupervised Online Video Stitching

Lang Nie · Chunyu Lin · Kang Liao · Yun Zhang · Shuaicheng Liu · Rui Ai · Yao Zhao

In this paper, we retarget video stitching to an emerging issue, named warping shake, when extending image stitching to video stitching. It unveils the temporal instability of warped content in non-overlapping regions, even though image stitching has endeavored to preserve the natural structures. Therefore, in most cases, even if the input videos to be stitched are stable, the stitched video will inevitably cause undesired warping shakes and affect the visual experience. To eliminate the shakes, we propose StabStitch to simultaneously realize video stitching and video stabilization in a unified unsupervised learning framework. Starting from the camera paths in video stabilization, we first derive the expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Then a warp smoothing model is presented to optimize them with a comprehensive consideration regarding content alignment, trajectory smoothness, spatial consistency, and online collaboration. To establish an evaluation benchmark and train the learning framework, we build a large-scale video stitching dataset with a rich diversity in camera motions and scenes. Compared with existing stitching solutions, StabStitch exhibits significant superiority in scene robustness and inference speed in addition to stitching and stabilization performance, contributing to a robust and real-time online video stitching system. The codes and dataset will be available.


# 116
OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework

Wanyun Li · Pinxue Guo · Xinyu Zhou · Lingyi Hong · Yangji He · Xiangyu Zheng · Wei Zhang · Wenqiang Zhang

Contemporary Video Object Segmentation (VOS) approaches typically consist of stages for feature extraction, matching, memory management, and multiple-object aggregation. Recent advanced models either employ discrete modeling of these components in a sequential manner, or optimize a combined pipeline through substructure aggregation. However, these explicit staged approaches prevent the VOS framework from being optimized as a unified whole, leading to limited capacity and suboptimal performance on complex videos. In this paper, we propose OneVOS, a novel framework that unifies the core components of VOS with an All-in-One Transformer. Specifically, to unify all the aforementioned modules into a vision transformer, we model all the features of frames, masks, and memory for multiple objects as transformer tokens, and integrally accomplish feature extraction, matching, and memory management of multiple objects through the flexible attention mechanism. Furthermore, a Unidirectional Hybrid Attention is proposed, through a double decoupling of the original attention operation, to rectify semantic errors and ambiguities of stored tokens in the OneVOS framework. Finally, to alleviate the storage burden and expedite inference, we propose the Dynamic Token Selector, which unveils the working mechanism of OneVOS and naturally leads to a more efficient version of OneVOS. Extensive experiments demonstrate the superiority of OneVOS, achieving state-of-the-art performance across 7 datasets, particularly excelling on the complex LVOS and MOSE datasets with 70.1% and 66.4% J&F scores, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively. Our code will be available for reproducibility and further research.


# 268
Strong Double Blind
Semantically Guided Representation Learning For Action Anticipation

Anxhelo Diko · Danilo Avola · Bardh Prenkaj · Federico Fontana · Luigi Cinque

Action anticipation is the task of forecasting future activity from a partially observed sequence of events. However, this task is exposed to intrinsic future uncertainty and the difficulty of reasoning upon interconnected actions. Unlike previous works that focus on extrapolating better visual and temporal information, we concentrate on learning action representations that are aware of their semantic interconnectivity based on prototypical action patterns and contextual co-occurrences. To this end, we propose the novel Semantically Guided Representation Learning (S-GEAR) framework. S-GEAR learns visual action prototypes and leverages language models to structure their relationship, inducing semanticity. To gather insights on S-GEAR's effectiveness, we experiment on four action anticipation benchmarks, obtaining improved results compared to previous works: +3.5, +2.7, and +3.5 on Top-1 accuracy on Epic-Kitchen 55, EGTEA Gaze+ and 50 Salads, respectively, and +0.8 on Top-5 Recall on Epic-Kitchens 100. We further observe that S-GEAR effectively transfers the geometric associations between actions from language to visual prototypes. Finally, by exploring the intricate impact of action semantic interconnectivity, S-GEAR opens new research frontiers in anticipation tasks.


# 256
Strong Double Blind
SIGMA: Sinkhorn-Guided Masked Video Modeling

Mohammadreza Salehi · Michael Dorkenwald · Fida Mohammad Thoker · Efstratios Gavves · Cees Snoek · Yuki M Asano

Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations improving upon state-of-the-art methods.
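
A hedged sketch of the Sinkhorn-Knopp step that produces evenly distributed cluster-assignment targets, in the style of SwAV-like objectives; the hyperparameters and the surrounding training loop are assumptions, not the SIGMA release.

```python
import torch

@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    """scores: (B, K) feature-prototype similarities -> (B, K) soft assignment targets."""
    Q = torch.exp(scores / eps).t()      # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True)  # rows: equal total mass per prototype
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)  # columns: each sample's assignment sums to 1/B
        Q /= B
    return (Q * B).t()                   # rows sum to 1: targets for the prediction task

features = torch.nn.functional.normalize(torch.randn(64, 256), dim=-1)   # space-time tube features
prototypes = torch.nn.functional.normalize(torch.randn(32, 256), dim=-1) # learnable clusters
targets = sinkhorn(features @ prototypes.t())    # used as cross-prediction targets
```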


# 255
Strong Double Blind
Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Rui Qian · Shuangrui Ding · Dahua Lin

Image-to-video adaptation seeks to efficiently adapt image models for use in the video domain. Instead of finetuning the entire image backbone, many image-to-video adaptation paradigms use lightweight adapters for temporal modeling on top of the spatial module. However, these attempts are subject to limitations in efficiency and interpretability. In this paper, we propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective. Inspired by human perception, which identifies objects as key components for video understanding, we integrate a proxy task of object discovery into image-to-video transfer learning. Specifically, we adopt slot attention with learnable queries to distill each frame into a compact set of object tokens. These object-centric tokens are then processed through object-time interaction layers to model object state changes across time. Integrated with two novel object-level losses, we demonstrate the feasibility of performing efficient temporal reasoning solely on the compressed object-centric representations for video downstream tasks. Our method achieves state-of-the-art performance with fewer tunable parameters, only 5% of fully finetuned models and 50% of efficient tuning methods, on action recognition benchmarks. In addition, our model performs favorably in zero-shot video object segmentation without further retraining or object annotations, proving the effectiveness of object-centric video understanding.
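
A compact, simplified slot-attention sketch of how learnable queries can distill frame tokens into a small set of object tokens; the layer sizes, iteration count, and module name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots: int = 8, dim: int = 256, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q, self.to_k, self.to_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, N, D = tokens.shape
        tokens = self.norm_in(tokens)
        k, v = self.to_k(tokens), self.to_v(tokens)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete per token
            attn = attn / attn.sum(dim=-1, keepdim=True)                # weighted mean per slot
            updates = attn @ v                                          # (B, S, D)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
        return slots   # compact object tokens for one frame

frame_tokens = torch.randn(2, 196, 256)          # e.g., ViT patch tokens of one frame
object_tokens = SlotAttention()(frame_tokens)    # (2, 8, 256)
```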


# 261
Strong Double Blind
RICA^2: Rubric-Informed, Calibrated Assessment of Actions

Abrar Majeedi · Viswanatha Reddy Gajjala · Satya Sai Srinath Namburi GNVV · Yin Li

The ability to quantify how well an action is carried out, also known as action quality assessment (AQA), is widely studied across scientific disciplines due to its broad range of applications. Therefore, there has been a surging interest in the vision community to develop video-based AQA. Unfortunately, prior methods often ignore the score rubric used by human experts and fall short of quantifying the uncertainty of the model prediction. To bridge the gap, we present RICA^2, a deep probabilistic model that integrates the score rubric and accounts for prediction uncertainty in AQA. Central to our method are stochastic embeddings of action steps, defined on a graph structure that encodes the score rubric. The embeddings spread probabilistic density in the latent space and allow our method to represent model uncertainty. The graph encodes the scoring criteria, based on which the quality scores can be decoded. We demonstrate that our method establishes a new state of the art on public benchmarks including FineDiving, MTL-AQA, and JIGSAWS, with superior performance in score prediction and uncertainty calibration.


# 270
Strong Double Blind
VideoStudio: Generating Consistent-Content and Multi-Scene Videos

Fuchen Long · Zhaofan Qiu · Ting Yao · Tao Mei

The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for given prompts. Most existing works tackle the single-scene scenario with only one video event occurring in a single background. Extending to multi-scene video generation is nevertheless not trivial and necessitates carefully managing the logic between scenes while preserving a consistent visual appearance of key content across video scenes. In this paper, we propose a novel framework, namely VideoStudio, for consistent-content and multi-scene video generation. Technically, VideoStudio leverages Large Language Models (LLM) to convert the input prompt into a comprehensive multi-scene script that benefits from the logical knowledge learnt by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, as well as the camera movement. VideoStudio identifies the common entities throughout the script and asks the LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoStudio outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as conditioning and alignment signals to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoStudio outperforms SOTA video generation models in terms of visual quality, content consistency, and user preference.


# 262
Strong Double Blind
Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Minghang Zheng · Xinhao Cai · Qingchao Chen · Yuxin Peng · Yang Liu

Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on specific datasets for training, with high data collection costs, but exhibit poor generalization capability under cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free zero-shot Video Temporal Grounding (TFVTG) approach that leverages the ability of pre-trained large models. A naive baseline is to enumerate proposals in the video and use pre-trained visual language models (VLMs) to select the best proposal according to vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, making them struggle to (1) grasp the relationships and distinguish the temporal boundaries of multiple events within the same video, and (2) comprehend and be sensitive to the dynamic transition of events (the transition from one event to another) in the video. To address these issues, we first propose leveraging large language models (LLMs) to analyze the multiple sub-events contained in the query text as well as the temporal order and relationships between them. Secondly, we split each sub-event into dynamic transition and static status parts, and propose dynamic and static scoring functions using VLMs to better evaluate the relevance between the event and the description. Finally, for each sub-event description provided by the LLM, we use VLMs to locate the top-k proposals that are most relevant to the description and leverage the order and relationships between sub-events provided by the LLM to filter and integrate these proposals. Our method achieves the best performance on zero-shot video temporal grounding on the Charades-STA and ActivityNet Captions datasets without any training and demonstrates better generalization capabilities in cross-dataset and OOD settings.
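
A purely illustrative scoring sketch under assumed functional forms (not the exact TFVTG scoring functions): per-frame similarities to a sub-event description are combined into a "static" term (how well frames inside a candidate segment match, relative to frames outside) and a "dynamic" term (how strongly similarity rises entering the segment and falls leaving it).

```python
import numpy as np

def score_proposal(sims: np.ndarray, start: int, end: int) -> float:
    inside = sims[start:end].mean()
    outside = np.concatenate([sims[:start], sims[end:]]).mean() if end - start < len(sims) else 0.0
    static = inside - outside                              # frames inside should match better
    rise = sims[start] - sims[max(start - 1, 0)]           # similarity rises entering the segment
    fall = sims[end - 1] - sims[min(end, len(sims) - 1)]   # similarity falls after leaving it
    return static + 0.5 * (rise + fall)

sims = np.array([0.1, 0.12, 0.15, 0.6, 0.7, 0.65, 0.2, 0.1])   # toy per-frame similarities
proposals = [(0, 3), (3, 6), (5, 8)]
best = max(proposals, key=lambda p: score_proposal(sims, *p))
print(best)    # (3, 6): the segment where the described event actually occurs
```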


# 266
Strong Double Blind
EA-VTR: Event-Aware Video-Text Retrieval

Zongyang Ma · Ziqi Zhang · Yuxin Chen · Zhongang Qi · Chunfeng Yuan · Bing Li · Yingmin Luo · Xu LI · Qi Xiaojuan · Ying Shan · Weiming Hu

Understanding the content of events occurring in a video and their inherent temporal logic is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack sufficient event information, and the widely adopted video-level cross-modal contrastive learning also struggles to capture detailed and complex video-text event alignment. To address these challenges, we make improvements from both the data and model perspectives. In terms of pre-training data, we focus on supplementing the missing specific event content and event temporal transitions with the proposed event augmentation strategies. Based on the event-augmented data, we construct a novel Event-Aware Video-Text Retrieval model, i.e., EA-VTR, which achieves powerful video-text retrieval ability through superior video event awareness. EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events. Our method not only significantly outperforms existing approaches on multiple datasets for Text-to-Video Retrieval and Video Action Recognition tasks, but also demonstrates superior event content perception ability on Multi-event Video-Text Retrieval and Video Moment Retrieval tasks, as well as outstanding event temporal logic understanding ability on the Test of Time task.


# 267
Strong Double Blind
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Wufei Ma · Kai Li · Zhongshi Jiang · Moustafa Meshry · Qihao Liu · Huiyu Wang · Christian Haene · Alan Yuille

Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the content of natural videos? Standard video-text evaluations could be misleading, as many questions can be inferred merely from the objects and contexts in a single frame or from biases inherent in the datasets. In this paper, we aim to better assess the capabilities of current video-text models and understand their limitations. We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset. To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented data and are far behind human-level performance. In order to narrow the gap between video-text models and human performance on RCAD, we identify a key limitation of current contrastive approaches on video-text data and introduce LLM-teacher, a more effective approach to learn action semantics by leveraging knowledge obtained from a pretrained large language model. Experiments and analyses show that our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models. We will release our dataset upon acceptance.


# 269
FunQA: Towards Surprising Video Comprehension

Binzhu Xie · Sicheng Zhang · Zitang Zhou · Bo Li · Yuanhan Zhang · Jack Hessel · Jingkang Yang · Ziwei Liu

Surprising videos, e.g., funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) commonsense violations depicted in these videos. We introduce FunQA, a challenging video question-answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks, FunQA is built from counter-intuitive clips spanning a total of 24 video hours. Moreover, we propose FunMentor, an agent designed for Vision-Language Models (VLMs) that uses multi-turn dialogues to enhance models' understanding of counter-intuitiveness. Extensive experiments with existing VLMs demonstrate the effectiveness of FunMentor and reveal significant performance gaps for FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.


# 264
Strong Double Blind
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Yuxiao Chen · Kai Li · Wentao Bao · Deep Patel · Yu Kong · Martin Renqiang Min · Dimitris N. Metaxas

Learning to localize the temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for alignment noise, i.e., narrations irrelevant to the instructional task in the video and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy. The key idea is to measure the alignment between LLM-steps and videos via multiple pathways, including: (1) step-narration-video alignment using narration timestamps, (2) direct step-to-video alignment based on their long-term semantic similarity, and (3) direct step-to-video alignment focusing on short-term fine-grained semantic similarity learned from general video domains. The results from the different pathways are fused to generate reliable pseudo step-video matching. We conducted extensive experiments across various tasks and problem settings to evaluate our proposed method. Our approach surpasses state-of-the-art methods in three downstream tasks: procedure step grounding, step localization, and narration grounding, by 5.9%, 3.1%, and 2.8%, respectively.
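
A minimal fusion sketch, with a hypothetical threshold and uniform averaging (the actual MPTVA fusion rule is not reproduced here): each pathway contributes a step-to-segment similarity matrix, and the fused matrix is thresholded into pseudo step-video matches.

```python
import numpy as np

def fuse_pathways(pathway_sims, threshold: float = 0.6):
    """pathway_sims: list of (num_steps, num_segments) similarity matrices in [0, 1]."""
    fused = np.mean(np.stack(pathway_sims, axis=0), axis=0)   # average the pathways
    pseudo_matches = fused >= threshold                       # boolean pseudo-alignment targets
    return fused, pseudo_matches

rng = np.random.default_rng(0)
narration_path = rng.random((5, 20))    # step -> video via narration timestamps
longterm_path = rng.random((5, 20))     # step -> video via long-term semantics
shortterm_path = rng.random((5, 20))    # step -> video via short-term fine-grained features
fused, matches = fuse_pathways([narration_path, longterm_path, shortterm_path])
```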


# 308
Efficient Pre-training for Localized Instruction Generation of Procedural Videos

Anil Batra · Davide Moltisanti · Laura Sevilla-Lara · Marcus Rohrbach · Frank Keller

Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve & Swap, to automatically generate high quality training data for the recipe domain: (i) Sieve filters irrelevant transcripts and (ii) Swap acquires high quality text by replacing transcripts with human-written instruction from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve & Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. Our code will be publicly released.


# 265
Strong Double Blind
Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality

Kyu Ri Park · Hong Joo Lee · Jung Uk Kim

Current research on Audio-Visual Question Answering (AVQA) tasks typically requires complete visual and audio input to effectively understand scenes and answer the questions (text). However, in real-world scenarios, problems such as device malfunction or transmission errors are common, resulting in the occasional absence of audio or visual information. Such scenarios significantly degrade the performance of AVQA networks. To address these challenges, we propose a novel AVQA framework that effectively handles missing modalities and provides answers to questions even when audio or visual information is missing. Our framework proposes a Relation-aware Missing Modal (RMM) generator, inspired by human sensory association, to create pseudo features that retrieve missing modality information by correlating available modality cues. Then, we introduce an Audio-Visual Relation-aware (AVR) diffusion model to improve both the overall audio-visual feature representations (missing modality pseudo feature and original modality feature) by considering the associations between them. As a result, our approach outperforms state-of-the-art AVQA works, even in the cases where audio or visual modalities are missing. We believe that our method enables realistic studies in AVQA networks and has the potential for application in various multi-modal scenarios. The code will be made publicly available.


# 272
Strong Double Blind
Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Yaoting Wang · Peiwen Sun · Yuanchao Li · Honggang Zhang · Di Hu

The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, it is recognized that previous AVS methods show a heavy reliance on detrimental segmentation preferences related to audible objects, rather than precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, leading to representative sounding object features. These features not only encompass audio cues but also possess vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets.


# 104
Strong Double Blind
Visual Alignment Pre-training for Sign Language Translation

Peiqi Jiao · Yuecong Min · Xilin CHEN

Sign Language Translation (SLT) aims to translate sign videos into spoken sentences. While gloss sequences, the written approximation of sign videos, provide informative alignment supervision for visual representation learning in SLT, the associated high cost of gloss annotations hampers scalability. Recent works have yet to achieve satisfactory results without gloss annotations. In this study, we attribute the challenge to the flexible correspondence between visual and textual tokens, and aim to address it by constructing a gloss-like constraint from spoken sentences. Specifically, we propose a Visual Alignment Pre-training (VAP) scheme to exploit visual information by aligning visual and textual tokens in a greedy manner. The VAP scheme enhances the visual encoder's ability to capture semantic-aware visual information and facilitates better adaptation with pre-trained translation modules. Experimental results across four SLT benchmarks demonstrate the effectiveness of the proposed method, which not only generates reasonable alignments but also significantly narrows the performance gap with gloss-based methods.
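
A hedged sketch of greedy monotonic alignment between textual and visual tokens, which is one simple way to realize the "greedy manner" described above; the similarity values and the monotonicity rule are illustrative assumptions, not the VAP training code.

```python
import numpy as np

def greedy_monotonic_align(sim: np.ndarray):
    """sim: (num_text_tokens, num_visual_tokens) cosine similarities -> aligned visual indices."""
    assignments, start = [], 0
    for t in range(sim.shape[0]):
        j = start + int(np.argmax(sim[t, start:]))   # best visual token from `start` onward
        assignments.append(j)
        start = j                                    # keep the alignment monotonic in time
    return assignments

sim = np.array([[0.9, 0.2, 0.1, 0.0],
                [0.1, 0.3, 0.8, 0.2],
                [0.0, 0.1, 0.4, 0.7]])
print(greedy_monotonic_align(sim))   # [0, 2, 3]
```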


# 142
Strong Double Blind
Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Shizhou Zhang · Wenlong Luo · De Cheng · Qingchun Yang · Lingyan Ran · Yinghui Xing · Yanning Zhang

In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across the platforms, we also devise the platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and our proposed G2A-VReID dataset.


# 138
Strong Double Blind
Spectral Subsurface Scattering for Material Classification

Haejoon Lee · Aswin C. Sankaranarayanan

This study advances material classification using Spectral Sub-Surface Scattering (S4) measurements. While spectrum and subsurface scattering measurements have, individually, been used extensively in material classification, we argue that the strong spectral dependence of subsurface scattering lends itself to highly discriminative features. However, obtaining S4 measurements requires a time-consuming hyperspectral scan. We avoid this by showing that a carefully chosen 2D projection of the S4 point spread function is sufficient for material estimation; specifically, we show that the parameters defining a physics model for S4 can be estimated from this 2D projection. We also design and implement a novel imaging setup, consisting of a point-array illumination and a spectrally-dispersing camera, to make the 2D projections. Through comprehensive experiments, we demonstrate the superiority of S4 imaging over spectral and sub-surface scattering measurements.


# 31
MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

Vishal Nedungadi · Ankit Kariryaa · Stefan Oehmcke · Serge Belongie · Christian Igel · Nico Lang

The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks including image classification and semantic segmentation. We find that pretraining with multi-modal pretext tasks notably improves the linear probing performance compared to pretraining on optical satellite images only. This also leads to better label efficiency and parameter efficiency which are crucial aspects in global scale applications.


# 162
MeshVPR: Citywide Visual Place Recognition Using 3D Meshes

Gabriele Berton · Lorenz Junglas · Riccardo Zaccone · Thomas Pollok · Barbara Caputo · Carlo Masone

Mesh-based scene representation offers a promising direction for simplifying large-scale hierarchical visual localization pipelines, which combine a visual place recognition step based on global features (retrieval) and a visual localization step based on local features. While existing work demonstrates the viability of meshes for visual localization, the impact of using synthetic databases rendered from them in visual place recognition remains largely unexplored. In this work, we investigate using dense 3D textured meshes for large-scale Visual Place Recognition (VPR) and identify a significant performance drop when using synthetic mesh-based databases compared to real-world images for retrieval. To address this, we propose MeshVPR, a novel VPR pipeline that utilizes a lightweight features alignment framework to bridge the gap between real-world and synthetic domains. MeshVPR leverages pre-trained VPR models and is efficient and scalable for citywide deployments. We introduce novel datasets with freely available 3D meshes and manually collected queries from Berlin, Paris, and Melbourne. Extensive evaluations demonstrate that MeshVPR achieves competitive performance with standard VPR pipelines, paving the way for mesh-based localization systems. Our contributions include the new task of citywide mesh-based VPR, the new benchmark datasets, MeshVPR, and a thorough analysis of open challenges.
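
A hedged sketch of a lightweight feature-alignment head: a small MLP maps descriptors of synthetic mesh renders toward the descriptor space of real photos using a cosine loss on paired views. The architecture, loss, and dimensions are assumptions for illustration, not the released MeshVPR code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignHead(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, synth_desc: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(synth_desc), dim=-1)

head = AlignHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-4)
real = F.normalize(torch.randn(32, 512), dim=-1)    # frozen VPR descriptors of real photos
synth = torch.randn(32, 512)                        # descriptors of the paired mesh renders
loss = (1 - (head(synth) * real).sum(-1)).mean()    # cosine alignment loss on positive pairs
loss.backward()
opt.step()
```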


# 144
Strong Double Blind
Frontier-enhanced Topological Memory with Improved Exploration Awareness for Embodied Visual Navigation

Xinru Cui · Qiming Liu · Zhe Liu · Hesheng Wang

We present a novel graph memory structure for navigation, called Frontier-enhanced Topological Memory (FTM). Most prior research primarily focuses on maintaining memory representations for explored areas. In contrast, our approach incorporates ghost nodes into the topological map to characterize unexplored but visible regions. The ghost nodes are generated using a geometric method and serve to indicate the geometrically explorable frontiers, thereby promoting the agent's exploration and target search. We employ an online-trained implicit representation method to predict perceptual features for ghost nodes based on previous observations. In addition, we develop a Multi-Stage Memory Extraction module (MSME) that can effectively utilize the FTM. It focuses particularly on task-specific information and generates actions end-to-end. By using FTM, the agent can improve its capacity for environmental cognition and memory utilization. We evaluate the proposed approach on visual navigation in the photo-realistic Gibson environment. Experimental results conclusively demonstrate that the proposed navigation framework with FTM boosts the agent's exploration awareness and enhances performance in image-goal navigation tasks.


# 289
Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

Yuan Chen · Zi-han Ding · Ziqin Wang · Yan Wang · Lijun Zhang · Si Liu

Despite real-time planners exhibiting remarkable performance in autonomous driving, the growing exploration of Large Language Models (LLMs) has opened avenues for enhancing the interpretability and controllability of motion planning. Nevertheless, LLM-based planners continue to encounter significant challenges, including elevated resource consumption and extended inference times, which pose substantial obstacles to practical deployment. In light of these challenges, we introduce AsyncDriver, a new asynchronous LLM-enhanced closed-loop framework designed to leverage scene-associated instruction features produced by the LLM to guide real-time planners in making precise and controllable trajectory predictions. On one hand, our method highlights the prowess of LLMs in comprehending and reasoning over vectorized scene data and a series of routing instructions, demonstrating their effective assistance to real-time planners. On the other hand, the proposed framework decouples the inference processes of the LLM and real-time planners. By capitalizing on the asynchronous nature of their inference frequencies, our approach has successfully reduced the computational cost introduced by the LLM, while maintaining comparable performance. Experiments show that our approach achieves superior closed-loop evaluation performance on nuPlan's challenging scenarios.


# 339
Strong Double Blind
Controllable Navigation Instruction Generation with Chain of Thought Prompting

Xianghao Kong · Jinyu Chen · Wenguan Wang · Hang Su · Xiaolin Hu · Yi Yang · Si Liu

Instruction generation is a vital and multidisciplinary research area with broad applications. Existing instruction generation models are limited to generating instructions in a single style from a particular dataset, and the style and content of generated instructions cannot be controlled. Moreover, most existing instruction generation methods also disregard the spatial modeling of the navigation environment. Leveraging the capabilities of Large Language Models (LLMs), we propose C-Instructor, which utilizes the chain-of-thought-style prompt for style-controllable and content-controllable instruction generation. Firstly, we propose a Chain of Thought with Landmarks (CoTL) mechanism, which guides the LLM to identify key landmarks and then generate complete instructions. CoTL renders generated instructions more accessible to follow and offers greater controllability over the manipulation of landmark objects. Furthermore, we present a Spatial Topology Modeling Task to facilitate the understanding of the spatial structure of the environment. Finally, we introduce a Style-Mixed Training policy, harnessing the prior knowledge of LLMs to enable style control for instruction generation based on different prompts within a single model instance. Extensive experiments demonstrate that instructions generated by C-Instructor outperform those generated by previous methods in text metrics, navigation guidance evaluation, and user studies. Our code will be released.
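
A purely illustrative prompt-construction sketch in the spirit of a chain-of-thought-with-landmarks prompt; the wording, observation format, and style handle are hypothetical and not the C-Instructor prompt.

```python
def build_cotl_prompt(path_observations: list[str], style: str) -> str:
    """Assemble a two-step prompt: identify landmarks first, then generate the instruction."""
    obs = "\n".join(f"- viewpoint {i}: {o}" for i, o in enumerate(path_observations))
    return (
        "You are generating navigation instructions.\n"
        f"Observations along the path:\n{obs}\n"
        "Step 1: List the key landmarks the agent passes, in order.\n"
        f"Step 2: Using those landmarks, write one {style}-style instruction "
        "that guides a person along the same path."
    )

prompt = build_cotl_prompt(
    ["a kitchen with a white counter", "a hallway with framed photos", "a bedroom door"],
    style="concise",
)
print(prompt)
```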


# 113
Strong Double Blind
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Gengze Zhou · Yicong Hong · Zun Wang · Xin Eric Wang · Qi Wu

Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reasoning and diverse language understanding. However, a significant discrepancy in agent performance is observed when integrating LLMs in the Vision-and-Language navigation (VLN) tasks compared to previous downstream specialist models. Furthermore, the inherent capacity of language to interpret and facilitate communication in agent interactions is often underutilized in these integrations. In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. By aligning visual content with a frozen LLM, we equip LLMs with visual observation comprehension and devise a way to combine LLMs with navigation policy networks for effective action prediction and navigational reasoning. We demonstrate the data efficiency of the proposed methods and eliminate the gap between LM-based agents and state-of-the-art VLN specialists.


# 141
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

Meng Chu · Zhedong Zheng · Wei Ji · Tingyu Wang · Tat-Seng Chua

Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process leveraging Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.


# 117
Strong Double Blind
INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding

Jiha Jang · Hoigi Seo · Se Young Chun

Affordance denotes the potential interactions inherent in objects. The perception of affordance can enable intelligent agents to navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance without costly pixel-level annotations, but with exocentric images. Although recent advances in weakly supervised affordance grounding yielded promising results, there remain challenges, including the requirement for paired exocentric and egocentric image datasets and the complexity of grounding diverse affordances for a single object. To address them, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior arts, INTRA recasts this problem as representation learning to identify unique features of interactions through contrastive learning with exocentric images only, eliminating the need for paired datasets. Moreover, we leverage vision-language model embeddings to perform affordance grounding flexibly with any text, designing text-conditioned affordance map generation to reflect interaction relationships for contrastive learning and enhancing robustness with text synonym augmentation. Our method outperformed prior arts on diverse datasets such as AGD20K, IIT-AFF, CAD and UMD. Additionally, experimental results demonstrate that our method has remarkable domain scalability for synthesized images and illustrations and is capable of performing affordance grounding for novel interactions and objects.


# 112
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia · Yixin Chen · Huangyue Yu · Yan Wang · Xuesong Niu · Tengyu Liu · Qing Li · Siyuan Huang

3D vision-language grounding, which aims to align language with 3D physical environments, stands as a cornerstone in developing embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces two significant challenges: (i) the scarcity of paired 3D vision-language data to support grounded learning of 3D scenes, especially considering complexities within diverse object configurations, rich attributes, and intricate relationships; and (ii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on existing 3D visual grounding and question-answering benchmarks. We also show that the data scale-up effect is not limited to GPS, but is generally beneficial for 3D models on 3D vision-language (3D-VL) tasks like semantic segmentation. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks.


# 321
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Keen You · Haotian Zhang · Eldon Schoop · Floris Weers · Amanda Swearngin · Jeff Nichols · Yinfei Yang · Zhe Gan

The recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we construct Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. We meticulously gathered training samples from an extensive range of fundamental UI tasks, such as icon recognition, text finding, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. Moreover, to augment the model's reasoning ability, we compile a dataset for advanced tasks inspired by Ferret, but with a focus on mobile screens. This methodology enables the training of Ferret-UI, a model that exhibits outstanding comprehension of UI screens and the ability to execute open-ended instructions, thereby facilitating UI operations. To rigorously evaluate its capabilities, we establish a comprehensive benchmark encompassing the aforementioned tasks. Ferret-UI not only outstrips most open-source UI MLLMs in performance but also achieves parity with GPT-4V, marking a significant advancement in the field.


# 121
Strong Double Blind
Quality Assured: Rethinking Annotation Strategies in Imaging AI

Tim Rädsch · Annika Reinke · Vivienn Weru · Minu D. Tizabi · Nicholas Heller · Fabian Isensee · Annette Kopp-Schneider · Lena Maier-Hein

This paper does not describe a novel method. Instead, it studies an essential foundation for reliable benchmarking and ultimately real-world application of AI-based image analysis: generating high-quality reference annotations. Previous research has focused on crowdsourcing as a means of outsourcing annotations. However, little attention has so far been given to annotation companies, specifically regarding their internal quality assurance (QA) processes. Therefore, our aim is to evaluate the influence of QA employed by annotation companies on annotation quality and devise methodologies for maximizing data annotation efficacy. Based on a total of 57,648 instance segmented images obtained from a total of 924 annotators and 34 QA workers from four annotation companies and Amazon Mechanical Turk (MTurk), we derived the following insights: (1) Annotation companies perform better both in terms of quantity and quality compared to the widely used platform MTurk. (2) Annotation companies' internal QA only provides marginal improvements, if any. However, improving labeling instructions instead of investing in QA can substantially boost annotation performance. (3) The benefit of internal QA depends on specific image characteristics. Our work could enable researchers to derive substantially more value from a fixed annotation budget and change the way annotation companies conduct internal QA.


# 304
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

Rizhao Cai · Zirui Song · DAYAN GUAN · Zhenhao Chen · Yaohang Li · Xing Luo · Chenyu Yi · Alex Kot

Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning on data in common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs toward three different styles: artistic image style, imaging sensor style, and application style. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) An LMM that performs better than another model in the common style is not guaranteed to perform better in other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) An intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations. We hope that our benchmark and analysis can shed new light on developing more intelligent and versatile LMMs. The benchmark and evaluation have been released on https://github.com/AIFEG/BenchLMM
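
Finding 3 above (prompting the model to predict the style first) can be sketched as a training-free, two-turn prompting wrapper. The `lmm` callable is a hypothetical stand-in for any multimodal model API, and the prompt wording is illustrative rather than taken from the BenchLMM repository.

```python
# Minimal sketch of the "predict the style first" prompting idea described
# above. `lmm` is a hypothetical callable (image, prompt) -> str standing in
# for any large multimodal model; it is not part of the BenchLMM codebase.
def ask_with_style_prompting(lmm, image, question):
    # Step 1: ask the model to name the image style (artistic, sensor, application, ...).
    style = lmm(image, "In one short phrase, what visual style is this image?")
    # Step 2: condition the actual question on the predicted style.
    prompt = (f"This image appears to be in the following style: {style}. "
              f"Taking that style into account, answer: {question}")
    return lmm(image, prompt)

if __name__ == "__main__":
    fake_lmm = lambda image, prompt: "infrared photo" if "style" in prompt else "a dog"
    print(ask_with_style_prompting(fake_lmm, image=None, question="What animal is shown?"))
```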


# 271
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

Cheng Tan · Jingxuan Wei · Zhangyang Gao · Linzhuang Sun · Siyuan Li · Ruifeng Guo · BiHui Yu · Stan Z. Li

Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions. Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework, separating rationale generation from answer inference. However, these approaches often fall short due to the inadequate quality of the generated rationales. In this work, we delve into the importance of rationales in model reasoning. We observe that when rationales are completely accurate, the model's accuracy significantly improves, highlighting the need for high-quality rationale generation. Motivated by this, we propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers, subsequently selecting the most accurate through a voting process. This approach not only enhances the quality of generated rationales but also leads to more accurate and robust answers. Through extensive experiments, we demonstrate that our approach significantly improves model performance across various benchmarks. Remarkably, we show that even smaller base models, when equipped with our proposed approach, can achieve results comparable to those of larger models, illustrating the potential of our approach in harnessing the power of rationales for improved multimodal reasoning.
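
A minimal sketch of the self-consistency voting step described above: sample several rationale/answer pairs, take the majority answer, and keep the rationales that support it. The `generate` sampler is a hypothetical callable; this illustrates the voting strategy only, not the authors' training pipeline.

```python
from collections import Counter

# Minimal sketch of majority voting over sampled rationale/answer pairs,
# assuming a hypothetical stochastic sampler
# `generate(question) -> (rationale, answer)` (e.g., decoding with temperature > 0).
def mc_cot_answer(generate, question, num_samples=8):
    samples = [generate(question) for _ in range(num_samples)]
    votes = Counter(answer for _, answer in samples)
    best_answer, _ = votes.most_common(1)[0]
    # Keep only rationales that support the majority answer; these are the
    # higher-quality rationales the abstract argues drive better reasoning.
    rationales = [r for r, a in samples if a == best_answer]
    return best_answer, rationales

if __name__ == "__main__":
    import random
    toy = lambda q: (("because 2+2=4", "4") if random.random() < 0.7 else ("guess", "5"))
    print(mc_cot_answer(toy, "What is 2+2?"))
```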


# 53
Strong Double Blind
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis

Xiang Liu · Zhaoxiang Liu · Huan Hu · Zezhou Chen · Kohou Wang · Kai Wang · Shiguo Lian

While conversational generative AI has shown considerable potential in enhancing decision-making for agricultural professionals, its exploration has predominantly been anchored in text-based interactions. The evolution of multimodal conversational AI, leveraging vast amounts of image-text data from diverse sources, marks a significant stride forward. However, the application of such advanced vision-language models in the agricultural domain, particularly for crop disease diagnosis, remains underexplored. In this work, we present the crop disease domain multimodal (CDDM) dataset, a pioneering resource designed to advance the field of agricultural research through the application of multimodal learning techniques. The dataset comprises 137,000 images of various crop diseases, accompanied by 1 million question-answer pairs that span a broad spectrum of agricultural knowledge, from disease identification to management practices. By integrating visual and textual data, CDDM facilitates the development of sophisticated question-answering systems capable of providing precise, useful advice to farmers and agricultural professionals. We demonstrate the utility of the dataset by finetuning state-of-the-art multimodal models, showcasing significant improvements in crop disease diagnosis. Specifically, we employed a novel finetuning strategy that utilizes low-rank adaptation (LoRA) to finetune the visual encoder, adapter and language model simultaneously. Our contributions include not only the dataset but also a finetuning strategy and a benchmark to stimulate further research in agricultural technology, aiming to bridge the gap between advanced AI techniques and practical agricultural applications.


# 103
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan · Jaemin Cho · Elias Stengel-Eskin · Mohit Bansal

Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a “visual prompt”, where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data with visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer. CRG achieves substantial improvements in a wide variety of VL tasks: When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse region-based tasks such as recognition, math, and object relationship reasoning. We also show CRG’s applicability to spatial reasoning, with 10% improvement on What’sUp, as well as to compositional generalization – improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe – and to image-text alignment for generated images, where we improve by 8.4 AUROC and 6.8 F1 points on SeeTRUE. CRG also allows us to re-rank proposed regions in referring expression comprehension and phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an average gain of 3.2% in accuracy. Our analysis explores alternative masking strategies for CRG, empirically validating CRG’s design choices.
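
A minimal sketch of the contrastive scoring idea described above: score an answer with the visually prompted image, then subtract the score obtained after the prompted region is removed, factoring out what the model would say without the evidence. `log_prob` and `mask_region` are hypothetical helpers for illustration, not part of the CRG codebase.

```python
import math

# Minimal sketch of contrastive region guidance. `log_prob` is a hypothetical
# callable (image, question, answer) -> float giving a VLM's log-likelihood of
# an answer; `mask_region` blacks out the prompted box.
def mask_region(image, box):
    x0, y0, x1, y1 = box
    masked = [row[:] for row in image]        # copy the (toy) pixel grid
    for y in range(y0, y1):
        for x in range(x0, x1):
            masked[y][x] = 0
    return masked

def crg_score(log_prob, image, box, question, answer, alpha=1.0):
    with_region = log_prob(image, question, answer)
    without_region = log_prob(mask_region(image, box), question, answer)
    # Factor out the bias the model exhibits without the visual evidence.
    return with_region - alpha * without_region

if __name__ == "__main__":
    image = [[1] * 8 for _ in range(8)]
    box = (2, 2, 6, 6)
    # Toy scorer: the "model" is more confident when the image is brighter.
    toy_log_prob = lambda img, q, a: math.log(1e-6 + sum(map(sum, img)))
    print(crg_score(toy_log_prob, image, box, "what is highlighted?", "a square"))
```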


# 85
Strong Double Blind
DEAL: Disentangle and Localize Concept-level Explanations for VLMs

Tang Li · Mengmeng Ma · Xi Peng

Large pre-trained Vision-Language Models (VLMs) have become ubiquitous foundational components of other models and downstream tasks. Although powerful, our empirical results reveal that such models might not be able to identify fine-grained concepts. Specifically, the explanations of VLMs with respect to fine-grained concepts are entangled and mislocalized. To address this issue, we propose to DisEntAngle and Localize (DEAL) the concept-level explanations for VLMs without human annotations. The key idea is encouraging the concept-level explanations to be distinct while maintaining consistency with category-level explanations. We conduct extensive experiments and ablation studies on a wide range of benchmark datasets and vision-language models. Our empirical results demonstrate that the proposed method significantly improves the concept-level explanations of the model in terms of disentanglability and localizability. Surprisingly, the improved explainability alleviates the model's reliance on spurious correlations, which further benefits the prediction accuracy.


# 99
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Samuele Poppi · Tobia Poppi · Federico Cocchi · Marcella Cornia · Lorenzo Baraldi · Rita Cucchiara

Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.


# 106
FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Hang Hua · Jing Shi · Kushal Kafle · Simon Jenni · Daoan Zhang · John Collomosse · Scott Cohen · Jiebo Luo

Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite VLMs' impressive ability to perform complex reasoning, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs’ compositionality for aspect-based fine-grained text and image matching. In this task, the models need to predict the mismatched aspect phrases, identify the class of the aspect, and suggest their corrections for a given image and a text caption with 0 to 3 mismatched aspects. To evaluate the models’ performance on this new task, we propose a new evaluation metric named ITM-IoU for which our experiments show a high correlation to human evaluation. In addition, we also provide a comprehensive experimental analysis of existing mainstream VLMs, including fully supervised learning and in-context learning settings. We have found that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis as we might have expected. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.


# 102
Strong Double Blind
Instruction Tuning-free Visual Token Complement for Multimodal LLMs

Dongsheng Wang · Jiequan Cui · Miaoge Li · Wang Lin · Bo Chen · Hanwang Zhang

As the open community of large language models (LLMs) matures, multimodal LLMs (MLLMs) have promised an elegant bridge between vision and language. However, current research is inherently constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To this end, we propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features and thus improve response accuracy. Specifically, our VTC integrates text-to-image generation as a guide to identify the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens to enrich the original visual input. Moreover, an iterative strategy is further designed to extract more visual information by iteratively using the visual selector without any additional training. Notably, the training pipeline requires no additional image-text pairs, resulting in a desired instruction tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC. Codes are in the Appendix.


# 101
Strong Double Blind
IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models

Kai Huang · Hao Zou · Ye Xi · Bochen Wang · Zhen Xie · Liang Yu

Inspired by the remarkable achievements of Large Language Models (LLMs), Large Vision-Language Models (LVLMs) have likewise experienced significant advancements. However, the increased computational cost and token budget occupancy associated with lengthy visual tokens pose significant challenges to practical applications. Considering that not all visual tokens are essential to the final response, selectively pruning redundant visual tokens can effectively alleviate this challenge. In this paper, we present a novel Instruction-guided Visual Token Pruning (IVTP) approach for LVLMs, which is designed to strike a better balance between computational efficiency and performance. Specifically, a Group-wise Token Pruning (GTP) module based on attention rollout is integrated into the grouped transformer layer to achieve intra-group attention aggregation via residual connection, thereby improving the assessment of visual token importance, especially for LVLMs with a frozen visual encoder. We then extend the module to the LLM to further filter the visual tokens, retaining those pertinent to the current textual instructions, by introducing a semantically related pseudo CLS token that serves as a reference for token pruning. This two-stage token pruning mechanism permits a systematic and efficient reduction in the quantity of visual tokens while preserving essential visual information. We apply the proposed method to the most representative LVLM, i.e., LLaVA-1.5. Experimental results demonstrate that when the number of visual tokens is reduced by 88.9%, the computational complexity is decreased by over 46%, with only an average 1.0% accuracy drop across 12 benchmarks, and remarkably surpasses state-of-the-art token pruning methods. It is worth noting that the proposed method can also work without requiring retraining, thus enabling it to serve as a plug-in across a broader range of LVLMs. Code and trained weights will be available.
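
The first pruning stage rests on attention rollout; a minimal sketch of rollout-based token scoring followed by top-k selection is given below. The shapes, the residual-averaged rollout, and the CLS-based scoring are standard assumptions for illustration, not the authors' exact GTP module.

```python
import numpy as np

# Minimal sketch of attention-rollout-based visual token pruning.
# `attn_maps` is a list of per-layer attention matrices of shape
# (tokens, tokens), already averaged over heads; token 0 is the CLS token.
def attention_rollout(attn_maps):
    n = attn_maps[0].shape[0]
    rollout = np.eye(n)
    for attn in attn_maps:
        # Add the residual connection and renormalise, as in standard rollout.
        attn = 0.5 * attn + 0.5 * np.eye(n)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        rollout = attn @ rollout
    return rollout

def prune_visual_tokens(tokens, attn_maps, keep_ratio=0.111):
    scores = attention_rollout(attn_maps)[0, 1:]          # CLS attention to patches
    k = max(1, int(round(keep_ratio * scores.size)))
    keep = np.sort(np.argsort(scores)[::-1][:k]) + 1      # most important patches
    return tokens[np.concatenate(([0], keep))]            # always keep CLS itself

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, layers = 17, 8, 4
    tokens = rng.normal(size=(n, d))
    attn = [rng.dirichlet(np.ones(n), size=n) for _ in range(layers)]
    print(prune_visual_tokens(tokens, attn, keep_ratio=0.25).shape)
```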


# 100
Strong Double Blind
LookupViT: Compressing visual information to a limited number of tokens

Rajat Koner · Gagan Jain · Sujoy Paul · Volker Tresp · Prateek Jain

Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse. In this work, we introduce LookupViT, which aims to exploit this information sparsity to reduce the cost of ViT inference. LookupViT provides a novel general purpose vision transformer block that operates by compressing information from higher resolution tokens to a fixed number of tokens. These few compressed tokens undergo meticulous processing, while the higher-resolution tokens are passed through computationally cheaper layers. Information sharing between these two token sets is enabled through a bidirectional cross-attention mechanism. The approach offers multiple advantages: (a) it is easy to implement on standard ML accelerators (GPUs/TPUs) via standard high-level operators, (b) it is applicable to standard ViT and its variants, thus generalizing to various tasks, and (c) it can handle different tokenization and attention approaches. LookupViT also offers flexibility for the compressed tokens, enabling performance-computation trade-offs in a single trained model. We demonstrate LookupViT's effectiveness on multiple domains: (a) image classification (ImageNet-1K and ImageNet-21K), (b) video classification (Kinetics-400 and Something-Something V2), and (c) image captioning (COCO-Captions) with a frozen encoder. LookupViT provides 2x reduction in FLOPs while upholding or improving accuracy across these domains. In addition, LookupViT also demonstrates out-of-the-box robustness on corrupted image classification (ImageNet-C), improving by more than 4% over ViT.
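
A minimal sketch of the two-token-set idea described above, assuming PyTorch: a small set of lookup tokens compresses the full-resolution tokens via cross-attention, is processed by a heavier layer, and writes information back, while the full-resolution tokens only receive a cheap update. Dimensions and layer choices are illustrative, not the published architecture.

```python
import torch
import torch.nn as nn

# Illustrative block with a compressed token set and bidirectional
# cross-attention; not the authors' implementation.
class LookupStyleBlock(nn.Module):
    def __init__(self, dim=256, num_lookup=16, heads=4):
        super().__init__()
        self.lookup = nn.Parameter(torch.randn(1, num_lookup, dim) * 0.02)
        self.compress = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.process = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                                  batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cheap = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, x):                        # x: (batch, tokens, dim)
        z = self.lookup.expand(x.size(0), -1, -1)
        z, _ = self.compress(z, x, x)            # compress: lookup <- image tokens
        z = self.process(z)                      # heavy compute on few tokens
        back, _ = self.broadcast(x, z, z)        # broadcast: image tokens <- lookup
        return x + self.cheap(back)              # cheap per-token update

if __name__ == "__main__":
    block = LookupStyleBlock()
    print(block(torch.randn(2, 196, 256)).shape)   # torch.Size([2, 196, 256])
```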


# 94
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models

Ziyi Lin · Dongyang Liu · Renrui Zhang · Peng Gao · Longtian Qiu · Han Xiao · Han Qiu · Wenqi Shao · Keqin Chen · Jiaming Han · Siyuan Huang · Yichi Zhang · Xuming He · Yu Qiao · Hongsheng LI

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, visual embeddings and image scales. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. We further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications, with highlighted fine-grained visual recognition abilities such as region-level understanding, caption grounding, document layout detection, and human pose estimation. We hope our work may cast a light on the exploration of joint mixing in future MLLM research.


# 80
Strong Double Blind
Integration of Global and Local Representations for Fine-grained Cross-modal Alignment

Seungwan Jin · Hoyoung Choi · Taehyung Noh · Kyungsik Han

Fashion is one of the representative domains of fine-grained Vision-Language Pre-training (VLP) involving a large number of images and text. Previous fashion VLP research has proposed various pre-training tasks to account for fine-grained details in multimodal fusion. However, fashion VLP research has not yet addressed the need to focus on (1) uni-modal embeddings that reflect fine-grained features and (2) hard negative samples to improve the performance of fine-grained V+L retrieval tasks. In this paper, we propose Fashion-FINE (Fashion VLP with Fine-grained Cross-modal Alignment using the INtegrated representations of global and local patch Embeddings), which consists of three key modules. First, a modality-agnostic adapter (MAA) learns uni-modal integrated representations and reflects fine-grained details contained in local patches. Second, hard negative mining with focal loss (HNM-F) performs cross-modal alignment using the integrated representations, focusing on hard negatives to boost the learning of fine-grained cross-modal alignment. Third, comprehensive cross-modal alignment (C-CmA) extracts low- and high-level fashion information from the text and learns the semantic alignment to encourage disentangled embedding of the integrated image representations. Fashion-FINE achieved state-of-the-art performance on two representative public benchmarks (i.e., FashionGen and FashionIQ) in three representative V+L retrieval tasks, demonstrating its effectiveness in learning fine-grained features.


# 325
Strong Double Blind
Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation

Peixi Xiong · Michael A Kozuch · Nilesh Jain

Text-to-image generation plays a pivotal role in computer vision and natural language processing by translating textual descriptions into visual representations. However, understanding complex relations in detailed text prompts filled with rich relational content remains a significant challenge. To address this, we introduce a novel task: Logic-Rich Text-to-Image (LRT2I) generation. Unlike conventional image generation tasks that rely on short and structurally simple natural language inputs, our task focuses on intricate text inputs abundant in relational information. To tackle these complexities, we collect the Textual-Visual Logic (TV-Logic) dataset, designed to evaluate the performance of text-to-image generation models across diverse and complex scenarios. Furthermore, we propose a baseline model as a benchmark for this task. Our model comprises three key components: a negative pair discriminator, a relation understanding module, and a multimodality fusion module. These components enhance the model's ability to handle disturbances in informative tokens and prioritize relational elements during image generation.


# 97
MyVLM: Personalizing VLMs for User-Specific Queries

Yuval Alaluf · Elad Richardson · Sergey Tulyakov · Kfir Aberman · Danny Cohen-Or

Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs. Code and data will be made available upon acceptance.


# 96
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen · Jinsong Li · Xiaoyi Dong · Pan Zhang · Conghui He · Jiaqi Wang · Feng Zhao · Dahua Lin

Modality alignment serves as the cornerstone for large multi-modal models (LMMs). However, the impact of different attributes (e.g., data type, quality, and scale) of training data on facilitating effective alignment is still under-explored. In this paper, we delve into the influence of training data on LMMs, uncovering three pivotal findings: 1) Highly detailed captions enable more nuanced vision-language alignment, significantly boosting the performance of LMMs in diverse benchmarks, surpassing outcomes from brief captions or VQA data; 2) Cutting-edge LMMs can be close to the captioning capability of costly human annotators, and open-source LMMs could reach similar quality after lightweight fine-tuning; 3) The performance of LMMs scales with the number of detailed captions, exhibiting remarkable improvements across a range from thousands to millions of captions. Drawing from these findings, we introduce the ShareGPT4V series for advanced modality alignment. It includes ShareGPT4V, consisting of 100K high-quality captions curated from GPT4-Vision; ShareGPT4V-PT, containing 1.2M captions produced by our Share-Captioner that can be close to the captioning capabilities of GPT4-Vision; and ShareGPT4V-7B, a simple yet superior LMM excelling in most multi-modal benchmarks, which realized better alignment based on our large-scale high-quality captions.


# 95
View Selection for 3D Captioning via Diffusion Ranking

Tiange Luo · Justin Johnson · Honglak Lee

This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which translates 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing errors. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the correlation between 3D objects and their 2D rendered views. This process improves caption accuracy and details by prioritizing views that are more representative of the object's characteristics. By combining it with GPT4-Vision, we mitigate caption hallucination, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for the visual question answering task, where it outperforms the CLIP model.


# 92
GRiT: A Generative Region-to-text Transformer for Object Understanding

Jialian Wu · Jianfeng Wang · Zhengyuan Yang · Zhe Gan · Zicheng Liu · Junsong Yuan · Lijuan Wang

This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as region-text pairs, where the region localizes objects and the text describes them. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate natural language for objects. With the same model architecture, GRiT describes objects via not only simple nouns, but also rich descriptive sentences. We define GRiT as open-set object understanding, as it has no limit on object description output from the model architecture perspective. Experimentally, we apply GRiT to dense captioning and object detection tasks. GRiT achieves new state-of-the-art dense captioning performance (15.5 mAP on Visual Genome) and competitive detection accuracy (60.4 AP on COCO test-dev).


# 303
FreestyleRet: Retrieving Images from Style-Diversified Queries

Hao Li · Yanhao Jia · Peng Jin · Zesen Cheng · Kehan Li · Jialu Sui · Chang Liu · Li Yuan

Image Retrieval aims to retrieve corresponding images based on a given query. In application scenarios, users intend to express their retrieval intent through various query styles. However, current retrieval tasks predominantly focus on text-query retrieval exploration, leading to limited retrieval query options and potential ambiguity or bias in user intention. In this paper, we propose the Style-Diversified Query-Based Image Retrieval task, which enables retrieval based on various query styles. To facilitate the novel setting, we propose the first Diverse-Style Retrieval dataset, encompassing diverse query styles including text, sketch, low-resolution, and art. We also propose a lightweight style-diversified retrieval framework. For various query style inputs, we apply the Gram Matrix to extract the query's textural features and cluster them into a style space with style-specific bases. Then we employ the style-init prompt tuning module to enable the visual encoder to comprehend the texture and style information of the query. Experiments demonstrate that our model, employing the style-init prompt tuning strategy, outperforms existing retrieval models on the style-diversified retrieval task. Moreover, style-diversified queries (sketch+text, art+text, etc.) can be simultaneously retrieved in our model. The auxiliary information from other queries enhances the retrieval performance within the respective query style.
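
A minimal sketch of the Gram-matrix texture descriptor mentioned above and of assigning a query to its nearest style basis. The random style bases and the cosine-similarity assignment are placeholders for illustration; in the paper the style space and its bases come from clustering query features.

```python
import torch

# Minimal sketch: extract a Gram-matrix texture descriptor from a feature map
# and pick the most similar style basis.
def gram_descriptor(feat):                  # feat: (channels, height, width)
    c = feat.shape[0]
    f = feat.reshape(c, -1)
    gram = f @ f.t() / f.shape[1]           # channel-by-channel correlations
    rows, cols = torch.triu_indices(c, c)   # keep the upper triangle as a vector
    return gram[rows, cols]

def nearest_style(feat, style_bases):
    d = torch.nn.functional.normalize(gram_descriptor(feat), dim=0)
    bases = torch.nn.functional.normalize(style_bases, dim=1)
    return int(torch.argmax(bases @ d))     # index of the most similar style

if __name__ == "__main__":
    torch.manual_seed(0)
    feat = torch.randn(8, 16, 16)           # a toy 8-channel feature map
    bases = torch.randn(4, 36)              # 4 style bases; 36 = upper triangle of 8x8
    print(nearest_style(feat, bases))
```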


# 108
Strong Double Blind
LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation

Pengwei Yin · Jingjing Wang · Guanzhong Zeng · Di Xie · Jiang Zhu

The ability of gaze estimation models to generalize is often significantly hindered by various factors unrelated to gaze, especially when the training dataset is limited. Current strategies aim to address this challenge through different domain generalization techniques, yet they have had limited success due to the risk of overfitting when solely relying on value labels for regression. Recent progress in pre-trained vision-language models has motivated us to capitalize on the abundant semantic information available. We propose a novel approach in this paper, reframing the gaze estimation task as a vision-language alignment issue. Our proposed framework, named Language-Guided Gaze Estimation (LG-Gaze), learns continuous and geometry-sensitive features for gaze estimation, benefiting from the rich prior knowledge of vision-language models. Specifically, LG-Gaze aligns gaze features with continuous linguistic features through our proposed multimodal contrastive regression loss, which customizes adaptive weights for different negative samples. Furthermore, to better adapt to the labels of the gaze estimation task, we propose a geometry-aware interpolation method to obtain more precise gaze embeddings. Through extensive experiments, we validate the efficacy of our framework in four different cross-domain evaluation tasks.


# 114
Strong Double Blind
OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

Yini Fang · Jingling Yu · Haozheng Zhang · Ralf van der Lans · Bertram E Shi

Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than spatial coordinates. This paper introduces the Object-level Attention Transformer (OAT) that predicts human scanpaths as they search for a target object within a cluttered scene of distractor objects. OAT uses an encoder-decoder architecture. The encoder captures information about the position and appearance of the objects within an image and about the target. The decoder predicts the gaze scanpath as a sequence of object fixations, by integrating output features from both the encoder and decoder. We also propose a new positional encoding that better reflects spatial relationships between objects. We evaluated OAT on the Amazon book cover dataset and a new dataset for visual search that we collected. OAT's predicted gaze scanpaths align more closely with human gaze patterns than those of algorithms based on spatial attention, as measured by established metrics and a novel behavior-based metric. Our results show the generalizability of OAT, as it accurately predicts human scanpaths for unseen layouts and target objects.


# 81
Strong Double Blind
Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks

Manyuan Zhang · Guanglu Song · Xiaoyu Shi · Yu Liu · Hongsheng LI

In this paper, we investigate how to conduct transfer learning to adapt Stable Diffusion to downstream visual dense prediction tasks such as object detection and semantic segmentation. We focus on fine-tuning the Stable Diffusion model, which has demonstrated impressive abilities in modeling image details and high-level semantics. Through our experiments, we have three key insights. Firstly, we demonstrate that for dense prediction tasks, the denoiser of Stable Diffusion can serve as a stronger feature encoder compared to visual-language models pre-trained with contrastive training (e.g., CLIP). Secondly, we show that the quality of extracted features is influenced by the diffusion sampling step $t$, sampling layer, cross-attention map, model generation capacity, and textual input. Features from Stable Diffusion UNet's upsampling layers and earlier denoising steps lead to more discriminative features for transfer learning to downstream tasks. Thirdly, we find that tuning the Stable Diffusion to downstream tasks in a parameter-efficient way is feasible. We search for the best protocol for effective tuning via reinforcement learning and finally achieve similar performance to full tuning by only tuning 0.81% of Stable Diffusion's parameters.
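
A minimal sketch of harvesting features from a denoiser's upsampling blocks with forward hooks, reflecting the finding above that these layers yield the most discriminative features. The `up_blocks` name filter follows the diffusers naming convention but is an assumption here; the demo uses a toy module rather than a real Stable Diffusion UNet.

```python
import torch

def collect_features(unet, inputs, name_filter="up_blocks"):
    """Run `unet` once and return the outputs of its top-level up blocks."""
    feats, handles = {}, []

    def make_hook(name):
        def hook(module, hook_inputs, output):
            feats[name] = output            # store this block's feature map
        return hook

    for name, module in unet.named_modules():
        if name_filter in name and name.count(".") == 1:
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            unet(*inputs)
    finally:
        for h in handles:
            h.remove()
    return feats

if __name__ == "__main__":
    class ToyUNet(torch.nn.Module):         # toy stand-in for a diffusion denoiser
        def __init__(self):
            super().__init__()
            self.up_blocks = torch.nn.ModuleList(
                [torch.nn.Conv2d(3, 3, 3, padding=1) for _ in range(2)])
        def forward(self, x):
            for blk in self.up_blocks:
                x = blk(x)
            return x

    feats = collect_features(ToyUNet(), (torch.randn(1, 3, 8, 8),))
    print({k: tuple(v.shape) for k, v in feats.items()})
```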


# 74
Strong Double Blind
TAG: Text Prompt Augmentation for Zero-Shot Out-of-Distribution Detection

Xixi Liu · Christopher Zach

Out-of-distribution (OOD) detection has been extensively studied for the reliable deployment of deep-learning models. Despite great progress in this research direction, most works focus on discriminative classifiers and perform OOD detection based on single-modal representations that consist of either visual or textual features. Moreover, they rely on training with in-distribution (ID) data. The emergence of vision-language models (e.g., CLIP) makes it possible to perform zero-shot OOD detection by leveraging multi-modal feature embeddings and therefore only rely on labels defining ID data. Several approaches have been devised but these either need a given OOD label set, which might deviate from real OOD data, or fine-tune CLIP, which potentially has to be done for different ID datasets. In this paper, we first adapt various OOD scores developed for discriminative classifiers to CLIP. Further, we propose an enhanced method named TAG, based on Text prompt AuGmentation, to amplify the separation between ID and OOD data, which is simple but effective, and can be applied on various score functions. Its performance is demonstrated on CIFAR-100 and large-scale ImageNet-1k OOD detection benchmarks. It consistently improves AUROC and FPR95 on CIFAR-100 across five commonly used architectures over four baseline OOD scores. The average AUROC and FPR95 improvements are 6.35% and 10.67%, respectively. The results for ImageNet-1k follow a similar, but less pronounced pattern.
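
A minimal sketch of a prompt-augmented, MSP-style zero-shot OOD score on CLIP-like embeddings, in the spirit of the text prompt augmentation described above. The templates are illustrative, and `encode_text` / `image_emb` are assumed to be user-supplied, L2-normalised CLIP components; this is not the authors' exact scoring function.

```python
import numpy as np

# Illustrative prompt templates; the real augmentation set is a design choice.
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]

def ood_score(image_emb, class_names, encode_text, temperature=0.01):
    # Average each class embedding over augmented prompts, then renormalise.
    class_embs = []
    for name in class_names:
        embs = np.stack([encode_text(t.format(name)) for t in TEMPLATES])
        mean = embs.mean(axis=0)
        class_embs.append(mean / np.linalg.norm(mean))
    logits = np.stack(class_embs) @ image_emb / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs.max())       # low max probability -> likely OOD

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_text = lambda s: (lambda v: v / np.linalg.norm(v))(rng.normal(size=512))
    img = rng.normal(size=512); img /= np.linalg.norm(img)
    print(ood_score(img, ["cat", "dog", "car"], fake_text))
```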


# 124
Strong Double Blind
Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation

Xu Zheng · Yuanhuiyi Lyu · jiazhou zhou · LIN WANG

Fusing an arbitrary number of modalities is vital for achieving robust multi-modal fusion of semantic segmentation yet remains less explored to date. Recent endeavors regard RGB modality as the center and the others as the auxiliary, yielding an asymmetric architecture with two branches. However, the RGB modality may struggle in certain circumstances, e.g., nighttime, while others, e.g., event data, own their merits; thus, it is imperative for the fusion model to discern robust and fragile modalities, and incorporate the most robust and fragile ones to learn a resilient multi-modal framework. To this end, we propose a novel method, named MAGIC, that can be flexibly paired with various backbones, ranging from compact to high-performance models. Our method comprises two key plug-and-play modules. Firstly, we introduce a multi-modal aggregation module to efficiently process features from multi-modal batches and extract complementary scene information. On top, a unified arbitrary-modal selection module is proposed to utilize the aggregated features as the benchmark to rank the multi-modal features based on the similarity scores. This way, our method can eliminate the dependence on RGB modality and better overcome sensor failures while ensuring the segmentation performance. Under the commonly considered multi-modal setting, our method achieves state-of-the-art performance while reducing the model parameters by 60%. Moreover, our method is superior in the novel modality-agnostic setting, where it outperforms prior arts by a large margin of +19.41% mIoU.
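
A minimal sketch of the selection idea described above: aggregate features across modalities, rank each modality by its similarity to the aggregate, and keep the top-ranked ones per sample. The mean-pooled aggregation and tensor shapes are assumptions for illustration, not the paper's modules.

```python
import torch

# Minimal sketch of "aggregate, then rank modalities by similarity to the
# aggregate" selection; shapes and the mean-pooled benchmark are assumptions.
def select_modalities(feats, keep=2):
    """feats: (num_modalities, batch, channels, H, W) feature maps."""
    aggregate = feats.mean(dim=0)                                       # (B, C, H, W)
    flat = torch.nn.functional.normalize(feats.flatten(2), dim=-1)      # (M, B, D)
    ref = torch.nn.functional.normalize(aggregate.flatten(1), dim=-1)   # (B, D)
    scores = (flat * ref.unsqueeze(0)).sum(dim=-1)                      # (M, B) cosine
    idx = scores.topk(keep, dim=0).indices                              # (keep, B)
    batch = torch.arange(feats.size(1))
    selected = torch.stack([feats[idx[k], batch] for k in range(keep)])
    return selected, idx                                                # (keep, B, C, H, W)

if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(4, 2, 16, 8, 8)     # e.g., RGB, depth, event, LiDAR (toy)
    selected, idx = select_modalities(feats, keep=2)
    print(selected.shape, idx)
```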


# 88
Strong Double Blind
Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents

MENGJUN CHENG · Chengquan Zhang · Chang Liu · Yuke Li · Bohan Li · Kun Yao · Xiawu Zheng · Rongrong Ji · Jie Chen

Current methodologies have achieved notable success in the closed-set visual information extraction task, while the exploration into open-vocabulary settings is comparatively underdeveloped, which is practical for individual users in terms of inferring information across documents of diverse types. Existing proposed solutions, including NER methods and LLM-based methods, fall short in processing the unlimited range of open-vocabulary keys and miss explicit layout modeling. This paper introduces a novel method for tackling the given challenge by transforming the process of categorizing text tokens into a task of locating regions based on given queries, also called textual grounding. Particularly, we take this a step further by pairing open-vocabulary key language embedding with corresponding grounded text visual embedding. We design a document-tailored grounding framework by incorporating layout-aware context learning and document-tailored two-stage pre-training, which significantly improves the model's understanding of documents. Our method outperforms current proposed solutions on the SVRD benchmark for the open-vocabulary VIE task, offering lower costs and faster inference speed. Specifically, our method infers 20x faster than the QwenVL model and achieves an improvement of about 24.3% in F-score.


# 76
Region-centric Image-Language Pretraining for Open-Vocabulary Detection

Dahun Kim · Anelia Angelova · Weicheng Kuo

We present a new open-vocabulary detection approach based on region-centric image-language pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we incorporate the detector architecture on top of the classification backbone, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from large-scale image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 37.6 mask APr using the common ViT-L backbone and public LAION dataset, significantly outperforming the best existing approach by +3.7 mask APr at system level. On the COCO benchmark, we achieve very competitive 39.6 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where it demonstrates notable improvement over the baseline. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline. Code and models will be publicly released.


# 123
Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

Djamahl Etchegaray · Zi Helen Huang · Tatsuya Harada · Yadan Luo

In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high costs associated with annotating new object classes. Our exploration of open-vocabulary (OV) learning in urban environments aims to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark a set of four potential solutions as baselines, categorizing them into either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects in 3D box estimation or applying rigorous priors, leading to biases towards objects near the camera or of rectangular geometries. To overcome these limitations, we introduce a universal Find n' Propagate approach for 3D OV tasks, aimed at maximizing the recall of novel objects and propagating this detection capability to more distant areas thereby progressively capturing more. In particular, we utilize a greedy box seeker to search against 3D novel boxes of varying orientations and depth in each generated frustum and ensure the reliability of newly identified boxes by cross alignment and density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances in the self-training process, combined with the fusion of base samples in the memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is made available in the supplementary material.


# 110
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Ozan Unal · Christos Sakaridis · Suman Saha · Luc Van Gool

3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation to tackle 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interactions, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues, next, we construct a contrastive training scheme to induce separation in the latent space, we then resolve view-dependent utterances via a learned global camera token, and finally we employ multi-view ensembling to improve referred mask quality. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark and has won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.


# 90
Strong Double Blind
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang · Ruohan Dong · Jiayi Ji · Yiwei Ma · Haowei Wang · Xiaoshuai Sun · Rongrong Ji

Recently, diffusion models have increasingly demonstrated their capabilities in vision understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase their ability to perform sentence-level localization, leaving the potential for leveraging contextual information for phrase-level understanding largely unexplored. In this paper, we utilize Panoptic Narrative Grounding (PNG) as a proxy task to investigate this capability further. PNG aims to segment object instances mentioned by multiple noun phrases within a given narrative text. Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps. The framework initially identifies anchor points using cross-attention mechanisms and subsequently performs segmentation with self-attention to achieve zero-shot PNG. Moreover, we introduce a refinement module based on SAM to enhance the quality of the segmentation masks. Our extensive experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting, conclusively proving the diffusion model's capability for context-aware, phrase-level understanding.


# 87
Strong Double Blind
Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

Seonghoon Yu · Paul Hongsuck Seo · Jeany Son

We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo-supervisions for referring image segmentation (RIS). These pseudo-supervisions allow the training of any supervised RIS methods without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, the naive incorporation of these models may generate non-distinctive expressions that do not distinctively refer to the target masks. To address this challenge, we propose two-fold strategies that generate distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model, to generate multiple expression candidates with detailed words focusing on the target, and 2) 'distinctiveness-based text filtering' to further validate the candidates and filter out those with a low level of distinctiveness. These two strategies ensure that the generated text supervisions can distinguish the target from other objects, making them appropriate for the RIS annotations. Our method significantly outperforms both weakly and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, proving its capability to tackle the open-world challenge within RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.
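
The distinctiveness-based text filtering step can be sketched as follows: keep a caption only if it matches the target region noticeably better than any other region in the image. The embeddings are assumed to be precomputed, L2-normalised CLIP-style vectors, and the margin is illustrative.

```python
import numpy as np

# Minimal sketch of distinctiveness-based caption filtering. `caption_embs`,
# `target_region_emb`, and `other_region_embs` are assumed to be L2-normalised
# CLIP-style text/image embeddings computed elsewhere.
def filter_distinctive_captions(caption_embs, target_region_emb,
                                other_region_embs, margin=0.05):
    keep = []
    for i, cap in enumerate(caption_embs):
        target_sim = float(cap @ target_region_emb)
        best_other = max(float(cap @ r) for r in other_region_embs)
        # A caption is distinctive only if it matches the target region
        # noticeably better than any other object in the image.
        if target_sim - best_other > margin:
            keep.append(i)
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unit = lambda v: v / np.linalg.norm(v)
    caps = [unit(rng.normal(size=64)) for _ in range(5)]
    target = unit(rng.normal(size=64))
    others = [unit(rng.normal(size=64)) for _ in range(3)]
    print(filter_distinctive_captions(caps, target, others, margin=0.0))
```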


# 98
Strong Double Blind
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

Weitai Kang · Gaowen Liu · Shah Mubarak · Yan Yan

Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box for each text-image data provides sparse supervision signals. Although previous works achieve impressive results, their passive utilization of annotation, i.e. the sole use of the box annotation as regression ground truth, results in a suboptimal performance. In this paper, we present SegVG, a novel method that transfers the box-level annotation into segmentation signals to provide additional pixel-level supervision for Visual Grounding. Specifically, we propose the Multi-layer Multi-task Encoder-Decoder as the target grounding stage, where we learn a regression query and multiple segmentation queries to ground the target by regression and segmentation of the box in each decoding layer, respectively. This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation. Moreover, as the backbones are typically initialized by pretrained parameters learned from unimodal tasks and the queries for both regression and segmentation are static learnable embeddings, a domain discrepancy remains among these three types of features, which impairs subsequent target grounding. To mitigate this discrepancy, we introduce the Triple Alignment module, where the query, text, and vision tokens are triangularly updated to share the same space via a triple attention mechanism. Extensive experiments on five widely used datasets validate our state-of-the-art (SOTA) performance.


# 93
PSALM: Pixelwise Segmentation with Large Multi-modal Model

Zheng Zhang · YeYao Ma · Enming Zhang · Xiang Bai

PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. To overcome the limitation of the LMM being limited to textual output, PSALM incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks. This schema includes images, task instructions, conditional prompts, and mask tokens, which enable the model to generate and classify segmentation masks effectively. The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization. PSALM achieves superior results on several benchmarks, such as RefCOCO/RefCOCO+/RefCOCOg, COCO Panoptic Segmentation, and COCO-Interactive, and further exhibits zero-shot capabilities on unseen tasks, such as open-vocabulary segmentation, generalized referring expression segmentation and video object segmentation, making a significant step towards a GPT moment in computer vision. Through extensive experiments, PSALM demonstrates its potential to transform the domain of image segmentation, leveraging the robust visual understanding capabilities of LMMs as seen in natural language processing.


# 83
Strong Double Blind
Grid-Attention: Enhancing Computational Efficiency of Large Vision Models without Fine-Tuning

Pengyu Li · Biao Wang · Tianchu Guo · Xian-Sheng Hua

Recently, transformer-based large vision models, e.g., the Segment Anything Model (SAM) and Stable Diffusion (SD), have achieved remarkable success in the computer vision field. However, the quartic complexity within the transformer's Multi-Head Attention (MHA) leads to substantial computational costs in these models whose inputs and outputs are high-resolution. Although several prior works attempted to alleviate this challenge, none have successfully reduced the complexity and latency of large vision models while preserving their remarkable capabilities without requiring enormous efforts and GPU hours to re-train or fine-tune the models. To address the challenge, we propose a simple yet effective plug-and-play transformer block called Grid-Attention (GridAttn). GridAttn integrates the proposed Grid Clustering module, Grid Distributing strategies, and Grid Recovering module with common MHA to enhance the large vision models' computational efficiency and preserve their performance without the need for re-training or fine-tuning their parameters. We conduct extensive experiments on recent high-resolution tasks, including zero-shot instance segmentation (SAM, Expedit-SAM), text-to-image generation (Stable Diffusion V2.1), and semantic segmentation (SegFormer B0-B5). The experiments demonstrate that, without any training or fine-tuning, GridAttn reduces GFLOPs by 4.6%-16.1% and GPU inference latency by 8.2%-21.4%, all while achieving equivalent performance (the performance bias ratio is less than 1%). Furthermore, the experiments show that GridAttn can also be trained from scratch or fine-tuned with very slight fine-tuning costs, resulting in a significantly improved performance-efficiency tradeoff. As a recommendation, we encourage the community to incorporate our GridAttn whenever deploying a well-trained transformer directly, fine-tuning a pre-trained one, or training a new one from scratch. The source code will be released.
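
To make the cluster-attend-recover pattern concrete, here is a hedged sketch in PyTorch (grid size, pooling choice, and the residual connection are assumptions, not the paper's exact modules):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridAttnSketch(nn.Module):
    """Cluster tokens onto a coarse grid, attend over the reduced set, recover full resolution."""

    def __init__(self, dim, heads=8, grid=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.grid = grid

    def forward(self, x, h, w):                                  # x: (B, h*w, C)
        b, _, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        pooled = F.adaptive_avg_pool2d(feat, self.grid)          # "grid clustering"
        tokens = pooled.flatten(2).transpose(1, 2)               # (B, grid*grid, C)
        out, _ = self.attn(tokens, tokens, tokens)               # MHA on far fewer tokens
        out = out.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        recovered = F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)
        return x + recovered.flatten(2).transpose(1, 2)          # "grid recovering" + residual
```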


# 82
OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation

Kwanyoung Kim · Yujin Oh · Jong Chul Ye

The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification. However, leveraging pre-trained CLIP knowledge to closely align text embeddings with pixel embeddings still has limitations in existing approaches. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads the introduced multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce the extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
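
The optimal-transport normalization underlying MPS can be sketched with plain Sinkhorn iterations (a generic sketch of the OT step, not the exact MPSA module; the marginals and temperature are assumptions):

```python
import torch

def sinkhorn_plan(sim, n_iters=5, eps=0.05):
    """Sinkhorn iterations on a prompt-to-pixel similarity matrix.

    sim: (P, N) similarities between P text prompts and N pixel embeddings.
    Returns a (P, N) transport plan in which different prompts focus on different pixels.
    """
    K = torch.exp(sim / eps)
    r = torch.full((K.size(0),), 1.0 / K.size(0))   # uniform prompt marginal
    c = torch.full((K.size(1),), 1.0 / K.size(1))   # uniform pixel marginal
    u = torch.ones_like(r)
    v = torch.ones_like(c)
    for _ in range(n_iters):
        u = r / (K @ v + 1e-8)
        v = c / (K.t() @ u + 1e-8)
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # diag(u) K diag(v)
```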


# 111
On the Viability of Monocular Depth Pre-training for Semantic Segmentation

DONG LAO · Fengyu Yang · Daniel Wang · Hyoungseob Park · Samuel Lu · Alex Wong · Stefano Soatto

The question of whether pre-training on geometric tasks is viable for downstream transfer to semantic tasks is important for two reasons, one practical and the other scientific. If the answer is positive, we may be able to reduce pre-training cost and bias from human annotators significantly. If the answer is negative, it may shed light on the role of embodiment in the emergence of language and other cognitive functions in evolutionary history. To frame the question in a way that is testable with current means, we pre-train a model on a geometric task, and test whether that can be used to prime a notion of “object” that enables inference of semantics as soon as symbols (labels) are assigned. We choose monocular depth prediction as the geometric task, and semantic segmentation as the downstream semantic task, and design a collection of empirical tests by exploring different forms of supervision, training pipelines, and data sources for both depth pre-training and semantic fine-tuning. We find that monocular depth IS a viable form of pre-training for semantic segmentation, validated by improvements over common baselines. Based on the findings, we propose several possible mechanisms behind the improvements, including their relation to dataset size, resolution, architecture, in/out-of-domain source data, and validate them through a wide range of ablation studies. We also find that optical flow, which at first glance may seem as good as depth prediction since it optimizes the same photometric reprojection error, is considerably less effective, as it does not explicitly aim to infer the latent structure of the scene, but rather the raw phenomenology of temporally adjacent images.


# 91
Strong Double Blind
Rethinking and Improving Visual Prompt Selection for In-Context Learning Segmentation Framework

Wei Suo · Lanqing Lai · Mengyang Sun · Hanwang Zhang · Peng Wang · Yanning Zhang

As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual prompts or simply apply similarity sorting to select contextual examples. In this paper, we focus on rethinking and improving the example selection strategy. By comprehensive comparisons, we first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation. Based on the above insights, we propose a new stepwise context search method. Different from previous works, we construct a small yet rich candidate pool and adaptively search the well-matched contexts. More importantly, this method effectively reduces the annotation cost by compacting the search space. Extensive experiments show that our method is an effective strategy for selecting examples and enhancing segmentation performance.


# 78
Open-Vocabulary Camouflaged Object Segmentation

Youwei Pang · Xiaoqi Zhao · JiaMing Zuo · Lihe Zhang · Huchuan Lu

Recently, the emergence of large-scale vision-language models (VLMs), such as CLIP, has opened the way towards open-world object perception. Many works have explored the utilization of pre-trained VLMs for the challenging open-vocabulary dense prediction task that requires perceiving diverse objects with novel classes at inference time. Existing methods construct experiments based on the public datasets of related tasks, which are not tailored for open vocabulary and rarely involve imperceptible objects camouflaged in complex scenes due to data collection bias and annotation costs. To fill in the gaps, we introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS), and construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes. Further, we build a strong single-stage open-vocabulary camouflaged object segmentation transformer baseline, OVCoser, attached to the parameter-fixed CLIP with iterative semantic guidance and structure enhancement. By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects. Moreover, this effective framework also surpasses previous state-of-the-art methods for open-vocabulary semantic image segmentation by a large margin on our OVCamo dataset. With the proposed dataset and baseline, we hope that this new task with more practical value can further expand the research on open-vocabulary dense prediction tasks. Our code and data are available at https://github.com/lartpang/OVCamo.


# 107
Strong Double Blind
From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation

Yunfei Xie · Cihang Xie · Alan Yuille · Jieru Mei

We present a hierarchical transformer-based model for image segmentation that effectively links the segmentation of detailed parts to the broader context of object segmentation. Central to our approach is a multi-level representation, progressing from individual pixels to superpixels, and finally to cohesive groups. This progression is characterized by two key aggregation strategies: local aggregation for forming superpixels and global aggregation for clustering these superpixels into group tokens. The formation of superpixels through local aggregation taps into the redundancy of image data, yielding segments that align with image parts under object-level supervision. Conversely, the global aggregation process assembles these superpixels into groups that demonstrate a tendency to align with whole objects, especially when guided by part-level supervision. This methodology achieves an optimal balance between adaptability to different types of supervision and computational efficiency, leading to notable advancements in the segmentation of both parts and objects. When evaluated on the PartImageNet dataset, our approach surpasses the previous state-of-the-art by 2.8% and 0.8% in part and object mIoU scores, respectively. Similarly, on the Pascal Part dataset, it demonstrates improvements of 1.5% and 2.0% for part and object mIoU, respectively.


# 125
Strong Double Blind
3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

Anh Thai · Weiyao Wang · Hao Tang · Stefan Stojanov · James Rehg · Matt Feiszli

3D object part segmentation is essential in computer vision applications. While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect. In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object part segmentation. We present our novel approach, termed 3-By-2 that achieves SOTA performance on different benchmarks with various granularity levels. By using features from pretrained foundation models and exploiting semantic and geometric correspondences, we are able to overcome the challenges of limited 3D annotations. Our approach leverages available 2D labels, enabling effective 3D object part segmentation. Our method 3-By-2 can accommodate various part taxonomies and granularities, demonstrating interesting part label transfer ability across different object categories. We will release code to the community.


# 73
Strong Double Blind
Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation

Bjoern Michele · Alexandre Boulch · Tuan Hung Vu · Gilles Puy · Renaud Marlet · Nicolas Courty

We tackle the problem of source-free unsupervised domain adaptation (SFUDA) for 3D semantic segmentation. This challenging problem amounts to performing domain adaptation on an unlabeled target domain without any access to source data. The only available information is a model trained to achieve good performance on the source domain. Our first analysis reveals a pattern that commonly occurs with all SFUDA procedures: performance degrades after some training time, which is a by-product of an under-constrained and ill-posed problem. We discuss two strategies to alleviate this issue. First, we propose a sensible way to regularize the learning problem. Second, we introduce a novel criterion based on agreement with a reference model. It is used (1) to stop the training and (2) as a validator to select hyperparameters. Our contributions are easy to implement and readily applicable to all SFUDA methods, ensuring stable improvements over all baselines. We validate our findings in various settings, achieving state-of-the-art performance.


# 63
Strong Double Blind
Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation

Shoumeng Qiu · Jie Chen · Xinrun Li · Ru Wan · Xiangyang Xue · Jian Pu

In this paper, we propose a novel knowledge distillation approach for the semantic segmentation task. Different from previous methods that rely on powerful pre-trained teachers or other modalities to provide additional knowledge, our approach does not require complex teacher models or information from extra sensors. Specifically, for the teacher model training, we propose to noise the label and then incorporate it into the input to effectively boost the lightweight teacher performance. To ensure the robustness of the teacher model to the noise, we propose an effective dual-path consistency training strategy with a distance loss between the outputs of two paths. For the student model training, we keep it consistent with the standard distillation for simplicity. Our approach can effectively improve the performance of knowledge distillation and offers more flexibility in the choice of models between teachers and students. Extensive experiments on five challenging datasets (Cityscapes, ADE20K, PASCAL-VOC, COCO-Stuff 10K, and COCO-Stuff 164K) and five popular models (FCN, PSPNet, DeepLabV3, STDC, and OCRNet) demonstrate the effectiveness and generalization of our approach. We believe that incorporating the label into the input as shown in our work will bring insights into related fields. The code is in the supplementary materials and will be released publicly upon acceptance.
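
One plausible way to realize the "noised label as extra input" idea for the teacher is sketched below (the noising scheme and channel concatenation are assumptions; the paper's exact recipe may differ):

```python
import torch
import torch.nn.functional as F

def make_label_assisted_input(image, label, num_classes, noise_ratio=0.3):
    """Concatenate a randomly corrupted one-hot label map to the image channels.

    image: (B, 3, H, W) float tensor; label: (B, H, W) long tensor with values in [0, num_classes).
    """
    flip = torch.rand_like(label, dtype=torch.float) < noise_ratio        # pixels to corrupt
    noisy = torch.where(flip, torch.randint_like(label, num_classes), label)
    noisy_onehot = F.one_hot(noisy, num_classes).permute(0, 3, 1, 2).float()
    return torch.cat([image, noisy_onehot], dim=1)                        # teacher input: (B, 3 + C, H, W)
```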


# 64
Strong Double Blind
Mitigating Background Shift in Class-Incremental Semantic Segmentation

gilhan Park · WonJun Moon · SuBeen Lee · Tae-Young Kim · Jae-Pil Heo

Class-Incremental Semantic Segmentation (CISS) aims to learn new classes without forgetting the old ones, using only the labels of the new classes. To achieve this, two popular strategies are employed: 1) pseudo-labeling and knowledge distillation to preserve prior knowledge; and 2) background weight transfer, which leverages the broad coverage of background in learning new classes by transferring background weight to the new class classifier. However, the first strategy heavily relies on the old model in detecting old classes while undetected pixels are regarded as the background, thereby leading to a background shift towards the old classes (i.e., misclassification of old classes as background). Additionally, in the case of the second approach, initializing the new class classifier with background knowledge triggers a similar background shift issue, but towards the new classes. To address these issues, we propose a background-class separation framework for CISS. To begin with, selective pseudo-labeling and adaptive feature distillation are employed to distill only trustworthy past knowledge. On the other hand, we encourage the separation between the background and new classes with a novel orthogonal objective along with label-guided output distillation. Our state-of-the-art results validate the effectiveness of these proposed methods.


# 109
Strong Double Blind
LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation

Jianan Li · Qiulei Dong

Precisely annotating large-scale 3D datasets for point cloud segmentation is laborious. To alleviate the annotation burden, several semi-supervised 3D segmentation methods have been proposed in the literature. However, two issues remain to be tackled: 1) The utilization of large language-vision models (LVMs) in semi-supervised 3D semantic segmentation remains under-explored. 2) The unlabeled points with low-confidence predictions are directly discarded by existing methods. Taking these two issues into consideration, we propose a language-assisted semi-supervised 3D semantic segmentation method named LASS3D, which is built upon the commonly used Mean Teacher framework. In LASS3D, we use two off-the-shelf LVMs to generate multi-level captions and leverage the images as the bridge to connect the text data and point clouds. Then, a semantic-aware adaptive fusion module is explored in the student branch, where the semantic information encoded in the embeddings of multi-level captions is injected into 3D features by adaptive fusion and then the semantic information in the text-enhanced 3D features is transferred to the teacher branch by knowledge distillation. In addition, a progressive exploitation strategy is explored for the unreliable points in the teacher branch, which can effectively exploit the information encapsulated in unreliable points via negative learning. Experimental results on both outdoor and indoor datasets demonstrate that LASS3D outperforms the comparative methods in most cases. We will release our code upon publication.


# 115
Strong Double Blind
Point-supervised Panoptic Segmentation via Estimating Pseudo Labels from Learnable Distance

Jing Li · Junsong Fan · Zhaoxiang Zhang

To bridge the gap between point labels and per-pixel labels, existing point-supervised panoptic segmentation methods usually estimate dense pseudo labels by assigning unlabeled pixels to corresponding instances according to rule-based pixel-to-instance distances. These rule-based distances involve the Dijkstra algorithm and cannot be optimized by point labels end to end, so the distance results are usually suboptimal, which results in inaccurate pseudo labels. Here we propose to assign unlabeled pixels to corresponding instances based on a learnable distance metric. Specifically, we represent each instance as an anchor query and then predict the pixel-to-instance distance based on the cross-attention between anchor queries and pixel features through a distance branch; the predicted distance is supervised by point labels end to end. To ensure that each query accurately represents the corresponding instance, we iteratively improve the anchor queries through query aggregating and query enhancing processes, and improved distance results are then predicted with these queries. We have experimentally demonstrated the effectiveness of our approach and achieved state-of-the-art results. Codes will be released upon acceptance.


# 119
Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation

Duy Tho Le · Hengcan Shi · Jianfei Cai · Hamid Rezatofighi

Diffusion models have recently gained prominence as powerful deep generative models, demonstrating unmatched performance across various domains. However, their potential in multi-sensor fusion remains largely unexplored. In this work, we introduce DifFUSER, a novel approach that leverages diffusion models for multi-modal fusion in 3D object detection and BEV map segmentation. Benefiting from the inherent denoising property of diffusion, DifFUSER is able to refine or even synthesize sensor features in case of sensor malfunction, thereby improving the quality of the fused output. In terms of architecture, our DifFUSER blocks are chained together in a hierarchical BiFPN fashion, termed cMini-BiFPN, offering an alternative architecture for latent diffusion. We further introduce a Gated Self-conditioned Modulated (GSM) latent diffusion module together with a Progressive Sensor Dropout Training (PSDT) paradigm, designed to add stronger conditioning to the diffusion process and robustness to sensor failure. Our extensive evaluations on the nuScenes dataset reveal that DifFUSER not only achieves state-of-the-art performance with a 69.1% mIoU in BEV map segmentation tasks but also competes effectively with leading transformer-based fusion techniques in 3D object detection. The codebase is at [hidden]


# 65
Strong Double Blind
Zero-shot Object Counting with Good Exemplars

Huilin Zhu · Jingling Yuan · Zhengwei Yang · Yu Guo · Xian Zhong · Zheng Wang · Shengfeng He

Zero-shot object counting (ZOC) aims to enumerate objects in images using only the names of object classes during testing, without the need for manual annotations. However, a critical challenge in current ZOC methods lies in their inability to effectively identify high-quality exemplars. This deficiency hampers scalability across diverse classes and undermines the development of strong visual associations between the identified classes and image content. To this end, we propose the Visual Association-based Zero-shot Object Counting (VA-Count) framework. VA-Count consists of an Exemplar Enhancement Module (EEM) and a Noise Suppression Module (NSM) that synergistically refine the process of class exemplar identification while minimizing the consequences of incorrect object identification. The EEM utilizes advanced Vision-Language Pretraining models to discover potential exemplars, ensuring the framework's adaptability to various classes. Meanwhile, the NSM employs contrastive learning to differentiate between optimal and suboptimal exemplar pairs, reducing the negative effects of erroneous exemplars. The effectiveness and scalability of VA-Count in zero-shot contexts are demonstrated through its superior performance on three object counting datasets.


# 58
Strong Double Blind
SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection

Anay Majee · Ryan X Sharp · Rishabh Iyer

Confusion and forgetting of object classes have been challenges of prime interest in Few-Shot Object Detection (FSOD). To overcome these pitfalls in metric learning based FSOD techniques, we introduce a novel Submodular Mutual Information Learning (SMILe) framework which adopts combinatorial mutual information functions to enforce the creation of tighter and discriminative feature clusters in FSOD. Our proposed approach generalizes to several existing approaches in FSOD, agnostic of the backbone architecture demonstrating elevated performance gains. A paradigm shift from instance based objective functions to combinatorial objectives in SMILe naturally preserves the diversity within an object class resulting in reduced forgetting when subjected to few training examples. Furthermore, the application of mutual information between the already learnt (base) and newly added (novel) objects ensures sufficient separation between base and novel classes, minimizing the effect of class confusion. Experiments on popular FSOD benchmarks, PASCAL-VOC and MS-COCO show that our approach generalizes to State-of-the-Art (SoTA) approaches improving their novel class performance by up to 5.7% (3.3 mAP points) and 5.4% (2.6 mAP points) on the 10-shot setting of VOC (split 3) and 30-shot setting of COCO datasets respectively. Our experiments also demonstrate better retention of base class performance and up to 2x faster convergence over existing approaches agnostic of the underlying architecture.


# 55
Strong Double Blind
Enhancing Source-Free Domain Adaptive Object Detection with Low-confidence Pseudo Label Distillation

Ilhoon Yoon · Hyeongjun Kwon · Jin Kim · Junyoung Park · Hyunsung Jang · Kwanghoon Sohn

Source-free domain adaptation for Object Detection (SFOD) is a promising strategy for deploying trained detectors to new, unlabeled domains without accessing source data, addressing significant concerns around data privacy and efficiency. Most SFOD methods leverage a conventional Mean-Teacher (MT) self-training paradigm relying heavily on High-confidence Pseudo-Labels (HPL). However, these HPL often overlook objects that are unfamiliar across domains, leading to biased adaptation towards objects familiar to the source domain. To address this limitation, we introduce the Low-confidence Pseudo Label Distillation (LPLD) loss within the Mean-Teacher based SFOD framework. This novel approach is designed to leverage the proposals from the Region Proposal Network (RPN), which potentially encompass hard-to-detect objects in unfamiliar domains. Initially, we extract HPL using a standard pseudo-labeling technique and mine a set of Low-Confidence Pseudo Labels (LPL) from proposals generated by the RPN, retaining those that do not overlap significantly with HPL. These LPL are further refined, and an LPLD loss is calculated to leverage class-relation information and reduce the effect of inherent noise. Furthermore, we use feature distance to adaptively weight the LPLD loss to focus on LPL containing more foreground area. Our method outperforms all other SFOD counterparts on four cross-domain object detection benchmarks. Extensive experiments demonstrate that our LPLD loss leads to effective adaptation by reducing false negatives and facilitating the use of general knowledge from the source model. Code is available at https://github.com/AnonymousPaperSource/paper11254.


# 126
MonoTTA: Fully Test-Time Adaptation for Monocular 3D Object Detection

Hongbin Lin · Yifan Zhang · SHUAICHENG NIU · Shuguang Cui · Zhen Li

Monocular 3D object detection (Mono 3Det) aims to identify 3D objects from a single RGB image. However, existing methods often assume training and test data follow the same distribution, which may not hold in real-world test scenarios. To address the out-of-distribution (OOD) problems, we explore a new adaptation paradigm for Mono 3Det, termed Fully Test-time Adaptation. It aims to adapt a well-trained model to unlabeled test data by handling potential data distribution shifts at test time without access to training data and test labels. However, applying this paradigm in Mono 3Det poses significant challenges due to OOD test data causing a remarkable decline in object detection scores. This decline conflicts with the pre-defined score thresholds of existing detection methods, leading to severe object omissions (i.e., rare positive detections and many false negatives). Consequently, the limited positive detections and abundant noisy predictions cause test-time adaptation to fail in Mono 3Det. To handle this problem, we propose a novel Monocular Test-Time Adaptation (MonoTTA) method, based on two new strategies. 1) Reliability-driven adaptation: we empirically find that high-score objects are still reliable and the optimization of high-score objects can enhance confidence across all detections. Thus, we devise a self-adaptive strategy to identify reliable objects for model adaptation, which discovers potential objects and alleviates omissions. 2) Noise-guard adaptation: since high-score objects may be scarce, we develop a negative regularization term to exploit the numerous low-score objects via negative learning, preventing overfitting to noise and trivial solutions. Experimental results show that MonoTTA brings significant performance gains for Mono 3Det models in OOD test scenarios, approximately 190% average gains on KITTI and 198% on nuScenes.
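
The noise-guard strategy rests on negative learning for low-score detections; below is a hedged sketch of such a regularizer (the score threshold and loss form are illustrative, not the exact MonoTTA objective):

```python
import torch

def negative_learning_loss(class_probs, score_thresh=0.2):
    """Penalize confidence on the (likely wrong) predicted class of low-score detections.

    class_probs: (N, K) per-detection class probabilities.
    """
    max_p, _ = class_probs.max(dim=-1)
    low = max_p < score_thresh
    if low.sum() == 0:
        return class_probs.new_tensor(0.0)
    # negative learning: push down p(predicted class) via -log(1 - p)
    return -(1.0 - max_p[low]).clamp_min(1e-6).log().mean()
```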


# 66
Strong Double Blind
AugDETR: Improving Multi-scale Learning for Detection Transformer

Jinpeng Dong · Yutong Lin · Chen Li · Sanping Zhou · Nanning Zheng

Current end-to-end detectors typically exploit transformers to detect objects and show promising performance. Among them, Deformable DETR is a representative paradigm that effectively exploits multi-scale features. However, small local receptive fields and limited query-encoder interactions weaken multi-scale learning. In this paper, we analyze local feature enhancement and multi-level encoder exploitation for improved multi-scale learning and construct a novel detection transformer detector named Augmented DETR (AugDETR) to realize them. Specifically, AugDETR consists of two components: Hybrid Attention Encoder and Encoder-Mixing Cross-Attention. Hybrid Attention Encoder enlarges the receptive field of the deformable encoder and introduces global context features to enhance feature representation. Encoder-Mixing Cross-Attention adaptively leverages multi-level encoders based on query features for more discriminative object features and faster convergence. By combining AugDETR with DETR-based detectors such as DINO, AlignDETR, and DDQ, our models achieve performance improvements of 1.2, 1.1, and 1.0 AP on COCO under the ResNet-50 4-scale, 12-epoch setting, respectively.


# 131
Strong Double Blind
Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter

Suqi Song · Chenxu Zhang · Peng Zhang · Pengkun Li · Fenglong Song · Lei Zhang

Urban waterlogging poses a major risk to public safety and infrastructure. Conventional methods based on water-level sensors require high maintenance yet hardly achieve full coverage. Recent advances employ surveillance camera imagery and deep learning for detection, yet these struggle amidst scarce data and adverse environmental conditions. In this paper, we establish a challenging Urban Waterlogging Benchmark (UW-Bench) under diverse adverse conditions to advance real-world applications. We propose a Large-Small Model co-adapter paradigm (LSM-adapter), which harnesses the substantial generic segmentation potential of the large model and the specific task-directed guidance of the small model. Specifically, a Triple-S Prompt Adapter module alongside a Dynamic Prompt Combiner are proposed to generate and then merge multiple prompts for mask decoder adaptation. Meanwhile, a Histogram Equalization Adapter module is designed to infuse image-specific information for image encoder adaptation. Results and analysis show the challenge and superiority of our developed benchmark and algorithm.


# 118
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion

Junjie Guo · Chenqiang Gao · Fangcen liu · Deyu Meng · Xinbo Gao

Infrared-visible object detection aims to achieve robust object detection around the clock by fusing the complementary information of infrared and visible images. However, highly dynamic and variable complementary characteristics and commonly existing modality misalignment make the fusion of complementary information difficult. In this paper, we propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to simultaneously address these two challenges. Specifically, we propose a Modality Competitive Query Selection strategy to provide useful prior information. This strategy can dynamically select a basic salient modality feature representation for each object. To effectively mine the complementary information and adapt to misalignment situations, we propose a Multispectral Deformable Cross-attention module to adaptively sample and aggregate multi-semantic-level features of infrared and visible images for each object. In addition, we further adopt the cascade structure of DETR to better mine complementary information. Experiments on four public datasets of different scenes demonstrate significant improvements compared to other state-of-the-art methods. The code will be released.


# 72
Strong Double Blind
PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation

Ning Gao · Sanping Zhou · Le Wang · Nanning Zheng

Semi-supervised learning has emerged as a widely adopted technique in the field of medical image segmentation. Existing works focus either on constructing consistency constraints or on generating pseudo labels to provide high-quality supervisory signals, and the main challenge lies in how to keep improving the model's capabilities continuously. In this paper, we propose a simple yet effective semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation, whose goal is to generate high-fidelity pseudo labels by learning robust and diverse features in the training process. Specifically, our PMT employs a standard mean teacher to penalize the consistency of the current state and utilizes two sets of MT architectures for co-training. The two sets of MT architectures are individually updated for prolonged periods to maintain stable model diversity established through performance gaps generated by iteration differences. Additionally, a difference-driven alignment regularizer is employed to expedite the alignment of lagging models with the representation capabilities of leading models. Furthermore, a simple yet effective pseudo-label filtering algorithm is employed for facile evaluation of models and selection of high-fidelity pseudo labels output when models are operating at high performance for co-training purposes. Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms state-of-the-art medical image segmentation approaches across various dimensions.


# 70
ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image

Hallee E. Wong · Marianne Rakic · John Guttag · Adrian V. Dalca

Medical image segmentation is a crucial part of both scientific research and clinical care. With enough labelled data, deep learning models can be trained to accurately automate specific medical image segmentation tasks. However, manually segmenting images to create training data is highly labor intensive and requires domain expertise. We present ScribblePrompt, a flexible neural network based interactive segmentation tool for biomedical imaging that enables human annotators to segment previously unseen structures using scribbles, clicks, and bounding boxes. Through rigorous quantitative experiments, we demonstrate that given comparable amounts of interaction, ScribblePrompt produces more accurate segmentations than previous methods on datasets unseen during training. In a user study with domain experts, ScribblePrompt reduced annotation time by 28% while improving Dice by 15% compared to the next best method. ScribblePrompt's success rests on a set of careful design decisions. These include a training strategy that incorporates both a highly diverse set of images and tasks, novel algorithms for simulated user interactions and labels, and a network that enables fast inference. We showcase ScribblePrompt in an interactive online demo: https://huggingface.co/spaces/anon5167/ScribblePrompt


# 59
Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification

Yunlong Zhang · Honglin Li · YUXUAN SUN · Chenglu Zhu · Sunyi Zheng · Lin Yang

In the application of Multiple Instance Learning (MIL) methods for Whole Slide Image (WSI) classification, attention mechanisms often focus on a subset of discriminative instances, which are closely linked to overfitting. To mitigate overfitting, we present Attention-Challenging MIL (ACMIL). ACMIL combines two techniques based on separate analyses for attention value concentration. Firstly, UMAP of instance features reveals various patterns among discriminative instances, with existing attention mechanisms capturing only some of them. To remedy this, we introduce Multiple Branch Attention (MBA) to capture more discriminative instances using multiple attention branches. Secondly, examination of the cumulative value of Top-K attention scores indicates that a tiny number of instances dominate the majority of attention. In response, we present Stochastic Top-K Instance Masking (STKIM), which masks out a portion of instances with Top-K attention values and allocates their attention values to the remaining instances. The extensive experimental results on three WSI datasets with two pre-trained backbones reveal that our ACMIL outperforms state-of-the-art methods. Additionally, through heatmap visualization and UMAP visualization, this paper extensively illustrates ACMIL's effectiveness in suppressing attention value concentration and overcoming the overfitting challenge. The source code is available in the Supplementary Materials.
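
The STKIM idea can be pictured with a few lines (a simplified sketch; k, the masking probability, and the renormalization are illustrative choices):

```python
import torch

def stochastic_topk_masking(attn, k=10, mask_prob=0.5):
    """Randomly zero some of the top-k attention values and renormalize,
    so the remaining instances absorb the redistributed attention.

    attn: (N,) non-negative attention values over the instances of one bag.
    """
    topk = attn.topk(min(k, attn.numel())).indices
    drop = topk[torch.rand(topk.numel()) < mask_prob]
    masked = attn.clone()
    masked[drop] = 0.0
    return masked / masked.sum().clamp_min(1e-8)
```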


# 61
Strong Double Blind
GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation

Chenxin Li · Xinyu Liu · Cheng Wang · Yifan Liu · Weihao Yu · Jing Shao · Yixuan Yuan

Recent advances in learning multi-modal representation have witnessed success in biomedical domains. While established techniques enable handling multi-modal information, challenges arise when extending to various clinical modalities and the practical modality-missing setting, due to the inherent modality gaps. To tackle these, we propose an innovative Modality-prompted Heterogeneous Graph Learning framework (MERGE), which embeds the numerous disparate clinical modalities into a unified representation, completes the deficient embedding of the missing modality, and reformulates cross-modal learning with a graph-based aggregation. Specifically, we establish a heterogeneous graph embedding to explicitly capture the diverse semantic properties of both the modality-specific features (nodes) and the cross-modal relations (edges). Then, we design a modality-prompted completion that enables completing the inadequate graph representation of a missing modality through a graph prompting mechanism, which generates hallucinated graph topologies to steer the missing embedding towards the intact representation. Through the completed graph, we meticulously develop a knowledge-guided hierarchical cross-modal aggregation consisting of a global meta-path neighbouring module to uncover the potential heterogeneous neighbors along the pathways driven by domain knowledge, and a local multi-relation aggregation module for comprehensive cross-modal interaction across various heterogeneous relations. We assess the efficacy of our methodology in rigorous benchmarked experiments against prior state-of-the-art methods. In a nutshell, MERGE represents an initial foray into the intriguing realm of embedding, relating and perceiving the heterogeneous patterns from the various disparate clinical modalities holistically via graph theory.


# 324
Strong Double Blind
R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection

Zheyuan Zhou · Wang Le · Naiyu Fang · Zili Wang · Lemiao Qiu · Shuyou Zhang

3D anomaly detection plays a crucial role in monitoring parts for localized inherent defects in precision manufacturing. Embedding-based and reconstruction-based approaches are among the most popular and successful methods. However, there are two major challenges to the practical application of current approaches: 1) embedding-based models suffer from prohibitive computation and storage costs due to the memory bank structure; 2) reconstruction-based models built on the MAE mechanism fail to detect anomalies in the unmasked regions. In this paper, we propose R3D-AD, which reconstructs anomalous point clouds with a diffusion model for precise 3D anomaly detection. Our approach capitalizes on the data distribution conversion of the diffusion process to entirely obscure the input's anomalous geometry. It learns a strict point-level displacement behavior step by step, which methodically corrects the aberrant points. To increase the generalization of the model, we further present a novel 3D anomaly simulation strategy named Patch-Gen to generate realistic and diverse defect shapes, which narrows the domain gap between training and testing. Our R3D-AD ensures a uniform spatial transformation, which allows anomaly results to be generated straightforwardly by distance comparison. Extensive experiments show that our R3D-AD outperforms previous state-of-the-art methods, achieving 73.4% AUROC on the Real3D-AD dataset and 74.9% AUROC on the Anomaly-ShapeNet dataset at 4.7 FPS.


# 18
Strong Double Blind
Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation

Guan Gui · Bin-Bin Gao · Jun Liu · Chengjie Wang · Yunsheng Wu

Anomaly detection (including classification and segmentation) is a practical and challenging task due to the scarcity of anomaly samples in industrial inspection. Some existing anomaly detection methods address this issue by synthesizing anomalies with noise or external data. However, there is always a large semantic gap between synthetic anomalies and real-world anomalies, resulting in weak performance in anomaly detection. To solve the above problem, we propose a few-shot anomaly-driven generation method, which guides the diffusion model to generate more realistic and diverse anomalies with only a few real anomalies, thereby benefiting the training of anomaly detection models. Specifically, our work is divided into three stages. In the first stage, we learn the anomaly distribution based on a few given real anomalies and inject the learned knowledge into an embedding. In the second stage, we use the embedding and given bounding boxes to guide the diffusion model to generate realistic and diverse anomalies on specific objects (or textures). In the final stage, we propose a weakly-supervised anomaly detection method to train a more powerful model with generated anomalies. Our method builds upon DRAEM and DesTSeg as the foundation model and conducts experiments on the commonly used industrial anomaly detection dataset, MVTec. The experiments demonstrate that our generated anomalies effectively improve the model performance of both anomaly classification and segmentation tasks simultaneously, e.g., DRAEM and DesTSeg achieved a 5.8% and 1.5% improvement in the AU-PR metric on the segmentation task, respectively. We will make the code and the generated anomalous images (70,760) available for reproducibility.


# 15
Continuous Memory Representation for Anomaly Detection

Joo Chan Lee · Taejune Kim · Eunbyung Park · Simon S Woo · Jong Hwan Ko

There have been significant advancements in anomaly detection in an unsupervised manner, where only normal images are available for training. Several recent methods aim to detect anomalies based on a memory, comparing or reconstructing the input with directly stored normal features (or trained features with normal images). However, such memory-based approaches operate on a discrete feature space implemented by the nearest neighbor or attention mechanism, suffering from poor generalization or an identity shortcut issue that simply reproduces the input, respectively. Furthermore, the majority of existing methods are designed to detect single-class anomalies, resulting in unsatisfactory performance when presented with multiple classes of objects. To tackle all of the above challenges, we propose CRAD, a novel anomaly detection method for representing normal features within a "continuous" memory, enabled by transforming spatial features into coordinates and mapping them to continuous grids. Furthermore, we carefully design the grids tailored for anomaly detection, representing both local and global normal features and fusing them effectively. Our extensive experiments demonstrate that CRAD successfully generalizes the normal features and mitigates the identity shortcut; furthermore, CRAD effectively handles diverse classes in a single model thanks to the high-granularity continuous representation. In an evaluation using the MVTec AD dataset, CRAD significantly outperforms the previous state-of-the-art method by reducing 65.0% of the error for multi-class unified anomaly detection.
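
A hedged sketch of what a "continuous memory" lookup could look like (grid size, the coordinate head, and the bilinear read are assumptions standing in for CRAD's actual design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousMemorySketch(nn.Module):
    """Map each spatial feature to 2D coordinates, then read a learnable feature grid
    at those coordinates with bilinear interpolation (a continuous, differentiable memory)."""

    def __init__(self, dim=256, grid_size=32):
        super().__init__()
        self.to_coord = nn.Linear(dim, 2)                                  # features -> (x, y) in [-1, 1]
        self.grid = nn.Parameter(torch.randn(1, dim, grid_size, grid_size))

    def forward(self, feats):                                              # feats: (B, N, C)
        coords = torch.tanh(self.to_coord(feats)).unsqueeze(2)             # (B, N, 1, 2)
        grid = self.grid.expand(feats.size(0), -1, -1, -1)
        mem = F.grid_sample(grid, coords, align_corners=True)              # (B, C, N, 1)
        return mem.squeeze(-1).transpose(1, 2)                             # (B, N, C) "normal" read-out
```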


# 16
Strong Double Blind
Learning Anomalies with Normality Prior for Unsupervised Video Anomaly Detection

Haoyue Shi · Le Wang · Sanping Zhou · Gang Hua · Wei Tang

Unsupervised video anomaly detection (UVAD) aims to detect abnormal events in videos without any annotations. It remains challenging because anomalies are rare, diverse, and usually not well-defined. Existing UVAD methods are purely data-driven and perform unsupervised learning by identifying various abnormal patterns in videos. Since these methods largely rely on the feature representation and data distribution, they can only learn salient anomalies that are substantially different from normal events but ignore the less distinct ones. To address this challenge, this paper pursues a different approach that leverages data-irrelevant prior knowledge about normal and abnormal events for UVAD. We first propose a new normality prior for UVAD, suggesting that the start and end of a video are predominantly normal. We then propose normality propagation, which propagates normal knowledge based on relationships between video snippets to estimate the normal magnitudes of unlabeled snippets. Finally, unsupervised learning of abnormal detection is performed based on the propagated labels and a new loss re-weighting method. These components are complementary to normality propagation and mitigate the negative impact of incorrectly propagated labels. Extensive experiments on the ShanghaiTech and UCF-Crime benchmarks demonstrate the superior performance of our method. We plan to make the code and trained models publicly available.


# 84
Strong Double Blind
Superpixel-informed Implicit Neural Representation for Multi-Dimensional Data

Jiayi Li · Xi-Le Zhao · Jian-Li Wang · Chao Wang · Min Wang

Recently, implicit neural representations (INRs) have attracted increasing attention for multi-dimensional data recovery. However, INRs simply map coordinates via a multi-layer perceptron (MLP) to corresponding values, ignoring the inherent semantic information of the data. To leverage semantic priors from the data, we propose a novel Superpixel-informed INR (S-INR). Specifically, we suggest utilizing generalized superpixels instead of pixels as the basic unit of INR for multi-dimensional data (e.g., images and weather data). The coordinates of generalized superpixels are first fed into exclusive attention-based MLPs, and then the intermediate results interact with a shared dictionary matrix. The elaborately designed modules in S-INR allow us to ingeniously exploit the semantic information within and across generalized superpixels. Extensive experiments on various applications validate the effectiveness and efficiency of our S-INR compared to state-of-the-art INR methods.


# 86
Strong Double Blind
Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector

Xianren Zhang · Dongwon Lee · Suhang Wang

As deep vision models' popularity rapidly increases, there is a growing emphasis on explanations for model predictions. The inherently explainable attribution method aims to enhance the understanding of model behavior by identifying the important regions in images that significantly contribute to predictions. It is achieved by cooperatively training a selector (generating an attribution map to identify important features) and a predictor (making predictions using the identified features). Despite many advancements, existing methods suffer from the incompleteness problem, where discriminative features are masked out, and the interlocking problem, where the non-optimized selector initially selects background noise, causing the predictor to fit on this noise and perpetuate the cycle. To address these problems, we introduce a new objective that discourages the presence of discriminative features in the masked-out regions thus enhancing the comprehensiveness of feature selection. An external controller is introduced to detect discriminative features in the masked-out region. If the selector selects noise instead of discriminative features, the controller can observe and break the interlocking situation by penalizing the selector. Extensive experiments show that our model makes accurate predictions with higher accuracy than the regular black-box model, and produces attribution maps with high localization ability, fidelity and robustness.


# 89
Fairness-aware Vision Transformer via Debiased Self-Attention

Yao Qiang · Chengyin Li · Prashant Khanduri · Dongxiao Zhu

Vision Transformer (ViT) has recently gained significant attention in solving computer vision (CV) problems due to its capability of extracting informative features and modeling long-range dependencies through the attention mechanism. Whereas recent works have explored the trustworthiness of ViT, including its robustness and explainability, the issue of fairness has not yet been adequately addressed. We establish that the existing fairness-aware algorithms designed for CNNs do not perform well on ViT, which highlights the need for developing our novel framework via Debiased Self-Attention (DSA). DSA is a fairness-through-blindness approach that enforces ViT to eliminate spurious features correlated with the sensitive label for bias mitigation and simultaneously retain real features for target prediction. Notably, DSA leverages adversarial examples to locate and mask the spurious features in the input image patches with an additional attention weights alignment regularizer in the training objective to encourage learning real features for target prediction. Importantly, our DSA framework leads to improved fairness guarantees over prior works on multiple prediction tasks without compromising target prediction performance. The source code will be released upon acceptance.


# 105
Strong Double Blind
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer

Zhuguanyu Wu · Jiaxin Chen · Hanwen Zhong · Di Huang · Yunhong Wang

Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks for various computer vision tasks, which however requires a high computational cost and inference latency. Recently, post-training quantization (PTQ) has emerged as a promising way to promote the efficiency of Vision Transformers. Nevertheless, existing PTQ approaches for ViTs suffer from the inflexible quantization on the post-Softmax and post-GeLU activations with power-law-like distributions. To address the issue, we propose a novel non-uniform quantizer dubbed AdaLog, which adapts the logarithmic base to accommodate the power-law-like distribution of activations, and simultaneously allows for hardware-friendly quantization and de-quantization. By further employing the bias reparameterization, AdaLog Quantizer is applicable to both the post-Softmax and post-GeLU activations. Moreover, we develop a Beam-Search strategy to select the best logarithm base for AdaLog quantizers, as well as those scaling factors and zero points for uniform quantizers. Extensive experimental results on public benchmarks demonstrate the effectiveness of our approach for various ViT-based architectures and vision tasks such as classification, object detection, and instance segmentation.
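
A generic logarithmic quantizer with a tunable base conveys the flavor of the adaptive-log idea (this is a simplified stand-in, not the exact AdaLog formulation with bias reparameterization):

```python
import torch

def log_quantize(x, base=2.0, n_bits=4, eps=1e-8):
    """Quantize activations in (0, 1] (e.g., post-Softmax) to integer exponents of `base`.

    Returns the de-quantized values; in practice the base would be searched per layer.
    """
    levels = 2 ** n_bits - 1
    exponent = torch.log(x.clamp_min(eps)) / torch.log(torch.tensor(base))  # log_base(x) <= 0
    code = torch.clamp(torch.round(-exponent), 0, levels)                   # integer code
    return base ** (-code)
```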


# 79
LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors

Saksham Suri · Matthew Walmer · Kamal Gupta · Abhinav Shrivastava

We present a simple self-supervised method to enhance the performance of ViT features for dense downstream tasks. Our Lightweight Feature Transform (LiFT) is a straightforward and compact postprocessing network that can be applied to enhance the features of any pre-trained ViT backbone. LiFT is fast and easy to train with a self-supervised objective, and it boosts the density of ViT features for minimal extra inference cost. Furthermore, we demonstrate that LiFT can be applied with approaches that use additional task-specific downstream modules, as we integrate LiFT with ViTDet for COCO detection and segmentation. Despite the simplicity of LiFT, we find that it is not simply learning a more complex version of bilinear interpolation. Instead, our LiFT training protocol leads to several desirable emergent properties that benefit ViT features in dense downstream tasks. This includes greater scale invariance for features, and better object boundary maps. By simply training LiFT for a few epochs, we show improved performance on keypoint correspondence, detection, segmentation, and object discovery tasks. Overall, LiFT provides an easy way to unlock the benefits of denser feature arrays for a fraction of the computational cost.


# 25
Strong Double Blind
Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time

Chiao-An Yang · Ziwei Liu · Raymond Yeh

Subsampling layers play a crucial role in deep nets by discarding a portion of an activation map to reduce its spatial dimensions. This encourages the deep net to learn higher-level representations. Contrary to this motivation, we hypothesize that the discarded activations are useful and can be incorporated on the fly to improve models' prediction. To validate our hypothesis, we propose a search and aggregate method to find useful activation maps to be used at test-time. We applied our approach to the task of image classification and semantic segmentation. Extensive experiments over nine different architectures on ImageNet, CityScapes, and ADE20K show that our method consistently improves model test-time performance. Additionally, it complements existing test-time augmentation techniques to provide further performance gains.


# 44
Modality Translation for Object Detection Adaptation without forgetting prior knowledge

Heitor Rapela Medeiros · Masih Aminbeidokhti · Fidel A Guerrero Pena · David Latortue · Eric Granger · Marco Pedersoli

A common practice in deep learning consists of training large neural networks on massive datasets to perform accurately for different domains and tasks. While this methodology may work well in numerous application areas, it transfers poorly across modalities due to the larger distribution shift in data captured using different sensors. This paper focuses on the problem of adapting a large object detection model to one or multiple modalities while being efficient. To do so, we propose ModTr as an alternative to the common approach of fine-tuning large models. ModTr consists of adapting the input with a small transformation network trained to minimize the detection loss directly. The original model can therefore work on the translated inputs without any further change or fine-tuning of its parameters. Experimental results on translating from IR to RGB images on two well-known datasets show that this simple ModTr approach provides detectors that can perform comparably or better than standard fine-tuning without forgetting the original knowledge. This opens the door to a more flexible and efficient service-based detection pipeline in which, instead of using a different detector for each modality, a unique and unaltered server is constantly running, and multiple modalities with the corresponding translations can query it.
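
The overall setup can be sketched as a tiny translation network in front of a frozen detector (layer sizes and the detector call signature are placeholders; only the translator receives gradients from the detection loss):

```python
import torch
import torch.nn as nn

class ModTrSketch(nn.Module):
    """IR -> pseudo-RGB translator feeding an unchanged, frozen detector."""

    def __init__(self, detector):
        super().__init__()
        self.translate = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
        self.detector = detector
        for p in self.detector.parameters():          # original model stays untouched
            p.requires_grad_(False)

    def forward(self, ir_image, targets=None):
        rgb_like = self.translate(ir_image)
        return self.detector(rgb_like, targets)       # detection loss backpropagates into the translator only
```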


# 47
Strong Double Blind
Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition

Yurong Zhang · Honghao Chen · Zhang Xinyu · Xiangxiang Chu · Li Song

Parameter-efficient transfer learning (PETL) is a promising task, aiming to adapt a large-scale pretrained model to downstream tasks with a relatively modest cost. However, current PETL methods struggle to reduce computational complexity and bear a heavy inference burden due to the complete forward pass. This paper presents an efficient visual recognition paradigm, called Dynamic Adapter (Dyn-Adapter), that boosts PETL efficiency by subtly disentangling features at multiple levels. Our approach is simple: first, we devise a dynamic architecture with balanced early heads for multi-level feature extraction, along with an adaptive training strategy. Second, we introduce a bidirectional sparsity strategy driven by the pursuit of powerful generalization ability. These qualities enable us to fine-tune efficiently and effectively: we reduce FLOPs during inference by 50%, while maintaining or even yielding higher recognition accuracy. Extensive experiments on diverse datasets and pretrained backbones demonstrate the potential of Dyn-Adapter serving as a general efficiency booster for PETL. We will make the code publicly available.


# 39
Strong Double Blind
Scaling Backwards: Minimal Synthetic Pre-training?

Ryo Nakamura · Ryu Tadokoro · Ryosuke Yamada · Yuki M Asano · Iro Laina · Christian Rupprecht · Nakamasa Inoue · Rio Yokota · Hirokatsu Kataoka

Pre-training and transfer learning are important building blocks of current computer vision systems. While pre-training is usually performed on large real-world image datasets, in this paper we ask whether this is truly necessary. To this end, we search for a minimal, purely synthetic pre-training dataset that allows us to achieve performance similar to the 1 million images of ImageNet-1k. We construct such a dataset from a single fractal with perturbations. With this, we contribute three main findings. (i) We show that pre-training is effective even with minimal synthetic images, with performance on par with large-scale pre-training datasets like ImageNet-1k for full fine-tuning. (ii) We investigate the single parameter with which we construct artificial categories for our dataset. We find that while the shape differences can be indistinguishable to humans, they are crucial for obtaining strong performances. (iii) Finally, we investigate the minimal requirements for successful pre-training. Surprisingly, we find that a substantial reduction of synthetic images from 1k to 1 can even lead to an increase in pre-training performance, a motivation to further investigate "scaling backwards". Finally, we extend our method from synthetic images to real images to see if a single real image can show a similar pre-training effect through shape augmentation. We find that the use of grayscale images and affine transformations allows even real images to "scale backwards". The code is available at https://github.com/SUPER-TADORY/1p-frac.


# 20
Strong Double Blind
EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification

Suorong Yang · Furao Shen · Jian Zhao

Data augmentation (DA) has been widely used to improve the generalization of deep neural networks. Despite its effectiveness, most existing DA methods typically employ augmentation operations with random magnitudes for each sample, which may introduce noise, induce distribution shifts, and elevate the risk of overfitting. In this paper, we propose EntAugment, a tuning-free and adaptive DA framework. Unlike previous work, EntAugment dynamically assesses and adjusts the augmentation magnitude for each sample during training, leveraging insights into both the inherent complexity of training samples and the evolving status of the deep model. Specifically, in EntAugment, the magnitudes are determined by the information entropy of the probability distribution obtained by applying the softmax function to the model's output. Additionally, to further enhance the efficacy of EntAugment, we introduce a novel entropy regularization term, EntLoss, for better generalization. Theoretical analysis further demonstrates that EntLoss, compared to the traditional cross-entropy loss, achieves closer alignment between the model distribution and the underlying dataset distribution. Moreover, EntAugment and EntLoss can be used separately or jointly. We conduct extensive experiments across multiple image classification tasks and network architectures with thorough comparisons to existing DA methods. The proposed methods outperform others without introducing any auxiliary models or noticeable extra computational cost, highlighting both effectiveness and efficiency. Code will be made available soon.
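As a concrete illustration of the entropy-driven idea described above, a minimal PyTorch sketch might look like the following; the exact entropy-to-magnitude mapping is an assumption, not the authors' implementation.

```python
# Hypothetical sketch: derive a per-sample augmentation magnitude from the
# entropy of the model's softmax output.
import torch
import torch.nn.functional as F

def augmentation_magnitude(logits: torch.Tensor) -> torch.Tensor:
    """Return a per-sample magnitude in [0, 1] from predictive entropy."""
    probs = F.softmax(logits, dim=-1)                                # (B, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)    # (B,)
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))   # log C
    # One plausible mapping: confident (low-entropy) samples get stronger
    # augmentation, uncertain (high-entropy) samples get weaker augmentation.
    return 1.0 - entropy / max_entropy

# Usage: mags = augmentation_magnitude(model(images)); sample i is then
# augmented with strength mags[i] in the current training step.
```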


# 50
Strong Double Blind
Training-Free Model Merging for Multi-target Domain Adaptation

Wenyi Li · Huan-ang Gao · Mingju Gao · Beiwen Tian · Rong Zhi · HAO ZHAO

In this paper, we study multi-target domain adaptation for enhancing the robustness of scene understanding models. While previous methods achieved commendable results through inter-domain consistency losses, they often assumed unrealistic simultaneous access to images from all target domains, overlooking constraints such as data transfer bandwidth limitations and data privacy concerns. Given these challenges, we pose the question: How to merge models adapted independently on distinct domains while bypassing the need for direct access to training data? Our solution to this problem involves two components, merging model parameters and merging model buffers (i.e., normalization layer statistics). For merging model parameters, empirical analyses of mode connectivity surprisingly reveal that linear merging suffices when employing the same pretrained backbone weights for adapting separate models. For merging model buffers, we model the real-world distribution with a Gaussian prior and estimate new statistics from the buffers of separately trained models. Our method is simple yet effective, achieving comparable performance with data combination training baselines, while eliminating the need for accessing training data. Our code and models will be made publicly available.
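To make the two merging steps above concrete, here is a minimal sketch assuming two PyTorch models fine-tuned from the same pretrained backbone; the buffer rule (pooling BatchNorm statistics under a Gaussian assumption) is a plausible reading of the abstract, not the authors' exact estimator.

```python
# Hedged sketch: linear merging of parameters plus Gaussian-style pooling of
# BatchNorm buffers for two models adapted on different target domains.
import copy
import torch
import torch.nn as nn

def merge_models(model_a, model_b, alpha: float = 0.5):
    merged = copy.deepcopy(model_a)

    # 1) Linear merging of learnable (floating-point) parameters.
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged_sd = {}
    for k in sd_a:
        if sd_a[k].is_floating_point():
            merged_sd[k] = alpha * sd_a[k] + (1 - alpha) * sd_b[k]
        else:
            merged_sd[k] = sd_a[k]            # e.g. num_batches_tracked
    merged.load_state_dict(merged_sd)

    # 2) Merging buffers: treat each model's BatchNorm statistics as a
    #    Gaussian and pool mean / variance analytically.
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    for m, ma, mb in zip(merged.modules(), model_a.modules(), model_b.modules()):
        if isinstance(m, bn_types) and m.running_mean is not None:
            mu = alpha * ma.running_mean + (1 - alpha) * mb.running_mean
            var = (alpha * (ma.running_var + ma.running_mean ** 2)
                   + (1 - alpha) * (mb.running_var + mb.running_mean ** 2)
                   - mu ** 2)
            m.running_mean.copy_(mu)
            m.running_var.copy_(var)
    return merged
```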


# 56
CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning

Ziyang Gong · FuHao Li · Yupeng Deng · Dr Deblina Bhattacharjee · Xianzheng Ma · Xiangwei Zhu · Zhenming Ji

Unsupervised Domain Adaptation (UDA) aims to adapt models from labeled source domains to unlabeled target domains. When adapting to adverse scenes, existing UDA methods fail to perform well due to the lack of instructions, leading their models to overlook discrepancies within all adverse scenes. To tackle this, we propose CoDA, which instructs models to distinguish, focus on, and learn from these discrepancies at the scene and image levels. Specifically, CoDA consists of a Chain-of-Domain (CoD) strategy and a Severity-Aware Visual Prompt Tuning (SAVPT) mechanism. CoD focuses on scene-level instructions to divide all adverse scenes into easy and hard scenes, guiding models to adapt from the source domain to easy domains with easy scene images, and then to hard domains with hard scene images, thereby laying a solid foundation for the whole adaptation. Building upon this foundation, we employ SAVPT to dive into more detailed image-level instructions to boost performance. SAVPT features a novel metric, Severity, that divides all adverse scene images into low-severity and high-severity images. Severity then directs visual prompts and adapters, instructing models to concentrate on unified severity features instead of scene-specific features, without adding complexity to the model architecture. CoDA achieves SOTA performance on widely used benchmarks under all adverse scenes. Notably, CoDA outperforms existing methods by 4.6% and 10.3% mIoU on the Foggy Driving and Foggy Zurich benchmarks, respectively. We will make our code available upon acceptance.


# 52
Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Jihai Zhang · Xiang Lan · Xiaoye Qu · Yu Cheng · Mengling Feng · Bryan Hooi

Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data. However, a major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression, a phenomenon where the trained model captures only a limited portion of the information from the input data while overlooking other potentially valuable content. This issue often leads to indistinguishable representations for visually similar but semantically different inputs, adversely affecting downstream task performance, particularly those requiring rigorous semantic comprehension. To address this challenge, we propose a novel model-agnostic Multistage Contrastive Learning (MCL) framework. Unlike standard contrastive learning which inherently captures one single biased feature distribution, MCL progressively learns previously unlearned features through feature-aware negative sampling at each stage, where the negative samples of an anchor are exclusively selected from the cluster it was assigned to in preceding stages. Meanwhile, MCL preserves the previously well-learned features by cross-stage representation integration, integrating features across all stages to form final representations. Our comprehensive evaluation demonstrates MCL's effectiveness and superiority across both unimodal and multimodal contrastive learning, spanning a range of model architectures from ResNet to Vision Transformers (ViT). Remarkably, in tasks where the original CLIP model has shown limitations, MCL dramatically enhances performance, with improvements up to threefold on specific attributes in the recently proposed MMVP benchmark. Codes can be found in the supplementary materials.
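As a small illustration of the stage-wise, feature-aware negative sampling described above, the sketch below (an assumption about the mechanics, not the authors' code) restricts an anchor's negatives to the cluster it was assigned to at the previous stage; how the cluster assignments are produced (e.g., k-means on the previous stage's features) and the contrastive loss itself are left out.

```python
# Hypothetical helper: sample negatives only from the anchor's previous-stage cluster.
import torch

def sample_negatives(anchor_idx: int, prev_cluster_ids: torch.Tensor,
                     num_neg: int, generator=None) -> torch.Tensor:
    """prev_cluster_ids: (N,) cluster id of every training sample at the previous stage."""
    same_cluster = (prev_cluster_ids == prev_cluster_ids[anchor_idx]).nonzero(as_tuple=True)[0]
    same_cluster = same_cluster[same_cluster != anchor_idx]      # exclude the anchor itself
    perm = torch.randperm(len(same_cluster), generator=generator)[:num_neg]
    return same_cluster[perm]                                    # indices of negative samples
```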


# 57
Strong Double Blind
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

Marco Mistretta · Alberto Baldrati · Marco Bertini · Andrew Bagdanov

Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques and eliminates the need for labeled examples during adaptation. Our experiments on more than ten standard benchmark datasets demonstrate that KDPL is very effective at improving generalization of learned prompts for zero-shot domain generalization, zero-shot cross-dataset generalization, and zero-shot base-to-novel class generalization problems. KDPL requires no ground-truth labels for adaptation, and moreover we show that even in the absence of any knowledge of training class names it can be used to effectively transfer knowledge.


# 60
Strong Double Blind
Semantic-guided Robustness Tuning for Few-Shot Transfer Across Extreme Domain Shift

kangyu xiao · Zilei Wang · junjie li

In this work, we focus on cross-domain few-shot classification (CDFSC), which is mostly challenged by the low-data problem as well as the extreme domain shift between base and novel target classes. Current methods typically employ a lightweight backbone and continue to use a linear-probe-like traditional fine-tuning (Trad-FT) paradigm. However, for recently emerging large-scale pre-trained models (LPMs), which have many more parameters and considerable prior knowledge, employing Trad-FT incurs significant risks of overfitting and prior-knowledge damage. In this paper, we propose semantic-guided robustness tuning (SRT), a novel fine-tuning paradigm including modulus-matching-based image-text mixup (MMIT-Mixup) and robustness-invariance fine-tuning (RI-FT), to address the CDFSC challenge of LPMs. Concretely, SRT focuses on achieving robust class-specific representations. It first treats textual information as a robust and domain-invariant conductor, and MMIT-Mixup injects the domain-invariant and class-specific knowledge to obtain domain-invariant prototypes. Then, RI-FT optimizes the distance between features and prototypes to enhance the robustness of the visual encoder. We consider several types of LPMs and conduct extensive experiments, which reveal that SRT is a general solution for the CDFSC challenge of LPMs and outperforms existing methods by a large margin.


# 75
Strong Double Blind
Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts

Andong Tan · Fengtao Zhou · Hao Chen

The concept bottleneck model (CBM) is an interpretable-by-design framework that makes decisions by first predicting a set of interpretable concepts, and then predicting the class label based on the given concepts. Existing CBMs are trained with a fixed set of concepts (concepts are either annotated in the dataset or queried from language models). However, this closed-world assumption is unrealistic in practice, as users may wonder about the role of any desired concept in decision-making after the model is deployed. Inspired by the large success of recent vision-language pre-trained models such as CLIP in zero-shot classification, we propose "OpenCBM" to equip the CBM with open-vocabulary concepts via: (1) aligning the feature space of a trainable image feature extractor with that of CLIP's image encoder via prototype-based feature alignment; (2) simultaneously training an image classifier on the downstream dataset; (3) reconstructing the trained classification head via any set of user-desired textual concepts encoded by CLIP's text encoder. To reveal potentially missing concepts from users, we further propose to iteratively find the concept embedding closest to the residual parameters during the reconstruction until the residual is small enough. To the best of our knowledge, our "OpenCBM" is the first CBM with open-vocabulary concepts, providing users unique benefits such as removing, adding, or replacing any desired concept to explain the model's predictions even after the model is trained. Moreover, our model significantly outperforms the previous state-of-the-art CBM by 9% in classification accuracy on the benchmark dataset CUB-200-2011. The code will be released upon acceptance.


# 54
Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models

Nishad Singhi · Jae Myung Kim · Karsten Roth · Zeynep Akata

Concept Bottleneck Models (CBMs) ground image classification on human-understandable concepts to allow for interpretable model decisions. Crucially, the CBM design inherently allows for human interventions, in which expert users are given the ability to modify potentially misaligned concept choices to influence the decision behavior of the model in an interpretable fashion. However, existing approaches often require numerous human interventions per image to achieve strong performance, posing practical challenges in scenarios where obtaining human feedback is expensive. In this paper, we find that this is noticeably driven by an independent treatment of concepts during intervention, wherein a change of one concept does not influence the use of other ones in the model's final decision. To address this issue, we introduce a trainable concept intervention realignment module, which leverages concept relations to realign concept assignments post-intervention. Across standard, real-world benchmarks, we find that concept realignment can significantly improve intervention efficacy, reducing the number of interventions needed to reach a target classification performance or concept prediction accuracy. In addition, it easily integrates into existing concept-based architectures without requiring changes to the models themselves. This reduced cost of human-model collaboration is crucial to enhance the feasibility of CBMs in resource-constrained environments.


# 19
Strong Double Blind
FlowCon: Out-of-Distribution Detection using Flow-based Contrastive Learning

Saandeep Aathreya · Shaun Canavan

Identifying out-of-distribution (OOD) data is becoming increasingly critical as the applications of deep learning methods in the real world expand. The primary challenge is to ensure the model does not make incorrect predictions on OOD data with high confidence. To address this, most recent methods leverage intermediate activations to identify distinctive signature patterns between in-distribution and OOD data. Moreover, some methods make a normality assumption on these activations, and the corresponding OOD detection algorithms are constructed based on this assumption. However, as models grow in size and accuracy, iterating through all intermediate layers under a normality assumption may not be accurate. To tackle this, we propose a novel flow-based contrastive learning algorithm that relies primarily on the penultimate feature layer. The key component of our method is transforming the features to conform to a normal distribution while retaining the class-specific information. This allows for more robust representation learning, which can be utilized for OOD detection. Our method does not require any additional data or retraining of the original model, which makes it suitable for real-world applications. We empirically demonstrate the competitive performance of our method on various benchmark datasets.


# 21
PixOOD: Pixel-Level Out-of-Distribution Detection

Tomas Vojir · Jan Sochman · Jiri Matas

We propose a dense image prediction out-of-distribution detection algorithm, called PixOOD, which does not require training on samples of anomalous data and is not designed for a specific application, which helps avoid traditional training biases. In order to model the complex intra-class variability of the in-distribution data at the pixel level, we propose an online data condensation algorithm which is more robust than standard K-means and is easily trainable through SGD. We evaluate PixOOD on a wide range of problems and achieve state-of-the-art results on four out of seven datasets. The source code will be released upon acceptance.


# 40
Strong Double Blind
Distributionally Robust Loss for Long-Tailed Multi-Label Image Classification

Dekun Lin · Zhe Cui · Rui Chen · Tailai Peng · xinran xie · Xiaolin Qin

The binary cross-entropy (BCE) loss function is widely utilized in multi-label classification (MLC) tasks, treating each label independently. The log-sum-exp pairwise (LSEP) loss, which emphasizes higher logits for positive classes over negative ones within a sample and accounts for label dependencies, has demonstrated effectiveness for MLC. However, our experiments suggest that its performance in long-tailed multi-label classification (LTMLC) appears to be inferior to that of BCE. In this study, we investigate the impact of the log-sum-exp operation on recognition and explore optimization avenues. Our observations reveal two primary shortcomings of LSEP that lead to its poor performance in LTMLC: 1) the indiscriminate use of label dependencies without consideration of the distribution shift between training and test sets, and 2) the overconfidence in negative labels with features similar to those of positive labels. To mitigate these problems, we propose a distributionally robust loss (DR), which includes class-wise LSEP and a negative gradient constraint. Additionally, our findings indicate that the BCE-based loss is somewhat complementary to the LSEP-based loss, offering enhanced performance upon integration. Extensive experiments conducted on two LTMLC datasets, VOC-LT and COCO-LT, demonstrate the consistent effectiveness of our proposed method. The code will be made publicly available shortly.
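For reference, the standard log-sum-exp pairwise (LSEP) loss discussed above can be written in a few lines of PyTorch; this is the baseline formulation the paper starts from, not the proposed distributionally robust loss.

```python
# Minimal sketch of the standard LSEP loss: log(1 + sum over (positive p,
# negative n) pairs of exp(s_n - s_p)) per sample, averaged over the batch.
import torch

def lsep_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, C) raw scores; targets: (B, C) multi-hot labels in {0, 1}."""
    pos = targets.bool()
    neg = ~pos
    # diff[b, i, j] = s_j - s_i (score of class j minus score of class i)
    diff = logits.unsqueeze(1) - logits.unsqueeze(2)
    # keep pairs where class i is a positive label and class j a negative one
    mask = (pos.unsqueeze(2) & neg.unsqueeze(1)).float()
    exp_terms = torch.exp(diff) * mask           # written for clarity, not numerical stability
    return torch.log(1.0 + exp_terms.sum(dim=(1, 2))).mean()
```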


# 37
Strong Double Blind
Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data

Sneha Paul · Zachary Patterson · Nizar Bouguila

Semi-supervised learning (SSL) has proven effective for learning 3D representations from a small amount of labelled data while utilizing large amounts of unlabelled data. Traditional semi-supervised approaches rely on the fundamental concept of predicting pseudo-labels for unlabelled data and incorporating them into the learning process. However, we identify that existing methods do not fully utilize all the unlabelled samples and consequently limit their potential performance. To address this issue, we propose AllMatch, a novel SSL-based 3D classification framework that effectively utilizes all the unlabelled samples. AllMatch comprises three modules: (1) an adaptive hard augmentation module that applies relatively hard augmentations to highly confident unlabelled samples with lower loss values, thereby enhancing the contribution of such samples; (2) an inverse learning module that further improves the utilization of unlabelled data by learning what not to learn; and (3) a contrastive learning module that ensures learning from all the samples in both supervised and unsupervised settings. Comprehensive experiments on two popular 3D datasets demonstrate a performance improvement of up to 11.2% with 1% labelled data, surpassing the SOTA by a significant margin. Furthermore, AllMatch efficiently leverages all the unlabelled data, as demonstrated by the fact that with only 10% of the labelled data it reaches nearly the same performance as fully supervised learning with all labelled data. The code of our work is available at: https://anonymous.4open.science/r/AllMatch.


# 42
GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition

Ruijie Yao · Sheng Jin · Lumin Xu · Wang Zeng · Wentao Liu · Chen Qian · Ping Luo · Ji Wu

Multi-Label Image Recognition (MLIR) is a challenging task that aims to predict multiple object labels in a single image while modeling the complex relationships between labels and image regions. Although convolutional neural networks and vision transformers have succeeded in processing images as regular grids of pixels or patches, these representations are sub-optimal for capturing irregular and discontinuous regions of interest. In this work, we present the first fully graph convolutional model, Group K-nearest neighbor based Graph convolutional Network (GKGNet), which models the connections between semantic label embeddings and image patches in a flexible and unified graph structure. To address the scale variance of different objects and to capture information from multiple perspectives, we propose the Group KGCN module for dynamic graph construction and message passing. Our experiments demonstrate that GKGNet achieves state-of-the-art performance with significantly lower computational costs on the MS-COCO and VOC2007 datasets. We will release the code and models to facilitate future research in this area.


# 43
Strong Double Blind
Generalized Coverage for More Robust Low-Budget Active Learning

Wonho Bae · Junhyug Noh · Danica J. Sutherland

The ProbCover method of Yehuda et al. is a well-motivated algorithm for active learning in low-budget regimes, which attempts to cover the data distribution with balls of a given radius at selected data points. We demonstrate, however, that the performance of this algorithm is extremely sensitive to the choice of this radius hyper-parameter, and that tuning it is quite difficult, with the original heuristic frequently failing. We thus introduce (and theoretically motivate) a generalized notion of coverage, including ProbCover's objective as a special case, but also allowing smoother notions that are far more robust to hyper-parameter choice. We propose an efficient greedy method to optimize this coverage, generalizing ProbCover's algorithm; due to its close connection to kernel herding, we call it MaxHerding. The objective can also be optimized non-greedily through a variant of k-medoids, clarifying the relationship to other low-budget active learning methods. In comprehensive experiments, MaxHerding surpasses existing active learning methods across multiple low-budget image classification benchmarks, and does so with less computational cost than most competitive methods.
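To give a flavour of what greedy coverage maximisation can look like, here is a hedged sketch with an RBF kernel: at each step it adds the point that most increases the average best similarity of all points to the selected set. This is a generic facility-location-style greedy under stated assumptions, not necessarily the exact MaxHerding objective or implementation.

```python
# Greedy kernel-coverage selection (O(n^2) memory; intended for small pools).
import numpy as np

def greedy_coverage(X: np.ndarray, budget: int, gamma: float = 1.0):
    """X: (n, d) feature matrix; returns indices of the selected points."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq_dists)                 # pairwise RBF similarities

    selected, best_sim = [], np.zeros(len(X))     # best_sim[i] = max over selected j of K[i, j]
    for _ in range(budget):
        # Marginal gain of adding candidate j to the selected set.
        gains = np.maximum(K, best_sim[:, None]).mean(axis=0) - best_sim.mean()
        if selected:
            gains[selected] = -np.inf             # never pick the same point twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_sim = np.maximum(best_sim, K[:, j])
    return selected
```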


# 51
Strong Double Blind
Robust Nearest Neighbors for Source-Free Domain Adaptation under Class Distribution Shift

Antonio Tejero-de-Pablos · Riku Togashi · Mayu Otani · Shin’ichi Satoh

The goal of source-free domain adaptation (SFDA) is retraining a model fit on data from a source domain (e.g., drawings) to classify data from a target domain (e.g., photos) employing only the target samples. In addition to the domain shift, in a realistic scenario, the number of samples per class in the source and target would also differ (i.e., class distribution shift, or CDS). Dealing with CDS using only unlabeled target data is challenging, and thus previous methods assume no class imbalance in the source data. We study the SFDA pipeline and, for the first time, propose an SFDA method that can deal with class imbalance in both source and target data. While pseudo-labeling is the core technique in SFDA to estimate the distribution of the target data, it relies on nearest neighbors, which makes it sensitive to CDS. We calculate robust nearest neighbors by leveraging additional generic features free of the source model's CDS bias. This provides a "second opinion" regarding which nearest neighbors are more suitable for adaptation. We evaluate our method using various types of features, datasets, and tasks, outperforming previous methods in SFDA under CDS. Our code is available at https://github.com//.


# 45
Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery

Grzegorz Rypesc · Daniel Marczak · Sebastian Cygert · Tomasz Trzcinski · Bartlomiej Twardowski

Generalized Continual Category Discovery (GCCD) tackles learning from sequentially arriving, partially labeled datasets while uncovering new categories. Traditional methods depend on feature distillation to prevent forgetting of old knowledge. However, this strategy restricts the model's ability to adapt and effectively distinguish new categories. To address this, we introduce a novel technique integrating a learnable projector with feature distillation, thus enhancing model adaptability without sacrificing past knowledge. The resulting distribution shift of the previously learned categories is mitigated with an auxiliary category adaptation network. We demonstrate that while each component offers modest benefits individually, their combination -- dubbed CAMP (Category Adaptation Meets Projected distillation) -- significantly improves the balance between learning new information and retaining old knowledge. CAMP exhibits superior performance across several GCCD scenarios with or without exemplars. Additionally, CAMP carries over to the well-established Class Incremental Learning setting, achieving state-of-the-art results.
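A minimal sketch of the projector-plus-distillation idea described above, under assumed feature dimensions and an assumed MSE objective (not the authors' exact losses), could look like this:

```python
# Hypothetical projected feature distillation: match a *projection* of the new
# model's features to the frozen old model's features, so the new backbone can
# still drift to accommodate newly discovered categories.
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 512                                  # assumed backbone feature dimension
projector = nn.Linear(feat_dim, feat_dim)       # learnable projector, trained jointly

def projected_distillation_loss(new_feats, old_feats):
    return F.mse_loss(projector(new_feats), old_feats.detach())
```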


# 46
Strong Double Blind
CroMo-Mixup: Augmenting Cross-Model Representations for Continual Self-Supervised Learning

Erum Mushtaq · Duygu Nur Yaldiz · Yavuz Faruk Bakman · Jie Ding · Chenyang Tao · Dimitrios Dimitriadis · Salman Avestimehr

Continual self-supervised learning (CSSL) learns a series of tasks sequentially on unlabeled data. Catastrophic forgetting and task confusion are considered the two main challenges of continual learning. While the CSSL problem has been studied to address the catastrophic forgetting challenge, little work has been done to address the task confusion aspect. Through extensive experiments, we demonstrate that self-supervised learning (SSL) can make CSSL more susceptible to the task confusion problem, particularly in less diverse settings of class-incremental learning, because different classes belonging to different tasks are not trained concurrently. Motivated by this challenge, we present a novel cross-model feature Mixup (CroMo-Mixup) framework that addresses this issue through two key components: 1) Cross-Task data Mixup, which mixes samples across tasks to enhance negative sample diversity; and 2) Cross-Model feature Mixup, which encourages similarity between the current model's embeddings of the mixed samples and the old model's embeddings of the original images, thereby learning cross-task class contrast and facilitating retrieval of old knowledge. We evaluate the effectiveness of CroMo-Mixup in improving both task-ID prediction and average linear accuracy across all tasks on three datasets, CIFAR10, CIFAR100, and tinyImageNet, under different class-incremental learning settings. We validate the compatibility of CroMo-Mixup with four state-of-the-art SSL objectives.


# 29
Strong Double Blind
Disentangling Masked Autoencoders for Unsupervised Domain Generalization

An Zhang · Han Wang · Xiang Wang · Tat-Seng Chua

Domain Generalization (DG), designed to enhance out-of-distribution (OOD) generalization, is all about learning invariance against domain shifts utilizing sufficient supervision signals. Yet the scarcity of such labeled data has led to the rise of unsupervised domain generalization (UDG) — a more important yet challenging task in which models are trained across diverse domains in an unsupervised manner and eventually tested on unseen domains. UDG is fast gaining attention but is still far from well studied. To close the research gap, we propose a novel learning framework designed for UDG, termed the Disentangled Masked AutoEncoder (DisMAE), aiming to discover disentangled representations that faithfully reveal the intrinsic features and superficial variations without access to class labels. At its core is the distillation of domain-invariant semantic features, which cannot be distinguished by a domain classifier, while filtering out the domain-specific variations (for example, color schemes and texture patterns) that are unstable and redundant. Notably, DisMAE co-trains an asymmetric dual-branch architecture with semantic and lightweight variation encoders, offering dynamic data manipulation and representation-level augmentation capabilities. Extensive experiments on four benchmark datasets (i.e., DomainNet, PACS, VLCS, and Colored MNIST) with both DG and UDG tasks demonstrate that DisMAE achieves competitive OOD performance compared with state-of-the-art DG and UDG baselines, shedding light on potential research lines for improving generalization ability with large-scale unlabeled data.


# 48
Strong Double Blind
Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion

Linlan Huang · Xusheng Cao · Haori Lu · Xialei Liu

Class-incremental learning is a challenging problem, where the goal is to train a model that can classify data from an increasing number of classes over time. Vision-language pre-trained models such as CLIP demonstrate good generalization ability that allows them to excel in class-incremental learning even with completely frozen parameters. However, further adaptation to downstream tasks by simply fine-tuning the model leads to severe forgetting. Most existing works with pre-trained models assume that the forgetting of old classes is uniform when the model acquires new knowledge. In this paper, we propose a method that leverages the textual features of class names to measure the degree of influence of new classes on old classes and adjusts their representations accordingly to reduce forgetting. In addition, we propose a decomposed parameter fusion method for the adapter module, which greatly reduces the forgetting caused by fine-tuning the adapter modules with new data. Experiments on several conventional benchmarks show that our method achieves state-of-the-art results.


# 41
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

Talfan Evans · Shreya Pathak · Hamza Merzic · Jonathan Richard Schwarz · Ryutaro Tanno · Olivier Henaff

Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow. Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples. Despite their appeal, these methods have yet to be widely adopted since no single algorithm has been shown to (a) generalize across models and tasks, (b) scale to large datasets, and (c) yield overall FLOP savings when accounting for the overhead of data selection. In this work we propose a method that satisfies these three properties, leveraging small, cheap proxy models to estimate “learnability” scores for datapoints, which are used to prioritize data for the training of much larger models. As a result, our models require 46% and 51% fewer training updates and up to 25% less total computation to reach the same performance as uniformly trained visual classifiers on JFT and multimodal models on ALIGN. Finally, we find our data-prioritization scheme to be complementary with recent data-curation and learning objectives, yielding a new state-of-the-art in several multimodal transfer tasks.


# 38
Strong Double Blind
Information Bottleneck Based Data Correction in Continual Learning

Shuai Chen · mingyi zhang · Junge Zhang · Kaiqi Huang

Continual Learning (CL) requires a model to retain previously learned knowledge while learning new tasks. Recently, experience replay-based methods have made significant progress in addressing this challenge. These methods primarily select data from old tasks and store them in a buffer. When learning a new task, they train the model using both the current and buffered data. However, the limited amount of old data can lead to the model being overly influenced by the new tasks. The repeated replaying of buffer data and the gradual discarding of old task data (unsampled data) also result in a biased estimation of the model towards the old tasks, causing overfitting issues. All these factors can affect CL performance. Therefore, we propose a data correction algorithm based on the Information Bottleneck (IBCL) to enhance the performance of replay-based CL systems. This algorithm comprises two components: the Information Bottleneck Task Against Constraints (IBTA), which encourages the buffer data to learn task-relevant features related to the old tasks, thereby reducing the impact of new tasks; and the Information Bottleneck Unsampled Data Surrogate (IBDS), which models the information of the unsampled data from the old tasks to alleviate data bias. Our method can be flexibly combined with most existing experience replay methods. We have verified the effectiveness of our method through a series of experiments, demonstrating its potential for improving the performance of CL algorithms.


# 49
Strong Double Blind
Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning

XINYUAN GAO · Songlin Dong · Yuhang He · Qiang Wang · Yihong Gong

Rehearsal-Free Continual Learning (RFCL) aims to continually learn new knowledge while preventing forgetting of old knowledge, without storing any old samples or prototypes. The latest methods leverage large-scale pre-trained models as the backbone and use key-query matching to generate trainable prompts to learn new knowledge. However, the domain gap between the pre-training dataset and the downstream datasets can easily lead to inaccuracies in key-query-matching prompt selection when queries are generated directly with the pre-trained model, which hampers learning new knowledge. Thus, in this paper, we propose a beyond-prompt-learning approach to the RFCL task, called Continual Adapter (C-ADA). It mainly comprises a parameter-extensible continual adapter layer (CAL) and a scaling and shifting (S&S) module in parallel with the pre-trained model. C-ADA flexibly extends specific weights in the CAL to learn new knowledge for each task and freezes old weights to preserve prior knowledge, thereby avoiding the matching errors and operational inefficiencies introduced by key-query matching. To reduce the gap, C-ADA employs the S&S module to transfer the feature space from the pre-training datasets to the downstream datasets. Moreover, we propose an orthogonal loss to mitigate the interaction between old and new knowledge. Our approach achieves significantly improved performance and training speed, outperforming the current state-of-the-art (SOTA) method. Additionally, we conduct experiments on domain-incremental learning, surpassing the SOTA and demonstrating the generality of our approach in different settings. The source code is provided in the supplementary material.


# 34
Strong Double Blind
Markov Knowledge Distillation: Make Nasty Teachers trained by Self-undermining Knowledge Distillation Fully Distillable

En-Hui Yang · Linfeng Ye

To protect the intellectual property of a deep neural network (DNN), two knowledge distillation (KD) related concepts are proposed: distillable DNNs and KD-resistant DNNs. A DNN is said to be distillable if, used as a black-box input-output teacher, it can be distilled by a KD method to train a student model so that the distilled student outperforms the student trained alone with label smoothing (LS student) in terms of accuracy. A DNN is said to be KD-resistant with respect to a specific KD method if, used as a black-box input-output teacher, it cannot be distilled by that specific KD method to yield a distilled student outperforming the LS student in terms of accuracy. A new KD method called Markov KD (MKD) is further presented. When applied to nasty teachers trained by self-undermining KD, MKD makes those nasty teachers fully distillable, although those nasty teachers are shown to be KD-resistant with respect to state-of-the-art KD methods existing in the literature before our work. When applied to normal teachers, MKD yields distilled students outperforming those trained by KD from the same normal teachers by a large margin. More interestingly, MKD is capable of transferring knowledge from teachers trained in one domain to students trained in another domain.


# 33
FedRA: A Random Allocation Strategy for Federated Tuning to Unleash the Power of Heterogeneous Clients

Shangchao Su · Bin Li · Xiangyang Xue

With the increasing availability of Foundation Models, federated tuning has garnered attention in the field of federated learning, utilizing data and computation resources from multiple clients to collaboratively fine-tune foundation models. However, in real-world federated scenarios there is often a multitude of heterogeneous clients with varying computation and communication resources, rendering them incapable of supporting the entire model fine-tuning process. In response to this challenge, we propose a novel federated tuning algorithm, FedRA. The implementation of FedRA is straightforward and can be seamlessly integrated into any transformer-based model without the need for further modification to the original model. Specifically, in each communication round, FedRA randomly generates an allocation matrix. For resource-constrained clients, it reorganizes a small number of layers from the original model based on the allocation matrix and fine-tunes them using adapters. Subsequently, the server aggregates the updated adapter parameters from the clients into the corresponding layers of the original model according to the current allocation matrix. Notably, FedRA also supports scenarios in which none of the clients can support the entire global model, which is an impressive advantage. We conduct experiments on two large-scale image datasets, DomainNet and NICO++, under various non-iid settings. The results demonstrate that FedRA outperforms the compared methods significantly.
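The round structure described above can be pictured with a small sketch (names and shapes are assumptions, not the authors' code): a random 0/1 allocation matrix decides which layers each client fine-tunes with adapters, and the server averages each layer's adapter updates over the clients assigned to it.

```python
# Hypothetical FedRA-style allocation and aggregation helpers.
import numpy as np

def make_allocation(num_clients: int, num_layers: int,
                    layers_per_client: int, rng: np.random.Generator) -> np.ndarray:
    """Return a (num_clients, num_layers) 0/1 allocation matrix for one round."""
    alloc = np.zeros((num_clients, num_layers), dtype=int)
    for c in range(num_clients):
        chosen = rng.choice(num_layers, size=layers_per_client, replace=False)
        alloc[c, chosen] = 1
    return alloc

def aggregate(adapter_updates, alloc: np.ndarray):
    """adapter_updates[c][l]: client c's adapter update for layer l (or None).
    Average each layer over the clients that were allocated that layer."""
    num_clients, num_layers = alloc.shape
    merged = []
    for l in range(num_layers):
        assigned = [adapter_updates[c][l] for c in range(num_clients) if alloc[c, l]]
        merged.append(sum(assigned) / len(assigned) if assigned else None)
    return merged
```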


# 13
SkyMask: Attack-agnostic Robust Federated Learning with Fine-grained Learnable Masks

Peishen Yan · Hao Wang · Tao Song · Yang Hua · Ruhui Ma · Ningxin Hu · Mohammad Reza Haghighat · Haibing Guan

Federated Learning (FL) is becoming a popular paradigm for leveraging distributed data and preserving data privacy. However, due to their distributed nature, FL systems are vulnerable to Byzantine attacks, in which compromised clients attack the global model by uploading malicious model updates. With the development of layer-level and parameter-level fine-grained attacks, the stealthiness and effectiveness of such attacks have been significantly improved. Existing defense mechanisms solely analyze the model-level statistics of individual model updates uploaded by clients to mitigate Byzantine attacks, which is ineffective against fine-grained attacks due to unawareness or overreaction. To address this problem, we propose SkyMask, a new attack-agnostic robust FL system that is the first to leverage fine-grained learnable masks to identify malicious model updates at the parameter level. Specifically, the FL server freezes the model updates uploaded by clients, multiplies them with parameter-level masks, and trains the masks over a small clean dataset (i.e., the root dataset) to learn the subtle differences between benign and malicious model updates in a high-dimensional space. Our extensive experiments involve different models on three public datasets under state-of-the-art (SOTA) attacks, where the results show that SkyMask achieves up to 14% higher testing accuracy compared with SOTA defense strategies under the same attacks and successfully defends against attacks with a high fraction of malicious clients, up to 80%. SkyMask will be open-sourced upon acceptance.


# 11
SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference

Alind Khare · Animesh Agrawal · Aditya Annavajjala · Payman Behnam · Myungjin Lee · Hugo M Latapie · Alexey Tumanov

Neural Architecture Search (NAS) for Federated Learning (FL) is an emerging field. It automates the design and training of Deep Neural Networks (DNNs) when data cannot be centralized due to privacy, communication costs, and regulatory restrictions. Recent federated NAS methods not only reduce manual effort but also provide higher accuracy than traditional FL methods like FedAvg. Despite this success, existing federated NAS methods fail to satisfy the diverse deployment targets common in on-device inference, such as hardware, latency budgets, or variable battery levels. Most federated NAS methods search over only a limited range of architectural patterns and repeat the same pattern throughout the DNN, thereby harming performance. Moreover, these methods incur prohibitive training costs to satisfy deployment targets, performing the training and search of DNN architectures repeatedly for each case. We propose FedNasOdin to address these challenges. It decouples training and search in federated NAS. FedNasOdin co-trains a large number of diverse DNN architectures contained inside one supernet in the FL setting. Post-training, clients perform NAS locally to find specialized DNNs by extracting different parts of the trained supernet with no additional training. FedNasOdin takes O(1) (instead of O(N)) cost to find specialized DNN architectures in FL for any N deployment targets. As part of FedNasOdin, we introduce MaxNet, a novel FL training algorithm that performs multi-objective federated optimization of a large number of DNN architectures (≈ 5 × 10^18) under different client data distributions. Overall, FedNasOdin achieves up to 37.7% higher accuracy for the same MACs or up to an 8.13x reduction in MACs for the same accuracy compared with existing federated NAS methods.


# 30
Strong Double Blind
Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap

Junhao Dong · Piotr Koniusz · Junxi Chen · Yew Soon Ong

Adversarial robustness generally relies on large-scale architectures and datasets alongside extensive adversary generation, hindering resource-efficient deployment. For scalable solutions, adversarially robust knowledge distillation has emerged as a principled strategy, facilitating the transfer of adversarial robustness from a large-scale teacher model to a lightweight student model. However, existing works focus solely on sample-to-sample alignment of features or predictions between the teacher and student models, overlooking the vital role of their statistical alignment. Thus, we propose a novel adversarially robust knowledge distillation method that integrates the alignment of feature distributions between the teacher and student backbones under adversarial and clean sample sets. To motivate our idea, for an adversarially trained model (e.g., student or teacher), we show that the adversarially robust accuracy (evaluated on testing adversarial samples under an increasing perturbation radius) correlates negatively with the gap between the feature variance evaluated on testing adversarial samples and testing clean samples. Such a negative correlation exhibits a strong linear trend, suggesting that aligning the feature covariance of the student model toward the feature covariance of the teacher model should improve the adversarial robustness of the student model by reducing the variance gap. A similar trend is observed when reducing the variance gap between the Gram matrices of the student and teacher models. Extensive evaluations highlight the state-of-the-art adversarial robustness and natural performance of our method across diverse datasets and distillation scenarios.


# 17
Strong Double Blind
Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset

Mijoo Kim · Junseok Kwon

With the rapid advancement in the performance of deep neural networks (DNNs), there has been significant interest in deploying and incorporating artificial intelligence (AI) systems into real-world scenarios. However, many DNNs lack the ability to represent uncertainty, often exhibiting excessive confidence even when making incorrect predictions. To ensure the reliability of AI systems, particularly in safety-critical cases, DNNs should transparently reflect the uncertainty in their predictions. In this paper, we investigate robust post-hoc uncertainty calibration methods for DNNs within the context of multi-class classification tasks. While previous studies have made notable progress, they still face challenges in achieving robust calibration, particularly in scenarios involving out-of-distribution (OOD) data. We identify that previous methods lack adaptability to individual input data and struggle to accurately estimate uncertainty when processing inputs drawn from wild datasets. To address this issue, we introduce a novel instance-wise calibration method based on an energy model. Our method incorporates energy scores instead of softmax confidence scores, allowing for adaptive consideration of DNN uncertainty for each prediction within the logit space. In experiments, we show that the proposed method consistently maintains robust performance across the spectrum, spanning from in-distribution to OOD scenarios, when compared to other state-of-the-art methods.
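The energy score referred to above is commonly computed from the logits as E(x) = -T · logsumexp(z / T); a minimal sketch is shown below. The instance-wise scaling built on top of it is the paper's contribution and is not reproduced here.

```python
# Energy score from logits; lower energy roughly means "more in-distribution".
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Usage: e = energy_score(model(x)); e can stand in for the max-softmax
# confidence when deciding how strongly to scale each prediction's logits.
```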


# 12
Strong Double Blind
Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective

Zhaoxin Wang · Handing Wang · Cong Tian · Yaochu Jin

Adversarial training (AT) has become an effective defense method against adversarial examples (AEs) and is typically framed as a bi-level optimization problem. Among various AT methods, fast AT (FAT), which employs a single-step attack strategy to guide the training process, can achieve good robustness against adversarial attacks at a low cost. However, FAT methods suffer from the catastrophic overfitting problem, especially on complex tasks or with large-parameter models. In this work, we propose a FAT method termed FGSM-PCO, which mitigates catastrophic overfitting by averting the collapse of the inner optimization problem in the bi-level optimization process. FGSM-PCO generates current-stage AEs from the historical AEs and incorporates them into the training process using an adaptive mechanism. This mechanism determines an appropriate fusion ratio according to the performance of the AEs on the training model. Coupled with a loss function tailored to the training framework, FGSM-PCO can alleviate catastrophic overfitting and help an overfitted model recover to effective training. We evaluate our algorithm across three models and three datasets to validate its effectiveness. Comparative empirical studies against other FAT algorithms demonstrate that our proposed method effectively addresses overfitting issues left unresolved by existing algorithms.
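For context, the single-step attack that FAT builds on is the FGSM perturbation, and the abstract describes fusing historical and current-stage adversarial examples with an adaptive ratio. The sketch below shows standard FGSM plus a fixed-ratio convex fusion as a placeholder; the adaptive ratio selection and tailored loss of FGSM-PCO are not reproduced.

```python
# Standard FGSM adversarial example plus an illustrative fusion of current and
# historical adversarial examples (fixed ratio used only as a placeholder).
import torch

def fgsm_example(model, loss_fn, x, y, eps: float):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def fused_example(x_adv_now, x_adv_hist, ratio: float):
    # Convex blend of current-stage and historical adversarial examples;
    # a projection back onto the eps-ball around the clean input is omitted.
    return (ratio * x_adv_now + (1 - ratio) * x_adv_hist).clamp(0, 1)
```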


# 6
Catastrophic Overfitting: A Potential Blessing in Disguise

MN Zhao · Lihe Zhang · Yuqiu Kong · Baocai Yin

Fast Adversarial Training (FAT) has gained increasing attention within the research community owing to its efficacy in improving adversarial robustness. Particularly noteworthy is the challenge posed by catastrophic overfitting (CO) in this field. Although existing FAT approaches have made strides in mitigating CO, the ascent of adversarial robustness occurs with a non-negligible decline in classification accuracy on clean samples. To tackle this issue, we initially employ the feature activation differences between clean and adversarial examples to analyze the underlying causes of CO. Intriguingly, our findings reveal that CO can be attributed to the feature coverage induced by a few specific pathways. By intentionally manipulating feature activation differences in these pathways with well-designed regularization terms, we can effectively mitigate and induce CO, providing further evidence for this observation. Notably, models trained stably with these terms exhibit superior performance compared to prior FAT work. On this basis, we harness CO to achieve "attack obfuscation", aiming to bolster model performance. Consequently, the models suffering from CO can attain optimal classification accuracy on both clean and adversarial data when adding random noise to inputs during evaluation. We also validate their robustness against transferred adversarial examples and the necessity of inducing CO to improve robustness. Hence, CO may not be a problem that has to be solved.


# 9
Strong Double Blind
Cocktail Universal Adversarial Attack on Deep Neural Networks

Shaoxin Li · Xiaofeng Liao · Xin Che · Xintong Li · Yong Zhang · Lingyang Chu

Deep neural networks (DNNs) for image classification are known to be susceptible to many diversified universal adversarial perturbations (UAPs), where each UAP successfully attacks a large but substantially different set of images. Properly combining the diversified UAPs can significantly improve the attack effectiveness, as the sets of images successfully attacked by different UAPs are complementary to each other. In this paper, we study this novel type of attack by developing a cocktail universal adversarial attack framework. The key idea is to train a set of diversified UAPs and a selection neural network at the same time, such that the selection neural network can choose the most effective UAP when attacking a new target image. Due to the simplicity and effectiveness of the cocktail attack framework, it can be generally used to significantly boost the attack effectiveness of many classic single-UAP methods that use a single UAP to attack all target images. The proposed cocktail attack framework is also able to perform real-time attacks as it does not require additional training or fine-tuning when attacking new target images. Extensive experiments demonstrate the outstanding performance of cocktail attacks.


# 10
Strong Double Blind
Unveiling Privacy Risks in Stochastic Neural Networks Training: Effective Image Reconstruction from Gradients

Yiming Chen · Xiangyu Yang · Nikos Deligiannis

Federated Learning (FL) provides a framework for collaborative training of deep learning models while preserving data privacy by avoiding sharing the training data. However, recent studies have shown that a malicious server can reconstruct training data from the shared gradients of traditional neural networks (NNs) in FL, via Gradient Inversion Attacks (GIAs) that emulate the client's training process. Contrary to earlier beliefs that Stochastic Neural Networks (SNNs) are immune to such attacks due to their stochastic nature (which makes the training process challenging to mimic), our findings reveal that SNNs are equally susceptible to GIAs as SNN gradients contain the information of stochastic components, allowing attackers to reconstruct and disclose those uncertain components. In this work, we play the role of an attacker and propose a novel attack method, named Inverting Stochasticity from Gradients (ISG), that can successfully reconstruct the training data by formulating the stochastic training process of SNNs as a variant of the traditional NN training process. Furthermore, to improve the fidelity of the reconstructed data, we introduce a feature constraint strategy. Extensive experiments validate the effectiveness of our GIA and suggest that perturbation-based defenses in forward propagation, such as using SNNs, fail to secure models against GIAs inherently. The source code of our work will be made available to the public on GitHub.


# 4
Strong Double Blind
Rethinking Data Bias: Dataset Copyright Protection via Embedding Class-wise Hidden Bias

Jinhyeok Jang · ByungOk Han · Jaehong Kim · Chan-Hyun Youn

Public datasets play a crucial role in advancing data-centric AI, yet they remain vulnerable to illicit uses. This paper presents "undercover bias", a novel dataset watermarking method that can reliably identify and verify unauthorized data usage. Our approach is inspired by the observation that trained models often inadvertently learn biased knowledge and can function on bias-only data, even without any information directly related to the target task. Leveraging this, we deliberately embed class-wise hidden bias via unnoticeable watermarks, which are unrelated to the target dataset but share the same labels. Consequently, a model trained on this watermarked data covertly learns to classify these watermarks. The model's performance in classifying the watermarks serves as irrefutable evidence of unauthorized usage, which cannot be achieved by chance. Our approach presents multiple benefits: 1) stealthy and model-agnostic watermarks; 2) minimal impact on the target task; 3) irrefutable evidence of misuse; and 4) improved applicability in practical scenarios. We validate these benefits through extensive experiments and extend our method to fine-grained classification and image segmentation tasks. Our implementation is available here.


# 8
CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing

Haibo Jin · Ruoxi Chen · Jinyin Chen · Haibin Zheng · Yang Zhang · Haohan Wang

The success of deep neural networks (DNNs) in real-world applications has benefited from abundant pre-trained models. However, backdoored pre-trained models can pose a significant trojan threat to the deployment of downstream DNNs. Numerous backdoor detection methods have been proposed but are limited in two aspects: (1) high sensitivity to trigger size, especially for stealthy attacks (i.e., blending attacks and defense-adaptive attacks); and (2) heavy reliance on benign examples for reverse engineering. To address these challenges, we empirically observed that trojaned behaviors triggered by various trojan attacks can be attributed to the trojan path, composed of the top-k critical neurons with the most significant contributions to model prediction changes. Motivated by this, we propose CatchBackdoor, a detection method against trojan attacks. Based on the close connection between trojaned behaviors and the trojan path in triggering errors, CatchBackdoor starts from the benign path and gradually approximates the trojan path through differential fuzzing. We then reverse triggers from the trojan path to trigger errors caused by diverse trojan attacks. Extensive experiments on MNIST, CIFAR-10, and a-ImageNet datasets and 7 models (LeNet, ResNet, and VGG) demonstrate the superiority of CatchBackdoor over state-of-the-art methods in terms of being (1) effective: it shows better detection performance, especially on stealthy attacks (~2x on average); and (2) extensible: it is robust to trigger size and can conduct detection without benign examples.