

Poster Session

Poster Session 1

Exhibition Area
Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT


# 119
Strong Double Blind
Bi-directional Contextual Attention for 3D Dense Captioning

Minjung Kim · Hyung Suk Lim · Soonyoung Lee · Bumsoo Kim · Gunhee Kim

3D dense captioning is a task to localize objects and generate descriptions for each object in a 3D scene. Recent approaches in 3D dense captioning have attempted to incorporate contextual information by modeling relationships with object pairs or aggregating nearest neighbor features of an object. However, the contextual information constructed in these scenarios is limited in two aspects: first, objects have multiple positional relationships that exist across the entire global scene (not only near the object itself), and second, the task faces contradicting objectives: localization and attribute descriptions are generated better with tight localization, while descriptions involving global positional relations are generated better with contextualized features of the global scene. To overcome this challenge, we introduce CSI, a transformer encoder-decoder pipeline that engages in 3D dense captioning for each object with contextualized geometries (where the structural contexts relevant to each object are summarized) and contextualized objects (where the objects relevant to the summarized structural contexts are aggregated). This simple extension relieves previous methods from the contradicting objectives, enhancing localization performance while enabling the aggregation of contextual features throughout the global scene, thus improving caption generation performance simultaneously. Extensive experiments on two of the most widely-used 3D dense captioning datasets (ScanRefer and Nr3D) demonstrate that our proposed method achieves a significant improvement over prior methods.


# 75
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

Zuyao Chen · Jinlin Wu · Zhen Lei · Zhaoxiang Zhang · Chang Wen Chen

Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting their ability to recognize only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on the node and edge categories: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture, which learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pre-training utilizing image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework.


# 145
ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting

Michael A Hobley · Victor Adrian Prisacariu

Class-agnostic counting methods enumerate objects of an arbitrary class, providing tremendous utility in many fields. Prior works have limited usefulness as they require either a set of examples of the type to be counted or that the query image contains only a single type of object. A significant factor in these shortcomings is the lack of a dataset to properly address counting in settings with more than one kind of object present. To address these issues, we propose the first Multi-class, Class-Agnostic Counting dataset (MCAC) and A Blind Counter (ABC123), a method that can count multiple types of objects simultaneously without using examples of the target type during training or inference. ABC123 introduces a new paradigm where instead of requiring exemplars to guide the enumeration, examples are found after the counting stage to help a user understand the generated outputs. We show that ABC123 outperforms contemporary methods on MCAC without needing human-in-the-loop annotations. We also show that this performance transfers to FSC-147, the standard class-agnostic counting dataset. MCAC is available at MCAC.active.vision and ABC123 is available at ABC123.active.vision.


# 158
Towards Scene Graph Anticipation

Rohith Peddi · Saksham Singh · Saurabh . · Parag Singla · Vibhav Gogate

Spatio-temporal scene graphs represent interactions in a video by decomposing scenes into individual objects and their pair-wise temporal relationships. Long-term anticipation of the fine-grained pair-wise relationships between objects is a challenging problem. To this end, we introduce the task of Scene Graph Anticipation (SGA). We adopt state-of-the-art scene graph generation methods as baselines to anticipate future pair-wise relationships between objects. In our proposed approaches, we leverage object-centric representations of relationships to reason about the observed video frames and model the evolution of relationships between objects. We take a continuous-time perspective and model the latent dynamics of the evolution of object interactions using concepts from NeuralODEs and NeuralSDEs. We infer representations of future relationships by solving an Ordinary Differential Equation and a Stochastic Differential Equation, respectively. Extensive experimentation on the Action Genome dataset validates the efficacy of the proposed methods.
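The continuous-time modeling above can be illustrated with a toy latent-dynamics model: a small network predicts the time derivative of each relationship embedding, and future embeddings are obtained by integrating it forward. The sketch below uses a fixed-step Euler integrator in PyTorch rather than a full ODE/SDE solver, and all module names and sizes are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class RelationshipDynamics(nn.Module):
    """Illustrative latent ODE: f_theta(z, t) gives dz/dt for a relationship embedding."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim + 1, 128), nn.Tanh(), nn.Linear(128, dim))

    def forward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenate time so the dynamics can be time-dependent.
        t_col = t.expand(z.shape[0], 1)
        return self.f(torch.cat([z, t_col], dim=-1))

def euler_integrate(dynamics, z0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step Euler integration from t0 to t1 (a stand-in for an ODE solver)."""
    z, dt = z0, (t1 - t0) / steps
    for k in range(steps):
        t = torch.tensor([[t0 + k * dt]], dtype=z.dtype)
        z = z + dt * dynamics(z, t)
    return z

# Toy usage: anticipate embeddings of 5 object-pair relationships one time unit ahead.
dyn = RelationshipDynamics(dim=64)
z_observed = torch.randn(5, 64)          # embeddings inferred from observed frames
z_future = euler_integrate(dyn, z_observed)
print(z_future.shape)                    # torch.Size([5, 64])
```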


# 180
Strong Double Blind
OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation

Yuchen Che · Ryo Furukawa · Asako Kanezaki

The category-level articulated object pose estimation task focuses on the pose estimation of unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to accomplish this task. Our model consistently generates a canonical reconstruction with a canonical pose and joint state for the entire input object, and it estimates an object-level pose that reduces overall pose variance and part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach achieves state-of-the-art performance compared to other self-supervised methods and comparable performance to supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a novel real-world articulated object benchmark dataset.


# 146
Strong Double Blind
PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

Ananthu Aniraj · Cassio F. Dantas · Dino Ienco · Diego Marcos

Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts; they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as the self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models call for rethinking the geometric priors that can be used for unsupervised part discovery.
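The total variation prior mentioned above amounts to penalizing differences between neighbouring values of each part's attention map, which leaves parts free to be large or split into several connected components. A minimal sketch of such a prior (assumed tensor layout; not the paper's exact formulation):

```python
import torch

def total_variation(part_maps: torch.Tensor) -> torch.Tensor:
    """Anisotropic TV of per-part attention maps, shape (B, K, H, W).

    Penalises spatial jumps only, so parts may be large or split into
    several connected components, unlike compactness priors.
    """
    dh = (part_maps[..., 1:, :] - part_maps[..., :-1, :]).abs().mean()
    dw = (part_maps[..., :, 1:] - part_maps[..., :, :-1]).abs().mean()
    return dh + dw

maps = torch.rand(2, 8, 28, 28, requires_grad=True)   # 8 discovered parts
loss = total_variation(maps)
loss.backward()
```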


# 157
Strong Double Blind
H-V2X: A Large Scale Highway Dataset for BEV Perception

Chang Liu · MingXu zhu · Cong Ma

Vehicle-to-everything (V2X) technology has become an area of interest in research due to the availability of roadside infrastructure perception datasets. However, these datasets primarily focus on urban intersections and lack data on highway scenarios. Additionally, the perception tasks in these datasets are mainly monocular 3D due to limited synchronized data across multiple sensors. To bridge this gap, we propose Highway-V2X (H-V2X), the first large-scale highway Bird's-Eye-View (BEV) perception dataset captured by sensors in the real world. The dataset covers over 100 kilometers of highway, with a diverse range of road and weather conditions. H-V2X consists of over 1.9 million fine-grained categorized samples in BEV space, captured by multiple synchronized cameras, with vector maps provided. We performed joint 2D-3D calibrations to ensure correct projection, and human labor was involved to ensure data quality. Furthermore, we propose three highly relevant tasks for the highway scenario: BEV detection, BEV tracking, and trajectory prediction. We conducted benchmarks for each task, and innovative methods incorporating vector map information were proposed. We hope that H-V2X and the benchmark methods will facilitate research on highway BEV perception. The dataset and code will be released upon acceptance.


# 159
RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios

Wenhao Ding · Yulong Cao · DING ZHAO · Chaowei Xiao · Marco Pavone

Simulation plays a crucial role in the development of autonomous vehicles (AVs) due to the potential risks associated with real-world testing. Although significant progress has been made in the visual aspects of simulators, generating complex behavior among agents remains a formidable challenge. It is not only imperative to ensure realism in the generated scenarios but also essential to incorporate preferences and conditions to facilitate controllable generation for AV training and evaluation. Traditional methods, which rely mainly on memorizing the distribution of training datasets, often fail to generate unseen scenarios. Inspired by the success of retrieval-augmented generation in large language models, we present RealGen, a novel retrieval-based in-context learning framework for traffic scenario generation. RealGen synthesizes new scenarios by combining behaviors from multiple retrieved examples in a gradient-free way, which may originate from templates or tagged scenarios. This in-context learning framework endows the model with versatile generative capabilities, including the ability to edit scenarios, compose various behaviors, and produce critical scenarios. Evaluations show that RealGen offers considerable flexibility and controllability, marking a new direction in the field of controllable traffic scenario generation. Check our project website for more information: https://realgen.github.io/.


# 160
DriveLM: Driving with Graph Visual Question Answering

Chonghao Sima · Katrin Renz · Kashyap Chitta · Li Chen · Zhang Hanxue · Chengen Xie · Jens Beißwenger · Ping Luo · Andreas Geiger · Hongyang Li

We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. While recent approaches adapt VLMs to driving via single-round visual question answering (VQA), human drivers reason about decisions in multiple steps. Starting from the localization of key objects, humans estimate object interactions before taking actions. The key insight is that with our proposed task, Graph VQA, where we model graph-structured reasoning through perception, prediction and planning question-answer pairs, we obtain a suitable proxy task to mimic the human reasoning process. We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent baseline performs end-to-end autonomous driving competitively in comparison to state-of-the-art driving-specific architectures. Notably, its benefits are pronounced when it is evaluated zero-shot on unseen sensor configurations. Our question-wise ablation study shows that the performance gain comes from the rich annotation of prediction and planning QA pairs in the graph structure. To facilitate future research, all code, data, models and an official evaluation server are available to the public.


# 161
Strong Double Blind
Making Large Language Models Better Planners with Reasoning-Decision Alignment

Zhijian Huang · Tao Tang · Shaoxiang Chen · Sihao Lin · Zequn Jie · Lin Ma · Guangrun Wang · Xiaodan Liang

Data-driven approaches for autonomous driving (AD) have been widely adopted in the past decade but are confronted with dataset bias and uninterpretability. Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios. They find that the pretraining-finetuning paradigm of LLMs on downstream data with the Chain-of-Thought (CoT) reasoning process can enhance explainability and scene understanding. However, this popular strategy suffers from the notorious problem of misalignment between the crafted CoTs and the consequent decision-making, which remains unaddressed by previous LLM-based AD methods. To address this problem, we propose an end-to-end decision-making model based on a multimodality-augmented LLM, which simultaneously executes CoT reasoning and carries out planning. Furthermore, we propose a reasoning-decision alignment constraint between the paired CoTs and planning results, imposing the correspondence between reasoning and decision-making. Moreover, we redesign the CoTs to enable the model to comprehend complex scenarios and enhance decision-making performance. We dub our proposed large language planners with reasoning-decision alignment as RDA-Driver. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of our RDA-Driver in enhancing the performance of end-to-end autonomous driving systems. Specifically, RDA-Driver achieves state-of-the-art end-to-end planning performance on the nuScenes dataset with 0.80 L2 error and a 0.32 collision rate, and also achieves leading results on the challenging DriveLM-nuScenes benchmark with 0.82 L2 error and a 0.38 collision rate.


# 153
M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation

Yingshuang Zou · Yikang Ding · Xi Qiu · Haoqian Wang · Haotian Zhang

This paper presents a novel self-supervised two-frame multi-camera metric depth estimation network, termed M^2Depth, which is designed to predict reliable scale-aware surrounding depth in autonomous driving. Unlike previous works that use multi-view images from a single time-step or multiple time-step images from a single camera, M^2Depth takes temporally adjacent two-frame images from multiple cameras as inputs and produces high-quality surrounding depth. We first construct cost volumes in the spatial and temporal domains individually and propose a spatial-temporal fusion module that integrates the spatial-temporal information to yield a strong volume representation. We additionally combine the neural prior from SAM features with internal features to reduce the ambiguity between foreground and background and strengthen the depth edges. Extensive experimental results on the nuScenes and DDAD benchmarks show M^2Depth achieves state-of-the-art performance. The code will be made publicly available.


# 155
MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping

Jiacheng Chen · Yuefan Wu · Tan Jiaqi · Hang Ma · Yasutaka Furukawa

This paper presents a vector HD-mapping algorithm that formulates the mapping as a tracking task and uses a history of memory latents to ensure consistent reconstructions over time. Our method, MapTracker, accumulates a sensor stream into memory buffers of two latent representations: 1) raster latents in the bird's-eye-view (BEV) space and 2) vector latents over the road elements (i.e., road boundaries, dividers, and pedestrian zones). The approach borrows the query propagation paradigm from the tracking literature that explicitly associates tracked road elements from the previous frame to the current one, while fusing a subset of memory latents selected with distance strides to further enhance temporal consistency. A vector latent is decoded to reconstruct the geometry of a road element. The paper further makes benchmark contributions by 1) improving processing code for existing datasets to produce consistent ground truth with temporal alignments and 2) augmenting existing mAP metrics with consistency checks. MapTracker significantly outperforms existing methods on both the nuScenes and Argoverse 2 datasets by over 8% and 19% on the conventional and the new consistency-aware metrics, respectively. The code and augmented benchmarks will be available.
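The strided selection of memory latents can be pictured with a few lines: from a buffer of per-frame latents, frames are picked at increasing distances behind the current frame before fusion. The stride schedule below is purely illustrative, not the one used by MapTracker.

```python
def select_strided_memory(buffer_len: int, strides=(1, 2, 4, 8, 16)) -> list[int]:
    """Return indices into a memory buffer (0 = oldest, buffer_len - 1 = newest)
    at increasing distances behind the current frame."""
    newest = buffer_len - 1
    picked = []
    for s in strides:
        idx = newest - s
        if idx >= 0 and idx not in picked:
            picked.append(idx)
    return picked

# With 20 frames in memory, fuse latents from frames 18, 17, 15, 11 and 3.
print(select_strided_memory(20))
```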


# 154
Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction

Alexander Timans · Christoph-Nikolas Straehle · Kaspar Sakmann · Eric Nalisnick

Quantifying a model’s predictive uncertainty is essential for safety-critical applications such as autonomous driving. We consider quantifying such uncertainty for multi-object detection. In particular, we leverage conformal prediction to obtain uncertainty intervals with guaranteed coverage for object bounding boxes. One challenge in doing so is that bounding box predictions are conditioned on the object's class label. Thus, we develop a novel two-step conformal approach that propagates uncertainty in predicted class labels into the uncertainty intervals for the bounding boxes. This broadens the validity of our conformal coverage guarantees to include incorrectly classified objects, ensuring their usefulness when maximal safety assurances are required. Moreover, we investigate novel ensemble and quantile regression formulations to ensure the bounding box intervals are adaptive to object size, leading to a more balanced coverage across sizes. Validating our two-step approach on real-world datasets for 2D bounding box localization, we find that desired coverage levels are satisfied with actionably tight predictive uncertainty intervals.
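The conformal step underlying this approach can be sketched with the generic split-conformal recipe: compute a nonconformity score (here, absolute box-coordinate residuals) on a calibration set, take the finite-sample-corrected quantile, and expand every new prediction by that margin. This is a plain one-step sketch, not the paper's two-step procedure that also propagates class uncertainty.

```python
import numpy as np

def conformal_box_margin(pred_cal: np.ndarray, true_cal: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Split conformal prediction for box coordinates.

    pred_cal, true_cal: (n, 4) predicted and ground-truth [x1, y1, x2, y2]
    on a calibration set. Returns a per-coordinate margin such that
    [pred - margin, pred + margin] covers the truth with probability
    >= 1 - alpha (marginally, under exchangeability).
    """
    n = len(pred_cal)
    scores = np.abs(pred_cal - true_cal)                  # nonconformity scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n          # finite-sample correction
    return np.quantile(scores, min(q_level, 1.0), axis=0)

# Toy usage.
rng = np.random.default_rng(0)
pred = rng.normal(size=(500, 4)) * 5 + 100
truth = pred + rng.normal(size=(500, 4))
margin = conformal_box_margin(pred, truth, alpha=0.1)
new_pred = np.array([120.0, 80.0, 180.0, 160.0])
lower, upper = new_pred - margin, new_pred + margin
```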


# 148
Strong Double Blind
A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging

Miao Cao · Lishun Wang · Huan Wang · Xin Yuan

Video Snapshot Compressive Imaging (SCI) aims to use a low-speed 2D camera to capture a high-speed scene as snapshot compressed measurements, followed by a reconstruction algorithm to reconstruct the high-speed video frames. State-of-the-art (SOTA) deep learning-based algorithms have achieved impressive performance, yet with a heavy computational workload. Network quantization is a promising way to reduce computational cost. However, a direct low-bit quantization will bring a large performance drop. To address this challenge, in this paper, we propose a simple low-bit quantization framework (dubbed Q-SCI) for end-to-end deep learning-based video SCI reconstruction methods, which usually consist of feature extraction, feature enhancement, and video reconstruction modules. Specifically, we first design a high-quality feature extraction module and a precise video reconstruction module to extract and propagate high-quality features in the low-bit quantized model. In addition, to alleviate the information distortion of the Transformer branch in the quantized feature enhancement module, we introduce a shift operation on the query and key distributions to further bridge the performance gap. Comprehensive experimental results manifest that our Q-SCI framework achieves superior performance, e.g., Q-SCI (4-bit) can theoretically accelerate the previous SOTA real-valued EfficientSCI-S by 7.8× with only a 2.3% performance gap on the simulation testing datasets. The code of this paper will be released.


# 150
Strong Double Blind
Photon Inhibition for Energy-Efficient Single-Photon Imaging

Lucas Koerner · Shantanu Gupta · Atul N Ingle · Mohit Gupta

Single-photon cameras (SPCs) are emerging as sensors of choice for various challenging imaging applications. One class of SPCs based on the single-photon avalanche diode (SPAD) detects individual photons using an avalanche process; the raw photon data can then be processed to extract scene information under extremely low light, high dynamic range, and rapid motion. Yet, single-photon sensitivity in SPADs comes at a cost: each photon detection consumes more energy than that of a CMOS camera. This avalanche power significantly limits sensor resolution and could restrict widespread adoption of SPAD-based SPCs. We propose a computational-imaging approach called photon inhibition to address this challenge. Photon inhibition strategically allocates detections in space and time based on downstream inference task goals and resource constraints. We develop lightweight, on-sensor computational inhibition policies that use past photon data to disable SPAD pixels in real time, to select the most informative future photons. As case studies, we design policies tailored for image reconstruction and edge detection, and demonstrate, both via simulations and real SPC-captured data, considerable reduction in photon detections (over 90% of photons) while maintaining task performance metrics. Our work raises the question of "which photons should be detected?" and paves the way for future energy-efficient single-photon imaging.
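A toy inhibition policy conveys the flavour of the idea: pixels whose recent photon count already exceeds a budget are disabled for the next window, saving avalanche energy where the signal is well sampled. This threshold rule is only an illustration, not one of the learned policies in the paper.

```python
import numpy as np

def inhibition_mask(photon_counts: np.ndarray, budget: int = 8) -> np.ndarray:
    """Toy on-sensor inhibition policy (illustrative, not the paper's learned policies).

    photon_counts: (H, W) photons each SPAD pixel detected in the recent window.
    Pixels that already exceeded the budget are disabled for the next window.
    """
    return photon_counts < budget   # True = keep enabled, False = inhibit

counts = np.random.default_rng(1).poisson(lam=6.0, size=(32, 32))
enabled = inhibition_mask(counts, budget=8)
print(f"{(~enabled).mean():.0%} of pixels inhibited next window")
```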


# 149
Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging

Zongliang Wu · Ruiying Lu · Ying Fu · Xin Yuan

Snapshot compressive spectral imaging reconstruction aims to reconstruct three-dimensional spatial-spectral images from a single-shot two-dimensional compressed measurement. Existing state-of-the-art methods are mostly based on deep unfolding structures but have intrinsic performance bottlenecks: i) the ill-posed problem of dealing with heavily degraded measurements, and ii) the regression loss-based reconstruction models being prone to recover images with few details. In this paper, we introduce a generative model, namely the latent diffusion model (LDM), to generate a degradation-free prior that enhances the regression-based deep unfolding method via a two-stage training procedure. Furthermore, we propose a Trident Transformer (TT), which extracts correlations among prior knowledge, spatial, and spectral features, to integrate knowledge priors into the deep unfolding denoiser and guide the reconstruction to compensate for high-quality spectral signal details. To our knowledge, this is the first approach to integrate physics-driven deep unfolding with a generative LDM in the context of CASSI reconstruction. Numeric and visual comparisons on synthetic and real-world datasets illustrate the superiority of our proposed method in both reconstruction quality and computational efficiency. Code will be released.


# 151
Minimalist Vision with Freeform Pixels

Jeremy Klotz · Shree Nayar

A minimalist vision system uses the smallest number of pixels needed to solve a vision task. While traditional cameras use a large grid of square pixels, a minimalist camera uses freeform pixels that can take on arbitrary shapes to increase their information content. We show that the hardware of a minimalist camera can be modeled as the first layer of a neural network, where the subsequent layers are used for inference. Training the network for any given task yields the shapes of the camera's freeform pixels, each of which is implemented using a photodetector and an optical mask. We have designed minimalist cameras for monitoring indoor spaces (with 8 pixels), measuring room lighting (with 8 pixels), and estimating traffic flow (with 8 pixels). The performance demonstrated by these systems is on par with a traditional camera with orders of magnitude more pixels. Minimalist vision has two major advantages. First, it naturally tends to preserve the privacy of individuals in the scene since the captured information is inadequate for extracting visual details. Second, since the number of measurements made by a minimalist camera is very small, we show that it can be fully self-powered, i.e., function without an external power supply or a battery.
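Modeling the camera hardware as the first layer of a network can be sketched directly: a single linear layer maps the scene to 8 measurements (its weight rows playing the role of the freeform pixel masks), and a small MLP performs the downstream inference. Sizes and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MinimalistCamera(nn.Module):
    """First layer = 8 'freeform pixels' (learned masks over the scene);
    remaining layers = task inference. Only the first layer would be realised in optics."""
    def __init__(self, h: int = 64, w: int = 64, n_pixels: int = 8, n_outputs: int = 4):
        super().__init__()
        # Each weight row acts as one pixel's mask (a physical optical mask
        # would additionally be constrained to be non-negative).
        self.masks = nn.Linear(h * w, n_pixels, bias=False)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(n_pixels, 32), nn.ReLU(),
                                  nn.Linear(32, n_outputs))

    def forward(self, scene: torch.Tensor) -> torch.Tensor:
        measurements = self.masks(scene.flatten(1))   # what the hardware records
        return self.head(measurements)

model = MinimalistCamera()
scene = torch.rand(16, 64, 64)    # batch of grayscale scenes
pred = model(scene)               # e.g. 4 traffic-flow counts per scene
```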


# 152
SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow

Yihan Wang · Lahav Lipson · Jia Deng

We introduce RAFT2, a faster, simpler, and more accurate RAFT for optical flow. Compared with RAFT, RAFT2 is supervised with a mixture-of-Laplace loss. It directly regresses an initial flow for faster convergence in recurrent refinements and introduces stereo pretraining to improve generalization. RAFT2 achieves state-of-the-art results on the Spring benchmark with 3.69 end-point-error (EPE) and a 0.36 1-pixel outlier rate (1px), representing 22.9% and 17.8% error reductions from the best published results. In addition, RAFT2 obtains the best cross-dataset generalization on KITTI (train) and Spring (train). With its high efficiency, RAFT2 operates at least 2.3x faster than mainstream methods while maintaining competitive performance, advancing the state of recurrent refinement frameworks in optical flow estimation.
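A mixture-of-Laplace supervision can be written as the negative log-likelihood of the ground-truth flow under a K-component Laplace mixture predicted per pixel. The sketch below is a generic version of such a loss, not SEA-RAFT's exact prediction head or weighting.

```python
import torch

def mixture_laplace_nll(flow_pred, log_b, logits, flow_gt):
    """Negative log-likelihood of the ground-truth flow under a Laplace mixture.

    flow_pred: (B, K, 2, H, W) per-component flow means
    log_b:     (B, K, 2, H, W) per-component log scales
    logits:    (B, K, H, W)    mixture weights (pre-softmax)
    flow_gt:   (B, 2, H, W)    ground-truth flow
    Generic sketch of a Laplace-mixture loss, not SEA-RAFT's exact head.
    """
    err = (flow_gt.unsqueeze(1) - flow_pred).abs()                          # (B, K, 2, H, W)
    # log Laplace density: -|x - mu| / b - log(2b), summed over the 2 flow channels
    log_p = (-err / log_b.exp() - log_b - torch.log(torch.tensor(2.0))).sum(dim=2)
    log_w = torch.log_softmax(logits, dim=1)
    return -torch.logsumexp(log_w + log_p, dim=1).mean()

B, K, H, W = 2, 2, 32, 32
loss = mixture_laplace_nll(torch.randn(B, K, 2, H, W), torch.zeros(B, K, 2, H, W),
                           torch.zeros(B, K, H, W), torch.randn(B, 2, H, W))
```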


# 147
Strong Double Blind
Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection

Xinhao Luo · Man Yao · Yuhong Chou · Bo Xu · Guoqi Li

Brain-inspired Spiking Neural Networks (SNNs) have bio-plausibility and low-power advantages over Artificial Neural Networks (ANNs). Applications of SNNs are currently limited to simple classification tasks because of their poor performance. In this work, we focus on bridging the performance gap between ANNs and SNNs on object detection. Our design revolves around the network architecture and the spiking neuron. First, overly complex module design causes spike degradation when the YOLO series is converted to the corresponding spiking version. We design a SpikeYOLO architecture to solve this problem by simplifying the vanilla YOLO and incorporating meta SNN blocks. Second, object detection is more sensitive to quantization errors in the conversion of membrane potentials into binary spikes by spiking neurons. To address this challenge, we design a new spiking neuron that activates integer values during training while maintaining spike-driven computation by extending the virtual timestep during inference. The proposed method is validated on both static and neuromorphic object detection datasets. On the static COCO dataset, we obtain 66.2% mAP@50 and 48.9% mAP@50:95, which is +15.0% and +18.7% higher than the prior state-of-the-art SNN, respectively. On the neuromorphic Gen1 dataset, we achieve 67.2% mAP@50, which is +8.2% and +2.5% greater than the existing best SNN model and an ANN with equivalent architecture, respectively, and the energy efficiency is improved by 5.7 times.
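The integer-valued-training idea can be sketched as a neuron that rounds its activation to an integer in [0, D] with a straight-through estimator, and at inference expands that integer into binary spikes over D virtual timesteps. This is an illustrative toy, not the paper's exact neuron design.

```python
import torch

class IntegerSpikeNeuron(torch.nn.Module):
    """Sketch: train with integer activations in [0, D] (straight-through rounding);
    at inference an integer v can be emitted as v binary spikes over D virtual
    timesteps, keeping computation spike-driven. Illustrative only."""
    def __init__(self, max_int: int = 4):
        super().__init__()
        self.D = max_int

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clipped = x.clamp(0, self.D)
        # Round in the forward pass, identity gradient in the backward pass (STE).
        return clipped + (clipped.round() - clipped).detach()

    def to_spikes(self, v: torch.Tensor) -> torch.Tensor:
        """Expand integer activations into binary spikes over D virtual timesteps."""
        steps = torch.arange(self.D, device=v.device)
        return (v.unsqueeze(-1) > steps).float()    # (..., D) in {0, 1}

neuron = IntegerSpikeNeuron(max_int=4)
a = neuron(torch.randn(3, 5) * 3)    # integer-valued activations, differentiable
spikes = neuron.to_spikes(a)         # binary spike trains; spikes.sum(-1) == a
```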


# 183
Strong Double Blind
OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

Akshay Krishnan · Abhijit Kundu · Kevis Maninis · James Hays · Matthew Brown

We propose OmniNOCS: a large-scale monocular dataset with 3D Normalized Object Coordinate Space (NOCS), object masks, and 3D bounding box annotations for indoor and outdoor scenes. OmniNOCS has 20 times more object classes and 200 times more instances than existing NOCS datasets (NOCS-Real275, Wild6D). We use OmniNOCS to train a novel, transformer-based monocular NOCS prediction model (NOCSformer) that can predict accurate NOCS, instance masks and poses from 2D object detections across diverse classes. It is the first NOCS model that can generalize to a broad range of classes when prompted with 2D boxes. We evaluate our model on the task of 3D oriented bounding box prediction, where it achieves comparable results to state-of-the-art 3D detection methods such as Cube R-CNN. Unlike other 3D detection methods, our model also provides detailed and accurate 3D object shape and segmentation. We propose a novel benchmark for the task of NOCS prediction based on OmniNOCS, which we hope will serve as a useful baseline for future work in this area. Our dataset and code are at the project website: https://omninocs.github.io


# 198
Strong Double Blind
UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Xiangyu Fan · Jiaqi Li · Zhiqian Lin · Weiye Xiao · Lei Yang

Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, limitations arise from inconsistent 3D annotations, restricting previous models to training on specific annotations and thereby constraining the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies, namely PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated datasets. These datasets contain a wide range of audio domains, covering multilingual speech voices and songs, thereby scaling the training data from commonly employed datasets, typically less than 1 hour, to 18.5 hours. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% for the BIWI dataset and 13.7% for VOCASET. Additionally, the pre-trained UniTalker exhibits promise as a foundation model for audio-driven facial animation tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances performance on each dataset, with an average error reduction of 6.3% on A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. We will release our code and dataset.


# 206
Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture

Xuanchen Li · Yuhao Cheng · Xingyu Ren · Haozhe Jia · Di Xu · Wenhan Zhu · Yichao Yan

4D head capture aims to generate dynamic topological meshes and corresponding texture maps from videos, which is widely utilized in movies and games for its ability to simulate facial muscle movements and recover dynamic textures in pore-squeezing. The industry often adopts the method involving multi-view stereo and non-rigid alignment. However, this approach is prone to errors and heavily reliant on time-consuming manual processing by artists. To simplify this process, we propose Topo4D, a novel framework for automatic geometry and texture generation, which optimizes densely aligned 4D heads and 8K texture maps directly from calibrated multi-view time-series images. Specifically, we first represent the time-series faces as a set of dynamic 3D Gaussians with fixed topology in which the Gaussian centers are bound to the mesh vertices. Afterward, we perform alternating geometry and texture optimization frame-by-frame for high-quality geometry and texture learning while maintaining temporal topology stability. Finally, we can extract dynamic facial meshes in a regular wiring arrangement and high-fidelity textures with pore-level details from the learned Gaussians. Extensive experiments show that our method achieves superior results compared to the current SOTA face reconstruction methods both in the quality of meshes and textures. The code will be released publicly.


# 201
HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting

Zhenglin Zhou · Fan Ma · Hehe Fan · Zongxin Yang · Yi Yang

Creating digital avatars from textual prompts has long been a desirable yet challenging task. Despite the promising results achieved with 2D diffusion priors, current methods struggle to create high-quality and consistent animated avatars efficiently. Previous animatable head models like FLAME have difficulty in accurately representing detailed texture and geometry. Additionally, high-quality 3D static representations face challenges in dynamic driving with dynamic priors. In this paper, we introduce HeadStudio, a novel framework that utilizes 3D Gaussian splatting to generate realistic and animatable avatars from text prompts. Firstly, we associate 3D Gaussians with FLAME mesh priors, facilitating semantic animation on high-quality 3D static representations. To ensure consistent animation, we further introduce fine-grained landmark-based conditions, which are obtained from a head prior model to regularize consistency in animation-based training. Extensive experiments demonstrate the efficacy of HeadStudio in generating animatable avatars from textual prompts, exhibiting appealing appearances. The avatars are capable of rendering high-quality real-time (≥ 40 fps) novel views at a resolution of 1024. Moreover, these avatars can be smoothly driven by real-world speech and video. We hope that HeadStudio can enhance digital avatar creation and gain popularity in the community.


# 210
MagicMirror: Fast and High-Quality Avatar Generation with Constrained Search Space

Armand Comas Massague · Di Qiu · Menglei Chai · Marcel C. Bühler · Amit Raj · Ruiqi Gao · Qiangeng Xu · Mark J Matthews · Paulo Gotardo · Sergio Orts Escolano · Thabo Beeler

We introduce a novel framework for 3D human avatar generation and personalization, leveraging text prompts to enhance user engagement and customization. Central to our approach are key innovations aimed at overcoming the challenges in photo-realistic avatar synthesis. Firstly, we utilize a conditional Neural Radiance Fields (NeRF) model, trained on a large-scale unannotated multi-view dataset, to create a versatile initial solution space that accelerates and diversifies avatar generation. Secondly, we develop a geometric prior, leveraging the capabilities of Text-to-Image Diffusion Models, to ensure superior view invariance and enable direct optimization of avatar geometry. These foundational ideas are complemented by our optimization pipeline built on Variational Score Distillation (VSD), which mitigates texture loss and over-saturation issues. As supported by our extensive experiments, these strategies collectively enable the creation of custom avatars with unparalleled visual quality and better adherence to input text prompts.


# 247
Personalized Video Relighting With an At-Home Light Stage

Jun Myeong Choi · Max Christman · Roni Sengupta

In this paper, we develop a personalized video relighting algorithm that produces high-quality and temporally consistent relit videos under any pose, expression, and lighting condition in real time. Existing relighting algorithms typically rely either on publicly available synthetic data, which yields poor relighting results, or on actual light stage data, which is difficult to acquire. We show that by just capturing recordings of a user watching YouTube videos on a monitor we can train a personalized algorithm capable of performing high-quality relighting under any condition. Our key contribution is a novel image-based neural relighting architecture that effectively separates the intrinsic appearance features - the geometry and reflectance of the face - from the source lighting and then combines them with the target lighting to generate a relit image. This neural architecture enables smoothing of intrinsic appearance features, leading to temporally stable video relighting. Both qualitative and quantitative evaluations show that our architecture improves portrait image relighting quality and temporal consistency over state-of-the-art approaches on both the casually captured 'Light Stage at Your Desk' (LSYD) and the light-stage-captured 'One Light At a Time' (OLAT) datasets.


# 254
Strong Double Blind
Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations

Tomáš Chobola · Yu Liu · Hanyi Zhang · Julia A Schnabel · Tingying Peng

Current deep learning-based low-light image enhancement methods often struggle with high-resolution images, and fail to meet the practical demands of visual perception across diverse and unseen scenarios. In this paper, we introduce a novel approach termed CoLIE, which redefines the enhancement process through mapping the 2D coordinates of an underexposed image to its illumination component, conditioned on local context. We propose a reconstruction of enhanced-light images within the HSV space utilizing an implicit neural function combined with an embedded guided filter, thereby significantly reducing computational overhead. Moreover, we introduce a single image-based training loss function to enhance the model's adaptability to various scenes, further enhancing its practical applicability. Through rigorous evaluations, we analyze the properties of our proposed framework, demonstrating its superiority in both image quality and scene adaptability. Furthermore, our evaluation extends to applications in downstream tasks within low-light scenarios, underscoring the practical utility of CoLIE.
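The core mapping can be sketched as a small implicit network from pixel coordinates (plus a local context value) to an illumination estimate, which is then divided out of the HSV value channel in retinex style. The architecture, inputs, and enhancement rule below are simplified assumptions, not CoLIE's exact design.

```python
import torch
import torch.nn as nn

class IlluminationField(nn.Module):
    """Implicit function mapping pixel coordinates (plus a local context value)
    to an illumination estimate. A simplified, retinex-style sketch of the idea,
    not CoLIE's exact architecture or loss."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, coords: torch.Tensor, local_v: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) in [0, 1]; local_v: (N, 1) value channel at those pixels
        return self.mlp(torch.cat([coords, local_v], dim=-1)).clamp_min(1e-3)

# Enhance the V channel of an HSV image by dividing out the estimated illumination.
H, W = 64, 64
v = torch.rand(H * W, 1) * 0.3                     # underexposed value channel
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)
illum = IlluminationField()(coords, v)             # untrained; training loop omitted
v_enhanced = (v / illum).clamp(0, 1)
```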


# 277
Strong Double Blind
Panel-Specific Degradation Representation for Raw Under-Display Camera Image Restoration

Youngjin Oh · Keuntek Lee · Jooyoung Lee · Dae-Hyun Lee · Nam Ik Cho

Under-display camera (UDC) image restoration aims to restore images distorted by the OLED display panel covering the frontal camera on a smartphone. Previous deep learning-based UDC restoration methods focused on restoring the image within the RGB domain with the collection of real or synthetic RGB datasets. However, UDC images in these datasets exhibit domain differences from real commercial smartphone UDC images while inherently constraining the problem and solution within the RGB domain. To address this issue, we collect well-aligned sensor-level real UDC images using panels from two commercial smartphones equipped with UDC. We also propose a new UDC restoration method to exploit the disparities between degradations caused by different panels, considering that UDC degradations are specific to the type of OLED panel. For this purpose, we train an encoder with an unsupervised learning scheme using triplet loss that aims to extract the inherent degradations caused by different panels from degraded UDC images as implicit representations. The learned panel-specific degradation representations are then provided as priors to our restoration network based on an efficient Transformer network. Extensive experiments show that our proposed method achieves state-of-the-art performance on our real raw image dataset and generalizes well to previous datasets. Our dataset and code will be released.
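The unsupervised triplet objective is straightforward to sketch: two degraded crops from the same panel should embed closer together than a crop taken from a different panel. The encoder below is a stand-in, not the paper's network.

```python
import torch
import torch.nn as nn

# Sketch of the triplet objective for panel-specific degradation embeddings.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 32))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.rand(8, 3, 64, 64)   # degraded crops from panel A
positive = torch.rand(8, 3, 64, 64)   # other crops from panel A
negative = torch.rand(8, 3, 64, 64)   # crops from panel B
loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```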


# 261
Strong Double Blind
HoloADMM: High-Quality Holographic Complex Field Recovery

Mazen Mel · Paul Springer · Pietro Zanuttigh · Haitao Zhou · Alexander Gatto

Holography enables intriguing microscopic imaging modalities, particularly through Quantitative Phase Imaging (QPI), which utilizes the phase of coherent light as a way to reveal the contrast in transparent and thin microscopic specimens. Despite the limitation of image sensors, which detect only light intensity, phase information can still be recorded within a two-dimensional interference pattern between two distinct light waves. Numerical reconstruction is later needed to retrieve the amplitude and phase from such holographic measurements. To this end, we introduce HoloADMM, a novel interpretable, learning-based approach for in-line holographic image reconstruction. HoloADMM enhances imaging capability with spatial image super-resolution, offering a versatile framework that accommodates multiple illumination wavelengths and supports extensive refocusing ranges with up to 10 µm precision. Our results indicate a substantial improvement in reconstruction quality over existing methods and demonstrate HoloADMM's effective adaptation to real holographic data captured by our Digital in-line Holographic Microscope (DIHM). This work not only advances holographic imaging techniques but also broadens the potential for non-invasive microscopic analysis applications.


# 246
Flying with Photons: Rendering Novel Views of Propagating Light

Anagh Malik · Noah Juravsky · Ryan Po · Gordon Wetzstein · Kyros Kutulakos · David Lindell

We present an imaging and neural rendering technique that seeks to synthesize videos of light propagating through a scene from novel, moving camera viewpoints. Our approach relies on a new ultrafast imaging setup to capture a first-of-its-kind, multi-viewpoint video dataset with picosecond-level temporal resolution. Combined with this dataset, we introduce an efficient neural volume rendering framework based on the transient field. This field is defined as a mapping from a 3D point and 2D direction to a high-dimensional, discrete-time signal that represents time-varying radiance at ultrafast timescales. Rendering with transient fields naturally accounts for effects due to the finite speed of light, including viewpoint-dependent appearance changes caused by light propagation delays to the camera. We render a range of complex effects, including scattering, specular reflection, refraction, and diffraction. Additionally, we demonstrate removing viewpoint-dependent propagation delays using a time warping procedure, rendering of relativistic effects, and video synthesis of direct and global components of light transport.


# 281
Strong Double Blind
Efficient Depth-Guided Urban View Synthesis

sheng miao · Jiaxin Huang · Dongfeng Bai · Weichao Qiu · Liu Bingbing · Andreas Geiger · Yiyi Liao

Recent advances in implicit scene representation enable high-fidelity street view novel view synthesis. However, existing methods optimize a neural radiance field for each scene, relying heavily on dense training images and extensive computational resources. To mitigate this shortcoming, we introduce a new method called Efficient Depth-Guided Urban View Synthesis (EDUS) for fast feed-forward inference and efficient per-scene fine-tuning. Different from prior generalizable methods that infer geometry based on feature matching, EDUS leverages noisy predicted geometric priors as guidance to enable generalizable urban view synthesis from sparse input images. The geometric priors allow us to apply our generalizable model directly in the 3D space, gaining robustness across various sparsity levels. Through comprehensive experiments on the KITTI-360 and Waymo datasets, we demonstrate promising generalization abilities on novel street scenes. Moreover, our results indicate that EDUS achieves state-of-the-art performance in sparse view settings when combined with fast test-time optimization.


# 279
Strong Double Blind
Ray-Distance Volume Rendering for Neural Scene Reconstruction

Ruihong Yin · Yunlu Chen · Sezer Karaoglu · Theo Gevers

Existing methods in neural scene reconstruction utilize the Signed Distance Function (SDF) to model the density function. However, in indoor scenes, the density computed from the SDF for a sampled point may not consistently reflect its real importance in volume rendering, often due to the influence of neighboring objects. To tackle this issue, our work proposes a novel approach for indoor scene reconstruction, which instead parameterizes the density function with the Signed Ray Distance Function (SRDF). Firstly, the SRDF is predicted by the network and transformed to a ray-conditioned density function for volume rendering. We argue that the ray-specific SRDF only considers the surface along the camera ray, from which the derived density function is more consistent with the real occupancy than that derived from the SDF. Secondly, although the SRDF and SDF represent different aspects of scene geometry, their values should share the same sign, indicating the underlying spatial occupancy. Therefore, this work introduces an SRDF-SDF consistency loss to constrain the signs of the SRDF and SDF outputs. Thirdly, this work proposes a self-supervised visibility task, introducing the physical visibility geometry to the reconstruction task. The visibility task combines priors from the predicted SRDF and SDF as pseudo labels, and contributes to generating more accurate 3D geometry. Our method, implemented with different representations, has been validated on indoor datasets, achieving improved performance in both reconstruction and view synthesis.
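The sign-agreement constraint between the SRDF and SDF can be encoded, for example, as a hinge penalty on their product, which is negative exactly when the two predictions disagree in sign. The form below is one simple instantiation, not necessarily the paper's exact loss.

```python
import torch

def srdf_sdf_consistency(srdf: torch.Tensor, sdf: torch.Tensor) -> torch.Tensor:
    """Penalise sample points where the predicted SRDF and SDF disagree in sign.

    Both tensors hold per-sample distances; their product is negative exactly
    when the signs differ, so a hinged product encodes the constraint
    (illustrative form, not the paper's exact loss).
    """
    return torch.relu(-srdf * sdf).mean()

srdf = torch.randn(1024, requires_grad=True)
sdf = torch.randn(1024, requires_grad=True)
loss = srdf_sdf_consistency(srdf, sdf)
loss.backward()
```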


# 236
Taming Latent Diffusion Model for Neural Radiance Field Inpainting

Chieh Lin · Changil Kim · Jia-Bin Huang · Qinbo Li · Chih-Yao Ma · Johannes Kopf · Ming-Hsuan Yang · Hung-Yu Tseng

Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Despite some recent work showing preliminary success in editing a reconstructed NeRF with a diffusion prior, these methods still struggle to synthesize reasonable geometry in completely uncovered regions. One major reason is that the high diversity of synthetic content from the diffusion model hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models on real data often yields a textural shift that is incoherent with the image condition due to the auto-encoding error. These two problems are further reinforced by the use of pixel-distance losses. To address these issues, we propose to temper the stochasticity of the diffusion model with per-scene customization and mitigate the textural shift with masked adversarial training. During our analyses, we also found that the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes.


# 271
Strong Double Blind
Learning Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors

Wen Yuan Zhang · Kanle Shi · Yushen Liu · Zhizhong Han

Unsigned distance functions (UDFs) have been a vital representation for open surfaces. With different differentiable renderers, current methods are able to train neural networks to infer a UDF by minimizing the rendering errors of the UDF with respect to the multi-view ground truth. However, these differentiable renderers are mainly handcrafted, which makes them either biased at ray-surface intersections, sensitive to unsigned distance outliers, or not scalable to large-scale scenes. To resolve these issues, we present a novel differentiable renderer to infer UDFs more accurately. Instead of using handcrafted equations, our differentiable renderer is a neural network which is pre-trained in a data-driven manner. It learns how to render unsigned distances into depth images, leading to prior knowledge, dubbed volume rendering priors. To infer a UDF for an unseen scene from multiple RGB images, we generalize the learned volume rendering priors to map inferred unsigned distances in alpha blending for RGB image rendering. Our results show that the learned volume rendering priors are unbiased, robust, scalable, 3D aware, and, more importantly, easy to learn. We evaluate our method on both widely used benchmarks and real scenes, and report superior performance over state-of-the-art methods.


# 273
Strong Double Blind
GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer

Youngho Yoon · Hyun-Kurl Jang · Kuk-Jin Yoon

Novel view synthesis (NVS) aims to generate images at arbitrary viewpoints using multi-view images, and recent insights from neural radiance fields (NeRF) have contributed to remarkable improvements. Recently, studies on generalizable NeRF (G-NeRF) have addressed the challenge of per-scene optimization in NeRFs. The construction of radiance fields on-the-fly in G-NeRF simplifies the NVS process, making it well-suited for real-world applications. Meanwhile, G-NeRF still struggles in representing fine details for a specific scene due to the absence of per-scene optimization, even with texture-rich multi-view source inputs. As a remedy, we propose a Geometry-driven Multi-reference Texture transfer network (GMT) available as a plug-and-play module designed for G-NeRF. Specifically, we propose ray-imposed deformable convolution (RayDCN), which aligns input and reference features reflecting scene geometry. Additionally, the proposed texture preserving transformer (TPFormer) aggregates multi-view source features while preserving texture information. Consequently, our module enables direct interaction between adjacent pixels during the image enhancement process, which is deficient in G-NeRF models with an independent rendering process per pixel. This addresses constraints that hinder the ability to capture high-frequency details. Experiments show that our plug-and-play module consistently improves G-NeRF models on various benchmark datasets. The code will be publicly available soon.


# 275
Strong Double Blind
MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References

Lukas Bösiger · Mihai Dusmanu · Marc Pollefeys · Zuria Bauer

Rendering realistic images from 3D reconstruction is an essential task of many Computer Vision and Robotics pipelines, notably for mixed-reality applications as well as training autonomous agents in simulated environments. However, the quality of novel views heavily depends on the source reconstruction, which is often imperfect due to noisy or missing geometry and appearance. Inspired by the recent success of reference-based super-resolution networks, we propose MaRINeR, a refinement method that leverages information of a nearby mapping image to improve the rendering of a target viewpoint. We first establish matches between the raw rendered image of the scene geometry from the target viewpoint and the nearby reference based on deep features, followed by hierarchical detail transfer. We show improved renderings in quantitative metrics and qualitative examples from both explicit and implicit scene representations. We further employ our method on the downstream tasks of pseudo-ground-truth validation, synthetic data enhancement and detail recovery for renderings of reduced 3D reconstructions.


# 284
UNIKD: UNcertainty-Filtered Incremental Knowledge Distillation for Neural Implicit Representation

Mengqi GUO · Chen Li · Hanlin Chen · Gim Hee Lee

Recent neural implicit representations (NIRs) have achieved great success in the tasks of 3D reconstruction and novel view synthesis. However, they require the images of a scene from different camera views to be available for one-time training. This is expensive especially for scenarios with large-scale scenes and limited data storage. In view of this, we explore the task of incremental learning for NIRs in this work. We design a student-teacher framework to mitigate the catastrophic forgetting problem. As a result, the student network is able to learn new incoming data while preserving old knowledge simultaneously, aided by a random inquirer and an uncertainty-based filter for effective knowledge merging. Our proposed method is general and thus can be adapted to different implicit representations such as neural radiance field (NeRF) and neural SDF. Extensive experimental results for both 3D reconstruction and novel view synthesis demonstrate the effectiveness of our approach compared to different baselines.


# 282
Strong Double Blind
Rethinking Directional Parameterization in Neural Implicit Surface Reconstruction

Zijie Jiang · Tianhan Xu · Hiroharu Kato

Multi-view 3D surface reconstruction using neural implicit representations has made notable progress by modeling the geometry and view-dependent radiance fields within a unified framework. However, their effectiveness in reconstructing objects with specular or complex surfaces is typically biased by the directional parameterization used in their view-dependent radiance network. The viewing direction and the reflection direction are the two most commonly used directional parameterizations, but each has its own limitations. Utilizing the viewing direction usually struggles to correctly decouple the geometry and appearance of objects with highly specular surfaces, while using the reflection direction tends to yield overly smooth reconstructions for concave or complex structures. In this paper, we analyze their failure cases in detail and propose a novel hybrid directional parameterization to address their limitations in a unified form. Extensive experiments demonstrate that the proposed hybrid directional parameterization consistently delivers satisfactory results in reconstructing objects with a wide variety of materials, geometry and appearance, whereas other directional parameterizations face challenges in reconstructing certain objects. Moreover, the proposed hybrid directional parameterization is nearly parameter-free and can be effortlessly applied in any existing neural surface reconstruction method.
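The two standard directional parameterizations are the viewing direction itself and the reflection of the direction toward the camera about the surface normal; a hybrid scheme conditions the radiance network on information from both. The sketch below computes the two directions and simply concatenates them; the paper's actual hybrid mixing rule is more involved and not reproduced here.

```python
import torch
import torch.nn.functional as F

def reflection_direction(view_dir: torch.Tensor, normal: torch.Tensor) -> torch.Tensor:
    """Reflect the direction toward the camera about the surface normal.

    view_dir: (N, 3) unit vectors from the camera toward the surface point.
    normal:   (N, 3) unit surface normals (e.g. normalized SDF gradients).
    """
    omega_o = -view_dir                                   # direction from point to camera
    return 2.0 * (omega_o * normal).sum(-1, keepdim=True) * normal - omega_o

# The two classic inputs to the view-dependent radiance MLP. A hybrid scheme
# conditions on both (here simply concatenated for illustration).
view_dir = F.normalize(torch.randn(4096, 3), dim=-1)
normal = F.normalize(torch.randn(4096, 3), dim=-1)
ref_dir = reflection_direction(view_dir, normal)
dir_features = torch.cat([view_dir, ref_dir], dim=-1)    # (N, 6) fed to the radiance head
```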


# 280
Sur^2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

Zhangjin Huang · Zhihao Liang · Kui Jia

Multi-view surface reconstruction is an ill-posed, inverse problem in 3D vision research. It involves modeling the geometry and appearance with appropriate surface representations. Most of the existing methods rely either on explicit meshes, using surface rendering of meshes for reconstruction, or on implicit field functions, using volume rendering of the fields for reconstruction. The two types of representations in fact have their respective merits. In this work, we propose a new hybrid representation, termed Sur^2f, aiming to better benefit from both representations in a complementary manner. Technically, we learn two parallel streams of an implicit signed distance field and an explicit surrogate surface (Sur^2f) mesh, and unify volume rendering of the implicit signed distance function (SDF) and surface rendering of the surrogate mesh with a shared, neural shader; the unified shading promotes their convergence to the same, underlying surface. We synchronize learning of the surrogate mesh by driving its deformation with functions induced from the implicit SDF. In addition, the synchronized surrogate mesh enables surface-guided volume sampling, which greatly improves the sampling efficiency per ray in volume rendering. We conduct thorough experiments showing that Sur^2f outperforms existing reconstruction methods and surface representations, including hybrid ones, in terms of both recovery quality and recovery efficiency.


# 274
Strong Double Blind
Differentiable Convex Polyhedra Optimization from Multi-view Images

Daxuan Ren · Haiyi Mei · Hezi Shi · Jianmin Zheng · Jianfei Cai · Lei Yang

This paper presents a novel approach for the differentiable rendering of convex polyhedra, addressing the limitations of recent methods that rely on implicit field supervision. Our technique introduces a strategy that combines non-differentiable computation of hyperplane intersections through a duality transform with differentiable optimization of vertex positions via three-plane intersection, enabling gradient-based optimization without the need for 3D implicit fields. This allows for efficient shape representation across a range of applications, from shape parsing to compact mesh reconstruction. This work not only overcomes the challenges of previous approaches but also sets a new standard for representing shapes with convex polyhedra.
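The differentiable vertex-positioning step from three planes reduces to solving a small linear system: each plane contributes a row n_i and offset d_i of n_i · x = d_i, and the vertex is the solution of the stacked 3x3 system, through which torch.linalg.solve backpropagates. A minimal sketch:

```python
import torch

def three_plane_vertex(normals: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Vertex of three planes n_i . x = d_i, solved as a 3x3 linear system.

    normals: (..., 3, 3) rows are plane normals; offsets: (..., 3).
    torch.linalg.solve is differentiable, so gradients flow back to the plane
    parameters. Illustrative of the vertex-positioning step only.
    """
    return torch.linalg.solve(normals, offsets.unsqueeze(-1)).squeeze(-1)

# Example: the unit-cube corner at (1, 1, 1) from the planes x = 1, y = 1, z = 1.
n = torch.eye(3, requires_grad=True)
d = torch.ones(3)
v = three_plane_vertex(n, d)
v.sum().backward()      # gradients w.r.t. the plane normals
print(v)                # tensor([1., 1., 1.], grad_fn=...)
```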


# 240
Strong Double Blind
Combining Generative and Geometry Priors for Wide-Angle Portrait Correction

Lan Yao · Chaofeng Chen · Xiaoming Li · Zifei Yan · Wangmeng Zuo

Wide-angle lens distortion in portrait photography presents a significant challenge for capturing photo-realistic and aesthetically pleasing images. Such distortions are especially noticeable in face regions. To rectify facial distortions for a more natural appearance, we propose encapsulating the generative face prior derived from pre-trained StyleGAN. This prior is then leveraged to facilitate the correction of facial regions. For the non-face background, a notable central symmetry relationship exists in the wide-angle imaging process, yet it has not been explored in the correction process. This geometry prior motivates us to introduce a novel constraint with the explicit aim of preserving symmetry throughout the correction process, thereby contributing to a more visually appealing and natural correction in the non-face region. Experiments demonstrate that our approach outperforms previous methods by a large margin, excelling not only in quantitative measures such as line straightness and shape consistency metrics but also in terms of perceptual visual quality. Our source code and model will be made publicly available.


# 290
Strong Double Blind
I2-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM

Gwangtak Bae · Changwoon Choi · Hyeongjun Heo · Sang Min Kim · Young Min Kim

We present an inverse image-formation module that can enhance the robustness of existing visual SLAM pipelines for in-the-wild scenarios. Casual video captures often suffer from motion blur and varying appearance, which degrade the final quality of the coherent 3D visual representation. We propose integrating the physical imaging process into the SLAM system, which employs linear HDR radiance maps to collect measurements. Specifically, each frame aggregates renderings from multiple poses along the camera trajectory to explain the motion blur prevalent in hand-held videos. Additionally, we accommodate per-frame appearance variation by dedicating explicit variables to the image-formation steps, namely white balance, exposure time, and the camera response function. Through joint optimization of these additional variables, the SLAM pipeline produces high-quality images with more accurate trajectories. Extensive experiments demonstrate that our approach can be incorporated into recent visual SLAM pipelines using various scene representations, such as neural radiance fields or Gaussian splatting.
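
As a rough illustration of this kind of physical image-formation model, the sketch below averages linear HDR renderings from several poses to mimic motion blur and then applies white balance, exposure time, and a simple gamma curve as a stand-in for the camera response function; the function and parameter names are assumptions, not the paper's actual module.

```python
# Hedged sketch of a physical image-formation model: average HDR radiance over
# nearby poses (motion blur), then apply white balance, exposure, and a CRF.
import numpy as np

def form_ldr_image(hdr_renders, white_balance, exposure_time, crf_gamma=2.2):
    """hdr_renders: list of (H, W, 3) linear radiance maps rendered at nearby poses."""
    radiance = np.mean(hdr_renders, axis=0)                 # blur = average over poses
    scaled = radiance * white_balance * exposure_time       # per-frame appearance variables
    return np.clip(scaled, 0.0, 1.0) ** (1.0 / crf_gamma)   # gamma curve as a CRF stand-in

ldr = form_ldr_image([np.random.rand(4, 4, 3) for _ in range(5)],
                     white_balance=np.array([1.0, 0.95, 1.05]),
                     exposure_time=0.8)
```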


# 294
Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops

Aditya Prakash · Arjun Gupta · Saurabh Gupta

Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting to the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and the camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks (depth prediction on NYU, 3D object detection on KITTI & nuScenes, and 3D shape prediction of articulated objects on ARCTIC) show the benefits of KPE.
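
One simple way to encode where a crop sits in the camera's field of view is to back-project its pixel grid through the intrinsics into unit ray directions, as sketched below; this is an illustrative construction, and the function name, grid size, and exact encoding are assumptions rather than the paper's KPE definition.

```python
# Hypothetical crop-location encoding from camera intrinsics K: sample pixels
# inside the crop and return their normalized viewing-ray directions.
import numpy as np

def crop_ray_encoding(box_xyxy, K, n=4):
    """box_xyxy: crop corners in pixels; returns (n*n, 3) unit ray directions."""
    x0, y0, x1, y1 = box_xyxy
    xs, ys = np.meshgrid(np.linspace(x0, x1, n), np.linspace(y0, y1, n))
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(n * n)], axis=-1)  # homogeneous pixels
    rays = (np.linalg.inv(K) @ pix.T).T
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
encoding = crop_ray_encoding((100, 80, 220, 200), K)  # could be fed to the 3D prediction model
```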


# 278
MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

Tianqi Liu · Guangcong Wang · Shoukang Hu · Liao Shen · Xinyi Ye · Yuhang Zang · Zhiguo Cao · Wei Li · Ziwei Liu

We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes. Specifically, 1) we leverage MVS to encode geometry-aware Gaussian representations and decode them into Gaussian parameters. 2) To further enhance performance, we propose a hybrid Gaussian rendering that integrates an efficient volume rendering design for novel view synthesis. 3) To support fast fine-tuning for specific scenes, we introduce a multi-view geometric consistent aggregation strategy to effectively aggregate the point clouds generated by the generalizable model, serving as the initialization for per-scene optimization. Compared with previous generalizable NeRF-based methods, which typically require minutes of fine-tuning and seconds of rendering per image, MVSGaussian achieves real-time rendering (300+ FPS) with better synthesis quality for each scene on a single RTX 3090 GPU. Compared with the vanilla 3D-GS, MVSGaussian achieves better novel view synthesis with 13.3× less training computational cost (45s). Extensive experiments on DTU, Real Forward-facing, NeRF Synthetic, and Tanks and Temples datasets validate that MVSGaussian achieves state-of-the-art performance with convincing generalizability, real-time rendering speed, and fast per-scene optimization.


# 268
CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians

Yang Liu · Chuanchen Luo · Lue Fan · Naiyan Wang · Junran Peng · Zhaoxiang Zhang

The advancement of real-time 3D scene reconstruction and novel view synthesis has been significantly propelled by 3D Gaussian Splatting (3DGS). However, effectively training large-scale 3DGS and rendering it in real time across various scales remains challenging. This paper introduces CityGaussian (CityGS), which employs a novel divide-and-conquer training approach and Level-of-Detail (LoD) strategy for efficient large-scale 3DGS training and rendering. Specifically, a global scene prior and adaptive training data selection enable efficient training and seamless fusion. Based on the fused Gaussian primitives, we generate different detail levels through compression, and realize fast rendering across various scales through the proposed block-wise detail-level selection and aggregation strategy. Extensive experimental results on large-scale scenes demonstrate that our approach attains state-of-the-art rendering quality, enabling consistent real-time rendering of large-scale scenes across vastly different scales.


# 264
GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting

XINJIE ZHANG · Xingtong Ge · Tongda Xu · Dailan He · Yan Wang · Hongwei Qin · Guo Lu · Jing Geng · Jun Zhang

Implicit Neural Representations (INRs) have proven effective in image representation and compression, offering high visual quality and fast rendering speeds of 10-1000 FPS, assuming sufficient GPU resources are available. However, this requirement often hinders their use on low-end devices with limited memory. In response, we propose a groundbreaking paradigm of image representation and compression by 2D Gaussian Splatting, named GaussianImage. We first introduce 2D Gaussians to represent the image, where each Gaussian has 8 parameters including position, covariance, and color. Subsequently, we unveil a novel rendering algorithm based on accumulated summation. Remarkably, our method, with at least 3x lower GPU memory usage and 5x faster fitting time, not only rivals INRs (e.g., WIRE, I-NGP) in representation performance, but also delivers a faster rendering speed of 1500-2000 FPS regardless of parameter size. Furthermore, we integrate an existing vector quantization technique to build an image codec. Experimental results demonstrate that our codec attains rate-distortion performance comparable to compression-based INRs such as COIN and COIN++, while facilitating decoding speeds of approximately 1000 FPS. Additionally, an initial proof of concept indicates that our codec surpasses COIN and COIN++ in performance when utilizing partial bits-back coding.
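
A toy version of the representation is easy to write down: each 2D Gaussian carries a position (2 values), a covariance factor (3 values), and an RGB color (3 values), matching the 8 parameters mentioned above, and the image is formed by summing the Gaussians' contributions without any sorting or alpha compositing. The parameterization below (a Cholesky factor of the covariance) is an assumption made for illustration.

```python
# Sketch of accumulated-summation rendering with 2D Gaussians (illustrative only).
import numpy as np

def render_gaussian_image(mu, chol, color, H, W):
    """mu: (N,2) positions, chol: (N,2,2) lower-triangular covariance factors, color: (N,3)."""
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)  # (H*W, 2)
    img = np.zeros((H * W, 3))
    for m, L, c in zip(mu, chol, color):
        cov = L @ L.T
        diff = pix - m
        maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(cov), diff)
        img += np.exp(-0.5 * maha)[:, None] * c              # accumulate; no depth sorting
    return img.reshape(H, W, 3)

img = render_gaussian_image(np.array([[16.0, 16.0]]),
                            np.array([[[4.0, 0.0], [1.0, 3.0]]]),
                            np.array([[1.0, 0.5, 0.2]]), 32, 32)
```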


# 250
Strong Double Blind
FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally

Qiuhong Shen · Xingyi Yang · Xinchao Wang

This study addresses the challenge of accurately segmenting 3D Gaussian Splatting (3D-GS) from 2D masks. Conventional methods often rely on iterative gradient descent to assign each Gaussian a unique label, leading to lengthy optimization and sub-optimal solutions. Instead, we propose a straightforward yet globally optimal solver for 3D-GS segmentation. The core insight of our method is that, with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially a linear function with respect to the labels of each Gaussian. As such, the optimal label assignment can be solved via linear programming in closed form. This solution capitalizes on the alpha-blending characteristic of the splatting process for single-step optimization. By incorporating a softening term in our objective function, our method shows superior robustness in 3D segmentation against noise. Remarkably, our optimization completes within 30 seconds, about 50$\times$ faster than the best existing methods. Extensive experiments demonstrate our method’s efficiency and robustness in segmenting various scenes, and its superior performance in downstream tasks such as object removal and inpainting. We will make all code and results publicly available.
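
The "rendering is linear in the labels" observation can be made concrete with a small vote-accumulation example: if the per-pixel blending weight of every Gaussian is known, each Gaussian's contribution inside and outside a mask is just a weighted sum, and a label can be picked by maximizing the accumulated vote. The sketch below illustrates this principle only; it is not the paper's closed-form solver and omits its softening term.

```python
# Illustrative label assignment from alpha-blending weights and 2D masks.
import numpy as np

def assign_labels(weights, masks):
    """weights: (V, P, G) blending weight of Gaussian g at pixel p in view v;
    masks: (V, P) integer labels in {0..K-1}; returns one label per Gaussian."""
    K = int(masks.max()) + 1
    G = weights.shape[-1]
    votes = np.zeros((G, K))
    for k in range(K):
        votes[:, k] = (weights * (masks == k)[..., None]).sum(axis=(0, 1))  # linear in labels
    return votes.argmax(axis=1)

labels = assign_labels(np.random.rand(2, 6, 5), np.random.randint(0, 3, size=(2, 6)))
```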


# 235
PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis

Jason Yu · Tristan Aumentado-Armstrong · Fereshteh Forghani · Konstantinos Derpanis · Marcus A Brubaker

This paper considers the problem of generative novel view synthesis (GNVS), generating novel, plausible views of a scene given a limited number of known views. Here, we propose a set-based generative model that can simultaneously generate multiple, self-consistent new views, conditioned on any number of known views. Our approach is not limited to generating a single image at a time and can condition on zero, one, or more views. As a result, when generating a large number of views, our method is not restricted to a low-order autoregressive generation approach and is better able to maintain generated image quality over large sets of images. We evaluate the proposed model on standard NVS datasets and show that it outperforms the state-of-the-art image-based GNVS baselines. Further, we show that the model is capable of generating sets of camera views that have no natural sequential ordering, like loops and binocular trajectories, and significantly outperforms other methods on such tasks.


# 244
MegaScenes: Scene-Level View Synthesis at Scale

Joseph Tung · Gene Chou · Ruojin Cai · Guandao Yang · Kai Zhang · Gordon Wetzstein · Bharath Hariharan · Noah Snavely

Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse), or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K SfM reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues to further create a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments we validate the effectiveness of both our dataset and method on generating in-the-wild scenes.


# 239
HiFi-123: Towards High-fidelity One Image to 3D Content Generation

Wangbo Yu · Li Yuan · Yanpei Cao · Xiangjun Gao · Xiaoyu Li · WENBO HU · Long Quan · Ying Shan · Yonghong Tian

Recent advances in diffusion models have enabled 3D generation from a single image. However, current methods often produce suboptimal results for novel views, with blurred textures and deviations from the reference image, limiting their practical applications. In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation. Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods. Second, capitalizing on the RGNV, we present a novel Reference-Guided State Distillation (RGSD) loss. When incorporated into the optimization-based image-to-3D pipeline, our method significantly improves 3D generation quality, achieving state-of-the-art performance. Comprehensive evaluations demonstrate the effectiveness of our approach over existing methods, both qualitatively and quantitatively. Video comparisons are available on the supplementary project page. We will release our code to the public.


# 233
View-Consistent 3D Editing with Gaussian Splatting

Yuxuan Wang · Xuanyu Yi · Zike Wu · Na Zhao · Long Chen · Hanwang Zhang

The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations. Currently, diffusion-based 2D editing models are harnessed to modify multi-view rendered images, which then guide the editing of 3DGS models. However, this approach faces a critical issue of multi-view inconsistency, where the guidance images exhibit significant discrepancies across views, leading to mode collapse and visual artifacts of 3DGS. To this end, we introduce View-consistent Editing (VcEdit), a novel framework that seamlessly incorporates 3DGS into image editing processes, ensuring multi-view consistency in edited guidance images and effectively mitigating mode collapse issues. VcEdit employs two innovative consistency modules: the Cross-attention Consistency Module and the Editing Consistency Module, both designed to reduce inconsistencies in edited images. By incorporating these consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency, facilitating high-quality 3DGS editing across a diverse range of scenes.


# 260
Compress3D: a Compressed Latent Space for 3D Generation from a Single Image

Bowen Zhang · Tianyu Yang · Yu Li · Lei Zhang · Xi Zhao

3D generation has witnessed significant advancements, yet efficiently producing high-quality 3D assets from a single image remains challenging. In this paper, we present a triplane autoencoder, which encodes 3D models into a compact triplane latent space to effectively compress both the 3D geometry and texture information. Within the autoencoder framework, we introduce a 3D-aware cross-attention mechanism, which utilizes low-resolution latent representations to query features from a high-resolution 3D feature volume, thereby enhancing the representation capacity of the latent space. Subsequently, we train a diffusion model on this refined latent space. In contrast to solely relying on image embedding for 3D generation, our proposed method advocates for the simultaneous utilization of both image embedding and shape embedding as conditions. Specifically, the shape embedding is estimated via a diffusion prior model conditioned on the image embedding. Through comprehensive experiments, we demonstrate that our method outperforms state-of-the-art algorithms, achieving superior performance while requiring less training data and time. Our approach enables the generation of high-quality 3D assets in merely 7 seconds on a single A100 GPU. More results and visualization can be found on our project page: https://compress3d.github.io/


# 286
Strong Double Blind
Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis

Qi Sun · Hang Zhou · Wengang Zhou · Li Li · Houqiang Li

Synthesizing realistic 3D indoor scenes is a challenging task that traditionally relies on manual arrangement and annotation by expert designers. Recent advances in autoregressive models have automated this process, but they often lack semantic understanding of the relationships and hierarchies present in real-world scenes, yielding limited performance. In this paper, we propose Forest2Seq, a framework that formulates indoor scene synthesis as an order-aware sequential learning problem. Forest2Seq organizes the inherently unordered collection of scene objects into structured, ordered hierarchical scene trees and forests. By employing a clustering-based algorithm and a breadth-first traversal, Forest2Seq derives meaningful orderings and utilizes a transformer to generate realistic 3D scenes autoregressively. Experimental results on standard benchmarks demonstrate Forest2Seq's superiority in synthesizing more realistic scenes compared to top-performing baselines, with significant improvements in FID and KL scores. Our additional experiments for downstream tasks and ablation studies also confirm the importance of incorporating order as a prior in 3D scene generation.
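
The ordering idea can be illustrated in a few lines: given a hierarchical scene tree, a breadth-first traversal turns the unordered object set into a sequence that an autoregressive transformer can consume. The tree structure and field names below are assumptions made purely for illustration.

```python
# Toy breadth-first ordering of a scene tree into a generation sequence.
from collections import deque

def bfs_order(root):
    """root: {'name': str, 'children': [...]}; returns object names in BFS order."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node['name'])
        queue.extend(node.get('children', []))
    return order

scene = {'name': 'bed', 'children': [{'name': 'nightstand', 'children': [{'name': 'lamp'}]},
                                     {'name': 'wardrobe'}]}
print(bfs_order(scene))  # ['bed', 'nightstand', 'wardrobe', 'lamp']
```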


# 265
Strong Double Blind
3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views

Kennard Yanting Chan · Fayao Liu · Guosheng Lin · Chuan Sheng Foo · Weisi Lin

Pixel-aligned implicit models, such as Multi-view PIFu, DeepMultiCap, DoubleField, and SeSDF, are well-established methods for reconstructing a clothed human from sparse views. However, given V images, these models only combine features from the images in a point-wise and localized manner. In other words, the V images are processed individually and are only combined in a very narrow fashion at the end of the pipeline. To a large extent, this defeats the purpose of having multi-view information, since the multi-view task in question is predominantly treated as a single-view task. To resolve this, we introduce 3DFG-PIFu, a pixel-aligned implicit model that exploits multi-view information right from the start and all the way to the end of the pipeline. Our 3DFG-PIFu makes use of 3D Feature Grids to combine features from the V images in a global manner (rather than point-wise or localized) and throughout the pipeline. Beyond the 3D Feature Grids, 3DFG-PIFu also proposes an iterative mechanism that refines and updates an existing output human mesh using the different views. Moreover, 3DFG-PIFu introduces SDF-based SMPL-X features, a new way of incorporating an SMPL-X mesh into a pixel-aligned implicit model. Our experiments show that 3DFG-PIFu significantly outperforms SOTA models both qualitatively and quantitatively. Our code will be published.


# 225
Nuvo: Neural UV Mapping for Unruly 3D Representations

Pratul Srinivasan · Stephan J Garbin · Dor Verbin · Jon Barron · Ben Mildenhall

Existing UV mapping algorithms are designed to operate on well-behaved meshes, instead of the geometry representations produced by state-of-the-art 3D reconstruction and generation techniques. As such, applying these methods to the volume densities recovered by neural radiance fields and related techniques (or meshes triangulated from such fields) results in texture atlases that are too fragmented to be useful for tasks such as view synthesis or appearance editing. We present a UV mapping method designed to operate on geometry produced by 3D reconstruction and generation techniques. Instead of computing a mapping defined on a mesh's vertices, our method Nuvo uses a neural field to represent a continuous UV mapping, and optimizes it to be a valid and well-behaved mapping for just the set of visible points, i.e. only points that affect the scene's appearance. We show that our model is robust to the challenges posed by ill-behaved geometry, and that it produces editable UV mappings that can represent detailed appearance.


# 222
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors

Ruicheng Wang · Jianfeng Xiang · Jiaolong Yang · Xin Tong

We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation. Existing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide an appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.


# 218
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

Ian Huang · Guandao Yang · Leonidas Guibas

Graphics design is important for various applications, including movie production and game design. To create a high-quality scene, designers usually need to spend hours in software like Blender, in which they might need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences, making automation difficult. In this paper, we propose a system that leverages Vision-Language Models (VLMs), like GPT-4V, to intelligently search the design action space to arrive at an answer that can satisfy a user's intent. Specifically, we design a vision-based edit generator and state evaluator to work together to find the correct sequence of actions to achieve the goal. Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of VLMs with ``imagined'' reference images from image-generation models, providing visual grounding of abstract language descriptions. In this paper, we provide empirical evidence suggesting our system can produce simple but tedious Blender editing sequences for tasks such as editing procedural materials from text and/or reference images, as well as adjusting lighting configurations for product renderings in complex scenes.


# 228
Strong Double Blind
A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control

Karim Kadry · Shreya Gupta · Jonas Sogbadji · Michiel Schaap · Kersten Petersen · Takuya Mizukami · Carlos Collet · Farhad R. Nezami · Elazer R Edelman

Virtual interventions enable the physics-based simulation of device deployment within patient-specific coronary artery anatomy. This framework enables the exploration of alternate scenarios by deploying counterfactual device designs within the same anatomy, revealing critical design factors for patient outcomes. In contrast, our ability to simulate alternate scenarios with anatomic counterfactuals is highly limited. In this study, we investigate how Latent Diffusion Models (LDMs) can custom-synthesize coronary anatomy for virtual intervention studies. We introduce several adaptations to enforce anatomic constraints regarding topological validity, local morphological shape, and global skeletal structure. Specifically, we regularize the LDM latent space to reduce topological defects and introduce a conditioning framework based on clinically interpretable and editable coronary morpho-skeletons. We lastly extend diffusion model guidance strategies to the context of morpho-skeletal conditioning and propose a novel guidance method that adaptively updates the guiding condition throughout sampling. Our framework enables the generation and editing of coronary anatomy in a controllable manner, allowing device designers to better explore the relationship between device design and patient-specific anatomy.


# 221
Strong Double Blind
DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

Haibo Yang · Yang Chen · Yingwei Pan · Ting Yao · Zhineng Chen · Zuxuan Wu · Yu-Gang Jiang · Tao Mei

Learning radiance fields (NeRF) with powerful 2D diffusion models has garnered popularity for text-to-3D generation. Nevertheless, the implicit 3D representations of NeRF lack explicit modeling of meshes and textures over surfaces, and such a surface-agnostic formulation may suffer from issues such as noisy surfaces with ambiguous texture details or cross-view inconsistency. To alleviate this, we present DreamMesh, a novel text-to-3D architecture that pivots on well-defined surfaces (triangle meshes) to generate high-fidelity explicit 3D models. Technically, DreamMesh capitalizes on a distinctive coarse-to-fine scheme. In the coarse stage, the mesh is first deformed by text-guided Jacobians, and then DreamMesh textures the mesh with an interlaced use of 2D diffusion models in a tuning-free manner from multiple viewpoints. In the fine stage, DreamMesh jointly manipulates the mesh and refines the texture map, leading to high-quality triangle meshes with high-fidelity textured materials. Extensive experiments demonstrate that DreamMesh significantly outperforms state-of-the-art text-to-3D methods in faithfully generating 3D content with richer textual details and enhanced geometry.


# 208
TPA3D: Triplane Attention for Fast Text-to-3D Generation

Bin-Shih Wu · HONG-EN CHEN · Sheng-Yu Huang · Frank Wang

Due to the lack of large-scale text-3D correspondence data, recent text-to-3D generation works mainly rely on utilizing 2D diffusion models for synthesizing 3D data. Since diffusion-based methods typically require significant optimization time for both training and inference, the use of GAN-based models would still be desirable for fast 3D generation. In this work, we propose Triplane Attention for text-guided 3D generation (TPA3D), an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation. With only 3D shape data and their rendered 2D images observed during training, our TPA3D is designed to retrieve detailed visual descriptions for synthesizing the corresponding 3D mesh data. This is achieved by the proposed attention mechanisms on the extracted sentence and word-level text features. In our experiments, we show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions, while impressive computation efficiency can be observed.


# 224
Strong Double Blind
DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement

Qimin Chen · Zhiqin Chen · Vladimir Kim · Noam Aigerman · Hao Richard Zhang · Siddhartha Chaudhuri

We present a 3D modeling method which enables end-users to refine or detailize 3D shapes using machine learning, expanding the capabilities of AI-assisted 3D content creation. Given a coarse voxel shape (e.g., one produced with a simple box extrusion tool or via generative modeling), a user can directly “paint” desired target styles representing compelling geometric details, from input exemplar shapes, over different regions of the coarse shape. These regions are then up-sampled into high-resolution geometries which adhere to the painted styles. To achieve such controllable and localized 3D detailization, we build on top of a Pyramid GAN by making it masking-aware. We devise novel structural losses and priors to ensure that our method preserves both desired coarse structures and fine-grained features even if the painted styles are borrowed from diverse sources, e.g., different semantic parts and even different shape categories. Through extensive experiments, we show that our ability to localize details enables novel interactive creative workflows and applications. Our experiments further demonstrate that in comparison to prior techniques built on global detailization, our method generates structure-preserving, high-resolution stylized geometries with more coherent shape details and style transitions.


# 212
WordRobe: Text-Guided Generation of Textured 3D Garments

Astitva Srivastava · Pranav Manu · Amit Raj · Varun Jampani · Avinash Sharma

In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose WordRobe, a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time as compared to existing methods. We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation & animation pipelines without any post-processing.


# 205
Strong Double Blind
AnyHome: Open-Vocabulary Large-Scale Indoor Scene Generation with First-Person View Exploration

Rao Fu · Zehao Wen · Zichen Liu · Srinath Sridhar

Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text into well-structured and textured indoor scenes at a house-scale. By prompting Large Language Models (LLMs) with designed templates, our approach converts provided textual narratives into amodal structured representations. These representations guarantee consistent and realistic spatial layouts by directing the synthesis of a geometry mesh within defined constraints. A Score Distillation Sampling process is then employed to refine the geometry, followed by an egocentric inpainting process that adds lifelike textures to it. AnyHome stands out with its editability, customizability, diversity, and realism. The structured representations for scenes allow for extensive editing at varying levels of granularity. Capable of interpreting texts ranging from simple labels to detailed narratives, AnyHome generates detailed geometries and textures that outperform existing methods in both quantitative and qualitative measures.


# 214
Strong Double Blind
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

Guian Fang · Wenbiao Yan · Yuanfan Guo · Jianhua Han · Zutao Jiang · Hang Xu · Shengcai Liao · Xiaodan Liang

Text-to-image diffusion models have significantly advanced conditional image generation. However, these models usually struggle with accurately rendering images featuring humans, resulting in distorted limbs and other anomalies. This issue primarily stems from the insufficient recognition and evaluation of limb qualities in diffusion models. To address this issue, we introduce AbHuman, the first large-scale synthesized human benchmark focusing on anatomical anomalies. This benchmark consists of 56K synthesized human images, each annotated with detailed, bounding-box-level labels identifying 147K human anomalies in 18 different categories. Based on this, the recognition of human anomalies can be established, which in turn enhances image generation through traditional techniques such as negative prompting and guidance. To further boost performance, we propose HumanRefiner, a novel plug-and-play approach for the coarse-to-fine refinement of human anomalies in text-to-image generation. Specifically, HumanRefiner utilizes a self-diagnostic procedure to detect and correct issues related to both coarse-grained abnormal human poses and fine-grained anomaly levels, facilitating pose-reversible diffusion generation. Experimental results on the AbHuman benchmark demonstrate that HumanRefiner significantly reduces generative discrepancies, achieving a 2.9x improvement in limb quality compared to the state-of-the-art open-source generator SDXL and a 1.4x improvement over DALL-E 3 in human evaluations. The dataset and code will be released.


# 211
Strong Double Blind
SENC: Handling Self-collision in Neural Cloth Simulation

Zhouyingcheng Liao · Sinan Wang · Taku Komura

We present SENC, a novel self-supervised neural cloth simulator that addresses the challenge of cloth self-collision. This problem has remained unresolved due to the gap in simulation setup between recent collision detection and response approaches and self-supervised neural simulators. The former requires collision-free initial setups, while the latter necessitates random cloth instantiation during training. To tackle this issue, we propose a novel loss based on Global Intersection Analysis (GIA). This loss extracts the volume surrounded by the cloth region that forms the penetration. By constructing an energy based on this volume, our self-supervised neural simulator can effectively address cloth self-collisions. Moreover, we develop a self-collision-aware graph neural network capable of learning to handle self-collisions, even for parts that are topologically distant from one another. Additionally, we introduce an effective external force scheme that enables the simulation to learn the cloth's behavior in response to random external forces. We validate the efficacy of SENC through extensive quantitative and qualitative experiments, demonstrating that it effectively reduces cloth self-collision while maintaining high-quality animation results.


# 204
AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation

Xinzhou Wang · Yikai Wang · junliang ye · Fuchun Sun · Zhengyi Wang · Ling Wang · Pengkun Liu · Kai Sun · Xintong Wang · Xie wende · Fangfu Liu · Bin He

Advances in 3D generation have facilitated sequential 3D model generation (a.k.a. 4D generation), yet its application to animatable objects with large motion remains scarce. Our work proposes AnimatableDreamer, a text-to-4D generation framework capable of generating diverse categories of non-rigid objects on skeletons extracted from a monocular video. At its core, AnimatableDreamer is equipped with our novel optimization design dubbed Canonical Score Distillation (CSD), which lifts 2D diffusion for temporally consistent 4D generation. CSD, designed from a score-gradient perspective, generates a canonical model with warp-robustness across different articulations. Notably, it also enhances the authenticity of bones and skinning by integrating inductive priors from a diffusion model. Furthermore, with multi-view distillation, CSD infers invisible regions, thereby improving the fidelity of monocular non-rigid reconstruction. Extensive experiments demonstrate the capability of our method in generating high-flexibility text-guided 3D models from monocular video, while also showing improved reconstruction performance over existing non-rigid reconstruction methods. Project page: https://AnimatableDreamer.github.io/


# 202
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Armen Avetisyan · Christopher Xie · Henry Howard-Jenkins · Tsun-Yi Yang · Samir Aroudj · Suvam Patra · Fuyang Zhang · Luke Holland · Duncan Frost · Campbell Orme · Jakob Engel · Edward Miller · Richard Newcombe · Vasileios Balntas

We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers & LLMs, and departs from more traditional methods which commonly describe scenes as meshes, voxel grids, point clouds, or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Anonymized Dataset consisting of 100k high-quality indoor scenes, with photorealistic and ground-truth annotated renders of egocentric scene walkthroughs. Our method gives state-of-the-art results in architectural layout estimation, and competitive results in 3D object detection. Lastly, we explore an advantage of SceneScript: the ability to readily adapt to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction.
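
To make the "scene as a sequence of structured language commands" idea concrete, here is a purely hypothetical example of what such a command sequence might look like; the actual SceneScript command vocabulary and syntax are not specified in the abstract, so every command and parameter name below is an assumption.

```python
# Hypothetical structured-language scene description (illustrative only).
scene_commands = [
    "make_wall id=0 x1=0.0 y1=0.0 x2=4.2 y2=0.0 height=2.6",
    "make_wall id=1 x1=4.2 y1=0.0 x2=4.2 y2=3.1 height=2.6",
    "make_door wall_id=0 center=1.1 width=0.9",
    "make_bbox class=chair x=2.0 y=1.5 z=0.0 yaw=0.3",
]
tokens = " <sep> ".join(scene_commands).split()  # naive tokenization for an autoregressive model
print(len(tokens))
```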


# 107
Strong Double Blind
Diffusion Models as Data Mining Tools

Ioannis Siglidis · Aleksander Holynski · Alexei Efros · Mathieu Aubry · Shiry Ginosar

This paper demonstrates how to use generative models as data mining tools. Our insight is that generative methods learn an accurate model of their training data that we can analyze to summarize and understand that data. This analysis-by-synthesis approach to data mining has two key advantages. First, it scales much better than traditional correspondence-based approaches since it does not require explicitly comparing all pairs of visual elements. Second, generative models can disentangle factors of variation within the training data (such as appearance vs. geographic location), which are nearly always entangled within the data itself. In this work, we train a location-conditioned diffusion model on worldwide streetview data to mine for geographical visual patterns. We synthesize a dataset of parallel images depicting the same scene layouts across different locations using this model. We define typicality measures, assessing how characteristic visual elements are for geographic location, either country-specific or in terms of cross-country variability.


# 262
ReMatching: Low-Resolution Representations for Scalable Shape Correspondence

Filippo Maggioli · Daniele Baieri · Emanuele Rodola · Simone Melzi

We introduce \emph{ReMatching}, a novel shape correspondence solution based on the functional maps framework. Our method, by exploiting a new and appropriate \emph{re}-meshing paradigm, can target shape-\emph{matching} tasks even on meshes with millions of vertices, where the original functional maps framework does not apply or requires a massive computational cost. The core of our procedure is a time-efficient remeshing algorithm which constructs a low-resolution geometry while acting conservatively on the original topology and metric. These properties allow translating the functional maps optimization problem onto the resulting low-resolution representation, thus enabling efficient computation of correspondences with functional map approaches. Finally, we propose an efficient technique for extending the estimated correspondence to the original meshes. Through quantitative and qualitative comparisons, we show that our method is more efficient and effective, outperforming state-of-the-art pipelines in quality and computational cost.


# 310
Strong Double Blind
PolyRoom: Room-aware Transformer for Floorplan Reconstruction

Yuzhou Liu · Lingjie Zhu · Xiaodong Ma · Hanqiao Ye · Xiang Gao · Xianwei Zheng · Shuhan Shen

Reconstructing the geometry and topology structure from raw unstructured data has always been an important research topic in indoor mapping. In this paper, we aim to reconstruct the floorplan with a vectorized representation from point clouds. Although considerable advancements have been achieved in this field over recent years, current methods still face several challenges, including missing corners or edges, inaccurate corner positions or angles, self-intersecting or overlapping polygons and even implausible topology. To address these challenges, we present PolyRoom, a room-aware Transformer for floorplan reconstruction by introducing uniform sampling representation, room query initialization and hierarchical self-attention. Specifically, we first project the 3D point clouds as a 2D density map and initialize room queries through instance segmentation and uniform sampling. Then, in the Transformer, room queries first interact with each other with hierarchical self-attention. After that, with deformable cross-attention, vertex coordinates are refined layer by layer under dense supervision. Finally, the compact and structured floorplan is extracted through simple corner vertex selection operations. Experimental results on two widely used datasets demonstrate that PolyRoom outperforms current state-of-the-art methods both quantitatively and qualitatively. We will publicly share all our code.


# 308
Strong Double Blind
WindPoly: Polygonal Mesh Reconstruction via Winding Numbers

Xin He · Chenlei Lyu · Pengdi Huang · Hui Huang

Polygonal mesh reconstruction from a raw point cloud is a valuable topic in the fields of computer graphics and 3D vision. Especially for 3D architectural models, polygonal meshes provide concise expressions of fundamental geometric structures while effectively reducing data volume. However, traditional reconstruction methods have several limitations: dependence on normal vectors, sensitivity to noisy points and defective parts, and loss of internal geometric structure, which reduce their practicality in real scenes. In this paper, we propose a robust and efficient polygonal mesh reconstruction method to address these issues in the architectural point cloud reconstruction task. It is an iterative adaptation process that detects planar shapes from scattered points. The initial structural polygonal mesh can be established in the constructed convex polyhedral space without the assistance of normal vectors. Then, we develop an efficient polygon-based winding number strategy to orient the polygonal mesh with global consistency. The significant advantage of our method is that it provides a structural reconstruction for architectural point clouds and avoids point-based normal vector analysis, which effectively improves robustness to noisy points and defective parts. More geometric details can be preserved in the reconstructed polygonal mesh. Experimental results show that our method can reconstruct concise, oriented, and faithful polygonal meshes that are better than the results of state-of-the-art methods.
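
For reference, the generalized winding number that such orientation strategies build on can be evaluated per query point with the classical solid-angle formula, sketched below for triangles; the paper's polygon-based formulation and its efficiency tricks are not reproduced here.

```python
# Generalized winding number of a query point w.r.t. a triangle soup
# (van Oosterom & Strackee solid-angle formula), summed over triangles.
import numpy as np

def winding_number(query, vertices, triangles):
    w = 0.0
    for tri in triangles:
        a, b, c = (vertices[i] - query for i in tri)
        la, lb, lc = np.linalg.norm(a), np.linalg.norm(b), np.linalg.norm(c)
        num = np.dot(a, np.cross(b, c))
        den = la * lb * lc + np.dot(a, b) * lc + np.dot(b, c) * la + np.dot(c, a) * lb
        w += 2.0 * np.arctan2(num, den)
    return w / (4.0 * np.pi)   # ~1 inside a closed, outward-oriented surface; ~0 outside

w = winding_number(np.zeros(3),
                   np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [-1.0, -1.0, -1.0]]),
                   [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)])  # ~1.0: origin is enclosed
```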


# 304
Strong Double Blind
Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack

Mingyu Yang · Daizong Liu · Keke Tang · Pan Zhou · Lixing Chen · Junyang Chen

With the maturity of depth sensors, point clouds have received increasing attention in various 3D safety-critical applications, while deep point cloud learning models have been shown to be vulnerable to adversarial attacks. Most existing 3D attackers rely on implicit global distance losses that perturb the whole point set, failing to respect proper 3D geometry even though point clouds are highly structured. To this end, in this paper, we propose a novel Wavelet Patches Attack (WPA), which leverages local spectral attributes to identify curvature-aware patches for hiding imperceptible perturbations aligned with their local geometric characteristics. Specifically, WPA first transforms the point cloud into the spectral domain using a wavelet operator, obtaining potential geometric structures in different local regions. Each wavelet corresponds to a different curvature context of local points. Then, by decomposing the 3D object into different curvature-aware levels through the wavelet coefficients, we can perceive the local geometric characteristics and obtain various curvature-consistent patches. At last, based on the curvature variations of patches, WPA introduces two types of perturbations, along the tangent plane and along the normal vector direction, to hide imperceptible noise in slow- and fast-variation patches, preserving the geometry-sensitive local characteristics of smoothness and sharpness, respectively. Experiments demonstrate the superior imperceptibility of our attack method, achieving favorable results on existing 3D classification models while exhibiting robust resistance to various defense mechanisms.


# 307
Strong Double Blind
Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion

Hang Xu · Chen Long · Wenxiao Zhang · Yuan Liu · Zhen Cao · Zhen Dong · Bisheng Yang

In this paper, we explore a novel framework, EGIInet (Explicitly Guided Information Interaction Network), a model for the View-guided Point cloud Completion (ViPC) task, which aims to restore a complete point cloud from a partial one with a single view image. In comparison with previous methods that rely on the global semantics of input images, EGIInet efficiently combines the information from the two modalities by leveraging the geometric nature of the completion task. Specifically, we propose an explicitly guided information interaction strategy supported by modal alignment for point cloud completion. First, in contrast to previous methods which simply use 2D and 3D backbones to encode features separately, we unify the encoding process to promote modal alignment. Second, we propose a novel explicitly guided information interaction strategy that helps the network identify critical information within images, thus achieving better guidance for completion. Extensive experiments demonstrate the effectiveness of our framework, and we achieve a new state-of-the-art (a 16% improvement in CD over XMFnet) on benchmark datasets despite using fewer parameters than previous methods. The pre-trained model and code will be released upon acceptance.


# 309
Strong Double Blind
Diffusion Bridges for 3D Point Cloud Denoising

Mathias Vogel · Keisuke Tateno · Marc Pollefeys · Federico Tombari · Marie-Julie Rakotosaona · Francis Engelmann

In this work, we address the task of point cloud denoising using a novel framework that adapts Diffusion Schrödinger bridges to unstructured data such as point sets. Unlike previous works that predict point-wise displacements from point features or learned noise distributions, our method learns an optimal transport plan between paired point clouds. In experiments on object datasets such as the PU-Net dataset and real-world datasets like ScanNet++ and ARKitScenes, P2P-Bridge improves by a notable margin over existing methods. Although our method achieves promising results using point coordinates alone, we show that incorporating additional features like RGB information and point-wise DINOV2 features further improves the results. Code will be made public upon acceptance.


# 311
Strong Double Blind
Towards a Density Preserving Objective Function for Learning on Point Sets

Haritha Jayasinghe · Ioannis Brilakis

Accurate measurement of the discrepancy between point sets is crucial for point cloud learning tasks. Chamfer distance (CD) is favoured over more effective loss metrics such as Earth Mover's Distance (EMD) for this purpose due to its computational efficiency. Previous investigations into loss function improvements almost exclusively focus on 3D losses as static metrics, and ignore their dynamic behaviour during training. We show that directly modifying the correspondence criteria can prevent clustering of points during training, and lead to more uniform point distributions. We propose UniformCD, a novel 3D distance metric that prioritises matching the relative local densities of point neighbourhoods when assigning correspondences. Experiments demonstrate that the proposed loss function improves performance on a variety of tasks such as cloud completion, parametric model optimisation, as well as downstream task performance when used for self-supervised learning, achieving SOTA EMD results among point set objective functions. We show that the proposed method exploits local density information to converge towards globally optimum density distributions, narrowing the disparity between CD and EMD.
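
For readers unfamiliar with the baseline, the sketch below implements the standard symmetric Chamfer distance and a crude k-NN density proxy of the kind a density-aware correspondence criterion could use; the actual UniformCD matching rule is not described in the abstract, so the density term here is purely illustrative.

```python
# Baseline Chamfer distance plus a toy local-density proxy (illustrative only).
import numpy as np

def chamfer_distance(a, b):
    """a: (N,3), b: (M,3); symmetric nearest-neighbour Chamfer distance."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def local_density(p, k=4):
    """Inverse mean distance to the k nearest neighbours as a simple density estimate."""
    d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1]                          # skip the zero self-distance
    return 1.0 / (knn.mean(axis=1) + 1e-8)

a, b = np.random.rand(64, 3), np.random.rand(64, 3)
print(chamfer_distance(a, b), local_density(a).shape)
```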


# 312
Strong Double Blind
Syn-to-Real Domain Adaptation for Point Cloud Completion via Part-based Approach

Yunseo Yang · Jihun Kim · Kuk-Jin Yoon

Acquiring complete point clouds for real-world scenarios is labor-intensive, making it impractical for conventional learning-based approaches. Numerous methods have been proposed to overcome this limitation by leveraging synthetic complete point clouds. While access to complete point clouds offers a notable advantage, they often struggle to bridge domain gaps, leading to sub-optimal performance. As a remedy, we propose a novel part-based framework for synthetic-to-real domain adaptation in point cloud completion. Our approach starts on the observation that domain gaps inherent in part information are relatively small, as parts are shared properties across categories regardless of domains. To employ part-based approach to point cloud completion, we introduce Part-Based Decomposition (PBD) module to generate part input point clouds. Subsequently, we design a Part-Aware Completion (PAC) module, which operates in a part-wise manner to produce complete point clouds. Within PAC, we devise a novel part-aware transformer to learn relationships between parts and utilize this information to infer missing parts in incomplete point clouds. Extensive experiments demonstrate that our part-based framework significantly outperforms existing studies on real-world point cloud datasets.


# 319
T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning

Weijie Wei · Fatemeh Karimi Nejadasl · Theo Gevers · Martin R. Oswald

The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. Consequently, scholars have been actively investigating efficacious self-supervised pre-training paradigms. Nevertheless, temporal information, which is inherent in the LiDAR point cloud sequence, is consistently disregarded. To better utilize this property, we propose an effective pre-training strategy, namely Temporal Masked Auto-Encoders (T-MAE), which takes as input temporally adjacent frames and learns temporal dependency. A SiamWCA backbone, containing a Siamese encoder and a windowed cross-attention (WCA) module, is established for the two-frame input. Considering that the movement of an ego-vehicle alters the view of the same instance, temporal modeling also serves as a robust and natural data augmentation, enhancing the comprehension of target objects. SiamWCA is a powerful architecture but heavily relies on annotated data. Our T-MAE pre-training strategy alleviates its demand on annotated data. Comprehensive experiments demonstrate that T-MAE achieves the best performance on both Waymo and ONCE datasets among competitive self-supervised approaches.


# 209
Strong Double Blind
Text2LiDAR: Text-guided LiDAR Point Clouds Generation via Equirectangular Transformer

Yang Wu · Kaihua Zhang · Jianjun Qian · Jin Xie · Jian Yang

The complex traffic environment and various weather conditions make the collection of LiDAR data expensive and challenging. High-quality and controllable LiDAR data generation is therefore urgently needed; controlling generation with text is a common practice, but there is little research on it in this field. To this end, we propose Text2LiDAR, the first efficient, diverse, and text-controllable LiDAR data generation model. Specifically, we design an equirectangular transformer architecture, utilizing the designed equirectangular attention to capture LiDAR features in a manner suited to the data's characteristics. Then, we design a control-signal embedding injector to efficiently integrate control signals through a global-to-focused attention mechanism. Additionally, we devise a frequency modulator to assist the model in recovering high-frequency details, ensuring the clarity of the generated point cloud. To foster development in the field and optimize text-controlled generation performance, we construct nuLiDARtext, which offers diverse text descriptors for 34,149 LiDAR point clouds from 850 scenes. Experiments on uncontrolled and text-controlled generation in various forms on the KITTI-360 and nuScenes datasets demonstrate the superiority of our approach.


# 305
DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields

Yu Chi · Fangneng Zhan · Sibo Wu · Christian Theobalt · Adam Kortylewski

Progress in 3D computer vision tasks demands a huge amount of data, yet annotating multi-view images with 3D-consistent annotations, or point clouds with part segmentation is both time-consuming and challenging. This paper introduces DatasetNeRF, a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations, while utilizing minimal 2D human-labeled annotations. Specifically, we leverage the strong semantic prior within a 3D generative model to train a semantic decoder, requiring only a handful of fine-grained labeled samples. Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data. The generated data is applicable across various computer vision tasks, including video segmentation and 3D point cloud segmentation in both synthetic and real-world scenarios. Our approach not only surpasses baseline models in segmentation quality, achieving superior 3D consistency and segmentation precision on individual images, but also demonstrates versatility by being applicable to both articulated and non-articulated generative models. Furthermore, we explore applications stemming from our approach, such as 3D-aware semantic editing and 3D inversion.


# 289
Strong Double Blind
Computing the Lipschitz constant needed for fast scene recovery from CASSI measurements

Niels Chr. Overgaard · Anders Holst

The linear inverse problem associated with the standard model for hyperspectral image recovery from CASSI measurements is considered. This is formulated as the minimization of an objective function which is the sum of a total variation regularizer and a least squares loss function. Standard first-order iterative minimization algorithms, such as ISTA, FISTA and TwIST, require as input the value of the Lipschitz constant for the gradient of the loss function, or at least a good upper bound on this value, in order to select appropriate step lengths. For the loss term considered here, this Lipschitz constant equals the square of the largest singular value of the measurement map. In applications, this singular value is usually computed directly as the largest eigenvalue of a huge square matrix. This computation can sometimes be a bottleneck in an otherwise optimized algorithm. In this paper we effectively eliminate this bottleneck for CASSI reconstructions by showing how the Lipschitz constant can be calculated from a square matrix whose size is easily three orders of magnitudes smaller than in the direct approach.
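
For context, the quantity in question is the Lipschitz constant of the gradient of the least-squares term $\frac{1}{2}\|Ax - y\|^2$, namely $\sigma_{\max}(A)^2$. A matrix-free power iteration, as sketched below, is a common way to estimate it without forming the huge matrix explicitly; the paper's dedicated small-matrix reduction for CASSI is not reproduced here.

```python
# Matrix-free power iteration on A^T A to estimate sigma_max(A)^2,
# i.e. the Lipschitz constant of the gradient of 0.5*||A x - y||^2.
import numpy as np

def lipschitz_constant(apply_A, apply_AT, x0, iters=100):
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        x = apply_AT(apply_A(x))
        x /= np.linalg.norm(x)
    return np.linalg.norm(apply_AT(apply_A(x)))   # dominant eigenvalue of A^T A

A = np.random.rand(30, 20)
L = lipschitz_constant(lambda v: A @ v, lambda v: A.T @ v, np.random.rand(20))
print(L, np.linalg.svd(A, compute_uv=False)[0] ** 2)   # the two values should be close
```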


# 288
Strong Double Blind
Regularizing Dynamic Radiance Fields with Kinematic Fields

Woobin Im · Geonho Cha · Sebin Lee · Jumin Lee · Juhyeong Seon · Dongyoon Wee · Sungeui Yoon

This paper presents a novel approach for reconstructing dynamic radiance fields from monocular videos. We integrate kinematics with dynamic radiance fields, bridging the gap between the sparse nature of monocular videos and real-world physics. Our method introduces the kinematic field, capturing motion through kinematic quantities: velocity, acceleration, and jerk. The kinematic field is jointly learned with the dynamic radiance field by minimizing the photometric loss, without motion ground truth. We further augment our method with physics-driven regularizers grounded in kinematics, which ensure the physical validity of the predicted kinematic quantities, including the advective acceleration and jerk. Additionally, we control the motion trajectory based on rigidity equations formed with the predicted kinematic quantities. In experiments, our method outperforms the state of the art by capturing physical motion patterns within challenging real-world monocular videos. Code will be available on GitHub.
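
The kinematic quantities referred to above are presumably related by the standard material-derivative identities, reproduced here for reference (these are textbook definitions, not the paper's specific regularizers):

```latex
% Material (advective) acceleration and jerk of a velocity field v(x, t).
\mathbf{a}(\mathbf{x},t) = \frac{D\mathbf{v}}{Dt}
  = \frac{\partial \mathbf{v}}{\partial t} + (\mathbf{v}\cdot\nabla)\,\mathbf{v},
\qquad
\mathbf{j}(\mathbf{x},t) = \frac{D\mathbf{a}}{Dt}
  = \frac{\partial \mathbf{a}}{\partial t} + (\mathbf{v}\cdot\nabla)\,\mathbf{a}.
```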


# 306
Strong Double Blind
GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation

Bangyan Liao · Zhenjun Zhao · Lu Chen · Haoang Li · Daniel Cremers · Peidong Liu

Plane adjustment (PA) is important for many 3D applications, involving simultaneous pose estimation and plane recovery. Despite recent progress, it remains a challenging problem in the realm of multi-view point cloud registration. The successful convergence of current state-of-the-art methods heavily depends on good initialization. Furthermore, their time complexity renders existing approaches impractical for large-scale problems. To address these challenges, we exploit a novel optimization strategy termed Bi-Convex Relaxation for large-scale plane adjustment to improve its efficiency, convergence region, and scalability. In particular, we decouple the original complex problem into two simpler sub-problems, which are then reformulated using convex relaxation. Subsequently, we alternately solve these two sub-problems until convergence. On top of this novel optimization strategy, we propose two variants of the algorithm, namely GlobalPointer and GlobalPointer++, based on point-to-plane and plane-to-plane error, respectively. Extensive experiments on both synthetic and real datasets demonstrate that our method performs large-scale plane adjustment with linear time complexity and a larger convergence basin, even under poor initialization, while achieving accuracy similar to prior methods. We will release our code to the public for further study.


# 299
iMatching: Imperative Correspondence Learning

Chen Wang · Dasong Gao · Yun-Jou Lin · Youjie Xia · Chen Wang

Learning feature correspondence is a foundational task in computer vision, holding immense importance for downstream applications such as visual odometry and 3D reconstruction. Despite recent progress in data-driven models, feature correspondence learning is still limited by the lack of accurate per-pixel correspondence labels. To overcome this difficulty, we introduce a new self-supervised scheme, imperative learning (IL), for training feature correspondence. It enables correspondence learning on arbitrary uninterrupted videos without any camera pose or depth labels, heralding a new era for self-supervised correspondence learning. Specifically, we formulate the problem of correspondence learning as a bilevel optimization, which takes the reprojection error from bundle adjustment as a supervisory signal for the model. To avoid large memory and computation overhead, we leverage the stationary point to effectively back-propagate the implicit gradients through bundle adjustment. Through extensive experiments, we demonstrate superior performance on tasks including feature matching and pose estimation, in which we obtain an average accuracy gain of 30% over state-of-the-art matching models. The source code will be made public to benefit the community.


# 293
Strong Double Blind
Fundamental Matrix Estimation Using Relative Depths

Yaqing Ding · Václav Vávra · Snehal Bhayani · Qianliang Wu · Jian Yang · Zuzana Kukelova

We propose a novel approach to the problem of estimating the fundamental matrix from point correspondences and their relative depths. Relative depths can be approximated from the scales of local features, which are commonly available, or obtained from non-metric monocular depth estimates provided by popular deep learning-based methods. This makes the considered problem highly relevant in practice. To derive efficient solutions, we explore new geometric constraints on the fundamental matrix with known relative depths and present new algebraic constraints between the fundamental matrix and the translation vector. Using point correspondences and their relative depths, we derive novel efficient minimal solvers for two fully uncalibrated cameras, two cameras with different unknown focal lengths, and two cameras with equal unknown focal lengths. We propose different variants of these solvers based on the source of the relative depth information. We present detailed analyses and comparisons with state-of-the-art solvers, including results with 86,306 image pairs from three large-scale datasets.


# 297
Track Everything Everywhere Fast and Robustly

Yunzhou Song · Jiahui Lei · Ziyun Wang · Lingjie Liu · Kostas Daniilidis

We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for real-world applications, and it is sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than 10 times faster), robustness, and tracking accuracy over the SoTA optimization-based method OmniMotion.


# 296
Strong Double Blind
Learning to Make Keypoints Sub-Pixel Accurate

Shinjeong Kim · Marc Pollefeys · Daniel Barath

This work addresses the challenge of sub-pixel accuracy in detecting 2D local features, a cornerstone problem in computer vision. Despite the advancements brought by neural network-based methods like SuperPoint and ALIKED, these modern approaches lag behind classical ones such as SIFT in keypoint localization accuracy due to their lack of sub-pixel precision. We propose a novel network that enhances any detector with sub-pixel precision by learning an offset vector for detected features, thereby eliminating the need to design specialized sub-pixel accurate detectors. This optimization directly minimizes test-time evaluation metrics such as relative pose error. Through extensive testing with both nearest-neighbor matching and the recent LightGlue matcher across various real-world datasets, our method consistently outperforms existing methods in accuracy. Moreover, it adds only around 7 ms to the runtime of the underlying detector. We will make the code and models public.
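A minimal sketch of the detector-agnostic idea, assuming a hypothetical refiner that reads a small score-map patch around each integer detection and predicts a bounded offset; the architecture, patch size, and training loss here are our assumptions, not the paper's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubPixelRefiner(nn.Module):
    """Predicts a sub-pixel offset (clamped to +/- 0.5 px) for each keypoint
    from the local score-map patch around it."""
    def __init__(self, patch=7):
        super().__init__()
        self.patch = patch
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(16 * patch * patch, 2), nn.Tanh())

    def forward(self, score_map, keypoints):
        # score_map: (H, W); keypoints: (N, 2) integer (x, y) detections.
        r = self.patch // 2
        pad = F.pad(score_map[None, None], (r, r, r, r), mode="replicate")
        patches = torch.stack([
            pad[0, :, y:y + self.patch, x:x + self.patch]
            for x, y in keypoints.tolist()])
        offsets = 0.5 * self.net(patches)        # Tanh * 0.5 -> (-0.5, 0.5) px
        return keypoints.float() + offsets

# Hypothetical usage with a random score map and integer detections; in
# practice the refiner would be trained on a geometric loss such as pose error.
refiner = SubPixelRefiner()
score = torch.rand(120, 160)
kpts = torch.randint(0, 120, (32, 2))
sub_px = refiner(score, kpts)
```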


# 185
Strong Double Blind
Shape-guided Configuration-aware Learning for Endoscopic-image-based Pose Estimation of Flexible Robotic Instruments

YIYAO MA · Kai Chen · Hon-Sing Tong · Ruofeng Wei · Yui-Lun Ng · Ka-Wai Kwok · Qi Dou

Accurate estimation of both the external orientation and internal bending angle is crucial for understanding a soft robot state within its environment. However, existing sensor-based methods face limitations in cost, environmental constraints, and integration issues. Conventional image-based methods struggle with the shape complexity of soft robots. In this paper, we propose a novel shape-guided configuration-aware learning framework for image-based soft robot pose estimation. Inspired by the recent advances in 2D-3D joint representation learning, we leverage the 3D shape prior of the soft robot to enhance its image-based shape representation. Concretely, we first extract the part-level geometry representation of the 3D shape prior, then adapt this representation to the image by querying the image features corresponding to different robot parts. Furthermore, we present an effective mechanism to dynamically deform the shape prior. It aims to mitigate the shape difference between the adopted shape prior and the soft robot depicted in the image. This more expressive shape guidance further boosts the image-based robot representation and can be effectively used for soft robot pose refinement. Extensive experiments on surgical soft robots demonstrate the advantages of our method when compared with a series of keypoint-based, skeleton-based and direct regression-based methods.


# 184
FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models

Andrea Caraffa · Davide Boscaini · Amir Hamza · Fabio Poiesi

Estimating the 6D pose of objects unseen during training is highly desirable yet challenging. Zero-shot object 6D pose estimation methods address this challenge by leveraging additional task-specific supervision provided by large-scale, photo-realistic synthetic datasets. However, their performance heavily depends on the quality and diversity of rendered data, and they require extensive training. In this work, we show how to tackle the same task without training on task-specific data. We propose FreeZe, a novel solution that harnesses the capabilities of pre-trained geometric and vision foundation models. FreeZe leverages 3D geometric descriptors learned from unrelated 3D point clouds and 2D visual features learned from web-scale 2D images to generate discriminative 3D point-level descriptors. We then estimate the 6D pose of unseen objects by RANSAC-based 3D registration. We also introduce a novel algorithm, based on visual features, to resolve ambiguous cases caused by geometrically symmetric objects. We comprehensively evaluate FreeZe across the seven core datasets of the BOP Benchmark, which include over a hundred 3D objects and 20,000 images captured in various scenarios. FreeZe consistently outperforms all state-of-the-art approaches, including competitors extensively trained on synthetic 6D pose estimation data. We will release the source code publicly.
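A generic recipe in this spirit, shown below as a sketch: concatenate normalized geometric and visual per-point descriptors, match by nearest neighbour, and estimate the rigid pose with RANSAC plus the Kabsch closed form. The fusion rule, thresholds, and toy data are assumptions; FreeZe's actual pipeline and symmetry handling are more involved.

```python
import numpy as np

def fuse(geo_desc, vis_desc):
    """Concatenate L2-normalized geometric and visual descriptors (a generic
    late-fusion choice)."""
    g = geo_desc / np.linalg.norm(geo_desc, axis=1, keepdims=True)
    v = vis_desc / np.linalg.norm(vis_desc, axis=1, keepdims=True)
    return np.hstack([g, v])

def kabsch(src, dst):
    """Closed-form rigid transform (R, t) minimizing ||R src + t - dst||."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def ransac_register(pts_obj, desc_obj, pts_scene, desc_scene,
                    iters=500, thresh=0.01, seed=0):
    """Nearest-neighbour matching on fused descriptors, then RANSAC + Kabsch."""
    rng = np.random.default_rng(seed)
    nn = np.argmin(np.linalg.norm(
        desc_obj[:, None, :] - desc_scene[None, :, :], axis=2), axis=1)
    src, dst = pts_obj, pts_scene[nn]
    best = (np.eye(3), np.zeros(3), -1)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        inl = int(np.sum(np.linalg.norm((src @ R.T + t) - dst, axis=1) < thresh))
        if inl > best[2]:
            best = (R, t, inl)
    return best

# Hypothetical usage with random stand-ins for the real descriptors.
rng = np.random.default_rng(1)
obj_pts = rng.standard_normal((200, 3))
ang = 0.7
R_true = np.array([[np.cos(ang), -np.sin(ang), 0],
                   [np.sin(ang),  np.cos(ang), 0],
                   [0.0, 0.0, 1.0]])
scene_pts = obj_pts @ R_true.T + np.array([0.2, -0.1, 0.4])
desc = fuse(rng.standard_normal((200, 32)), rng.standard_normal((200, 128)))
R, t, inliers = ransac_register(obj_pts, desc, scene_pts, desc)
print(inliers)
```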


# 181
Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

Jiyao Zhang · Weiyao Huang · Bo Peng · Mingdong Wu · Fei Hu · Zijian Chen · Bo Zhao · Hao Dong

6D Object Pose Estimation is a critical yet challenging task in the field of computer vision, distinguished from more traditional 2D tasks by its lack of large-scale datasets. This scarcity hampers comprehensive evaluation of model performance and consequently limits research development, while also restricting the applicability of research across diverse domains due to the limited number of instances or categories available. To address these issues and facilitate progress in this field, this paper introduces Omni6DPose, a substantial dataset characterized by its diversity in object categories, large scale, and variety in object materials. Omni6DPose is divided into three main components: ROPE (Real 6D Object Pose Estimation Dataset), which includes 270,000 images annotated with over one million annotations across 600 instances in 140 categories; SOPE (Simulated 6D Object Pose Estimation Dataset), consisting of 350,000 images created in a mixed-reality setting with depth simulation, annotated with over one million annotations across 4,000 instances in the same 140 categories; and the manually aligned real scanned objects used in both ROPE and SOPE. Omni6DPose is inherently challenging due to the substantial variations and ambiguities. To address this challenge, we propose GenPose++, a novel framework incorporating a pre-trained 2D foundation model to enhance generalization capabilities and employing a diffusion-based generative approach to adeptly manage ambiguity issues. Finally, this paper provides a comprehensive benchmarking analysis to evaluate the performance of previous methods on this large-scale dataset in the realms of 6D object pose estimation and pose tracking.


# 182
Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation

Yangzheng Wu · Michael Alan Greenspan

We address the simulation-to-real domain gap in six degree-of-freedom pose estimation (6DoF PE) and propose a novel self-supervised keypoint voting-based 6DoF PE framework, effectively narrowing this gap using a learnable kernel in an RKHS. We formulate this domain gap as a distance in a high-dimensional feature space, distinct from previous iterative matching methods. We propose an adapter network that evolves the network parameters from the source domain, trained on synthetic data with synthetic poses, to the target domain, trained on real data. Importantly, the real-data training only uses pseudo-poses estimated from pseudo-keypoints, and thereby requires no real ground-truth annotations. Our proposed method, called RKHSPose, achieves state-of-the-art performance among self-supervised methods on three commonly used 6DoF PE datasets, including LINEMOD (+4.2%), Occlusion LINEMOD (+2%), and YCB-Video (+3%). It also compares favorably to fully supervised methods on all six applicable BOP core datasets, achieving within −11.3% to +0.2% of the top fully supervised results.
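For readers unfamiliar with "domain gap as a distance in an RKHS", the standard instance is the maximum mean discrepancy (MMD) with a fixed kernel, sketched below; RKHSPose learns its kernel rather than fixing bandwidths, so treat this only as background.

```python
import torch

def mmd_rbf(x, y, bandwidths=(0.5, 1.0, 2.0)):
    """A (biased) estimate of MMD^2 between feature batches x and y in the
    RKHS induced by a mixture of RBF kernels."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in bandwidths)
    m, n = len(x), len(y)
    return (k(x, x).sum() / (m * m)
            + k(y, y).sum() / (n * n)
            - 2.0 * k(x, y).sum() / (m * n))

# Hypothetical usage: measure the gap between synthetic (source) and real
# (target) feature batches; minimizing it would pull the domains together.
src_feat = torch.randn(128, 256)
tgt_feat = torch.randn(128, 256) + 0.3
print(float(mmd_rbf(src_feat, tgt_feat)))
```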


# 196
Strong Double Blind
Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images

Tianyu Luan · Zhongpai Gao · Luyuan Xie · Abhishek Sharma · Hao Ding · Benjamin Planche · Meng Zheng · Ange Lou · Terrence Chen · Junsong Yuan · Ziyan Wu

We introduce a novel bottom-up approach for human body mesh reconstruction, specifically designed to address the challenges posed by partial visibility and occlusion in input images. Traditional top-down methods, relying on whole-body parametric models like SMPL, falter when only a small part of the human is visible, as they require visibility of most of the human body for accurate mesh reconstruction. To overcome this limitation, our method employs a "Divide and Fuse (D&F)" strategy, reconstructing human body parts independently before fusing them, thereby ensuring robustness against occlusions. We design Human Part Parametric Models (HPPM) that independently reconstruct the mesh from a few shape and global location parameters, without inter-part dependency. A specially designed fusion module then seamlessly integrates the reconstructed parts, even when only a few parts are visible. We harness a large volume of ground truth SMPL data to train our parametric mesh models. To facilitate the training and evaluation of our method, we have established benchmark datasets featuring images of partially visible humans with HPPM annotations. Our experiments, conducted on our benchmark datasets, demonstrate the effectiveness of our D&F method, particularly in scenarios with substantial invisibility, where traditional approaches struggle to maintain reconstruction quality.


# 177
Strong Double Blind
GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation

Haonan Wang · Jie Liu · Jie Tang · Gangshan Wu · Bo Xu · Yanbing Chou · Yong Wang

In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches are less applicable in industrial settings due to their large parameter counts and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation primarily rely on CNNs, we propose the Group-based Token Pruning Transformer (GTPT), which fully harnesses the advantages of the Transformer. GTPT alleviates the computational burden by gradually introducing keypoints in a coarse-to-fine manner, minimizing computational overhead while ensuring high performance. Besides, GTPT groups keypoint tokens and prunes visual tokens to improve model performance while reducing redundancy. We propose Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conducted experiments on COCO and COCO-WholeBody. Compared to other methods, GTPT achieves higher performance with less computation, especially on whole-body benchmarks with numerous keypoints.


# 301
D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction

Bowen Fu · Gu Wang · Chenyangguang Zhang · Yan Di · Ziqin Huang · Zhiying Leng · Fabian Manhardt · Xiangyang Ji · Federico Tombari

Reconstructing hand-held objects from a single RGB image is a challenging task in computer vision. In contrast to prior works that utilize deterministic modeling paradigms, we employ a point cloud denoising diffusion model to account for the probabilistic nature of this problem. At its core, we introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction (D-SCo), tackling two predominant challenges. First, to prevent the object centroid from deviating, we utilize a novel hand-constrained centroid-fixing paradigm, enhancing the stability of the diffusion and reverse processes and the precision of feature projection. Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions with a novel unified hand-object semantic embedding, enhancing reconstruction of the hand-occluded regions of the object. Experiments on the synthetic ObMan dataset and three real-world datasets (HO3D, MOW, and DexYCB) demonstrate that our approach surpasses all other state-of-the-art methods. Code will be released.


# 169
Strong Double Blind
Event-based Head Pose Estimation: Benchmark and Method

jiahui yuan · Hebei Li · Yansong Peng · Jin Wang · Yuheng Jiang · Yueyi Zhang · Xiaoyan Sun

Head pose estimation (HPE) is crucial for various applications, including human-computer interaction, augmented reality, and driver monitoring. However, traditional RGB-based methods struggle in challenging conditions like sudden movement and extreme lighting. Event cameras, as a neuromorphic sensor, have the advantages of high temporal resolution and high dynamic range, offering a promising solution for HPE. However, the lack of paired event and head pose data hinders the full potential of event-based HPE. To address this, we introduce two large-scale, diverse event-based head pose datasets encompassing 282 sequences across different resolutions and scenarios. Furthermore, we propose the event-based HPE network, featuring two novel modules: the Event Spatial-Temporal Fusion (ESTF) module and the Event Motion Perceptual Attention (EMPA) module. The ESTF module effectively combines spatial and temporal information from the event streams, while the EMPA module captures crucial motion details across the scene using a large receptive field. We also propose a unified loss function to optimize the network using both angle and rotation matrix information. Extensive experiments demonstrate the superior performance of our network on both datasets, showcasing its effectiveness in handling HPE across various challenging scenarios. The datasets and code are available at https://github.com/Jiahui-Yuan-1/EVHPE


# 194
Parameterized Quasi-Physical Simulators for Dexterous Manipulations Transfer

Xueyi Liu · Kangbo Lyu · jieqiong zhang · Tao Du · Li Yi

We explore the dexterous manipulation transfer problem by designing simulators. The task aims to transfer human manipulations to dexterous robot hand simulations and is inherently difficult due to its intricate, highly constrained, and discontinuous dynamics, and the need to control a high-DoF dexterous hand to accurately replicate human manipulations. Previous approaches that optimize in high-fidelity black-box simulators, or in modified simulators with relaxed constraints, demonstrate only limited capabilities or are restricted by insufficient simulation fidelity. We introduce parameterized quasi-physical simulators and a physics curriculum to overcome these limitations. The key ideas are 1) balancing between fidelity and optimizability of the simulation via a curriculum of parameterized simulators, and 2) solving the problem in each simulator of the curriculum, with properties ranging from high task optimizability to high fidelity. We successfully enable a dexterous hand to track complex and diverse manipulations in high-fidelity simulated environments, boosting the success rate by more than 11% over the best-performing baseline. We include \href{https://quasi-physical-sims.github.io/quasi-physical-sims-for-dex-manip/}{a website} to introduce the work.


# 285
Strong Double Blind
RAW-Adapter: Adapting Pretrained Visual Model to Camera RAW Images

Ziteng Cui · Tatsuya Harada

sRGB images are now the predominant choice for pre-training visual models in computer vision research, owing to their ease of acquisition and efficient storage. Meanwhile, the advantage of RAW images lies in their rich physical information under variable real-world lighting conditions. For computer vision tasks operating directly on camera RAW data, most existing studies integrate an image signal processor (ISP) with backend networks, yet often overlook the interaction between the ISP stages and the subsequent networks. Drawing inspiration from ongoing adapter research in the NLP and CV areas, we introduce RAW-Adapter, a novel approach aimed at adapting sRGB pre-trained models to camera RAW data. RAW-Adapter comprises input-level adapters that employ learnable ISP stages to adjust RAW inputs, as well as model-level adapters that build connections between the ISP stages and subsequent high-level networks. Additionally, RAW-Adapter is a general framework that can be used with various computer vision frameworks. Extensive experiments under different lighting conditions show our algorithm's state-of-the-art (SOTA) performance, demonstrating its effectiveness and efficiency across a range of real-world and synthetic datasets. We will release our source code.
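A minimal sketch of what "learnable ISP stages" can look like, assuming a toy pipeline of white balance, a color-correction matrix, and gamma, all differentiable and trained on the downstream loss; RAW-Adapter's actual input- and model-level adapters are more elaborate, and every name here is hypothetical.

```python
import torch
import torch.nn as nn

class TinyLearnableISP(nn.Module):
    """Per-channel white balance, 3x3 color correction, and gamma, all learnable."""
    def __init__(self):
        super().__init__()
        self.wb_gain = nn.Parameter(torch.ones(3))       # white-balance gains
        self.ccm = nn.Parameter(torch.eye(3))            # color correction matrix
        self.log_gamma = nn.Parameter(torch.zeros(1))    # gamma = exp(log_gamma) > 0

    def forward(self, raw_rgb):
        # raw_rgb: (B, 3, H, W) demosaiced linear RAW in [0, 1].
        x = raw_rgb * self.wb_gain.view(1, 3, 1, 1)
        x = torch.einsum("ij,bjhw->bihw", self.ccm, x)
        x = x.clamp(min=1e-6) ** torch.exp(self.log_gamma)
        return x.clamp(0.0, 1.0)

# The output can be fed to any sRGB-pretrained backbone, with the ISP
# parameters trained end-to-end on the downstream task loss.
isp = TinyLearnableISP()
out = isp(torch.rand(2, 3, 64, 64))
```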


# 303
Strong Double Blind
Easing 3D Pattern Reasoning with Side-view Features for Semantic Scene Completion

Linxi Huan · Mingyue Dong · Linwei Yue · Shuhan Shen · Xianwei Zheng

This paper proposes a side-view context inpainting strategy (SidePaint) to ease the reasoning of unknown 3D patterns for semantic scene completion. Based on the observation that the learning burden on pattern completion increases with spatial complexity and feature sparsity, the SidePaint strategy is designed to decompose the complex 3D pattern learning into easier 2D context inpainting with dense feature volumes. Specifically, our approach densely lifts image features into 3D volume space with distance-aware projection, and reasons missing patterns in 2D side-view feature maps sliced from feature volumes. With the learning burden relieved by decreasing pattern complexity in 2D space, our SidePaint strategy enables more effective semantic completion than directly learning 3D patterns. Extensive experiments demonstrate the effectiveness of our SidePaint strategy on several challenging semantic scene completion benchmarks.


# 292
Strong Double Blind
Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions

Fabio Tosi · Pierluigi Zama Ramirez · Matteo Poggi

We present a novel approach designed to address the complexities posed by challenging, out-of-distribution data in the single-image depth estimation task, including adverse weather conditions and non-Lambertian objects. Starting with images that facilitate depth prediction due to the absence of unfavorable factors, we systematically generate new, user-defined scenes with a comprehensive set of challenges and associated depth information. This is achieved by leveraging cutting-edge conditioned diffusion models, known for their ability to synthesize high-quality image content from textual prompts while preserving the coherence of the 3D structure between generated and source imagery. Subsequent fine-tuning of any monocular depth network, either supervised or self-supervised, is carried out through a self-distillation protocol that takes into account images generated using our strategy and its own depth predictions on simple, unchallenging scenes. Experimental results on benchmarks tailored for our purposes demonstrate the effectiveness and versatility of our proposal, showing its distinctive ability to simultaneously address adverse weather settings and non-Lambertian objects, and to deliver competitive results with respect to specialized state-of-the-art solutions designed exclusively for each individual challenge.


# 316
Strong Double Blind
GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth

Aurélien Cecille · Stefan Duffner · Franck DAVOINE · Thibault Neveu · Rémi Agier

Monocular depth estimation has greatly improved in recent years, but models predicting metric depth still struggle to generalize across diverse camera poses and datasets. While recent supervised methods mitigate this issue by leveraging ground prior information at inference, their adaptability to self-supervised settings is limited by the additional challenge of scale recovery. Addressing this gap, we propose a novel constraint on ground areas designed specifically for the self-supervised paradigm. This mechanism not only allows the scale to be recovered accurately but also ensures coherence between the depth prediction and the ground prior. Experimental results show that our method surpasses existing scale recovery techniques on the KITTI benchmark and significantly enhances model generalization. This improvement is reflected in more robust performance across diverse camera rotations and in zero-shot adaptability to previously unseen driving datasets such as DDAD.
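As background on ground-prior scale recovery in general (GroCo builds its constraint into self-supervised training rather than applying it post hoc), the sketch below fits a plane to back-projected ground pixels of a relative depth map and scales it so the plane sits at the known camera height. The synthetic example and all names are assumptions.

```python
import numpy as np

def recover_scale(depth, ground_mask, K, camera_height):
    """Global scale for a relative depth map such that the ground region lies
    on a plane at the known camera height."""
    rows, cols = np.nonzero(ground_mask)
    z = depth[rows, cols]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Back-project ground pixels to 3D (camera frame, arbitrary scale).
    pts = np.stack([(cols - cx) / fx * z, (rows - cy) / fy * z, z], axis=1)
    # Distance from the camera origin to the best-fit ground plane.
    centroid = pts.mean(axis=0)
    normal = np.linalg.svd(pts - centroid)[2][-1]
    plane_dist = abs(normal @ centroid)
    return camera_height / plane_dist   # multiply the depth map by this factor

# Hypothetical usage with a synthetic flat ground 1.5 m below the camera.
H, W = 96, 128
K = np.array([[100.0, 0, W / 2], [0, 100.0, H / 2], [0, 0, 1.0]])
v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
true_depth = np.full((H, W), 10.0)
ground = v > H / 2 + 5
true_depth[ground] = 1.5 * 100.0 / (v[ground] - H / 2)   # ground plane at y = 1.5 m
rel_depth = true_depth / 4.0                             # unknown scale of 1/4
print(recover_scale(rel_depth, ground, K, camera_height=1.5))  # ~4.0
```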


# 315
Strong Double Blind
Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry

Shengjie Zhu · Girish Chandar Ganesan · Abhinav Kumar · Xiaoming Liu

3D sensing is a fundamental task for autonomous vehicles. Its deployment often relies on aligned RGB cameras and LiDAR. Despite meticulous synchronization and calibration, systematic misalignment persists in the LiDAR-projected depthmap due to the physical baseline between the two sensors. The artifact typically appears as background LiDAR points incorrectly overlaid onto foreground objects such as cars and pedestrians. The KITTI dataset uses stereo cameras as a heuristic solution, but most AV datasets, including nuScenes, Waymo, and DDAD, lack stereo images, making the KITTI solution inapplicable. This work proposes a parameter-free analytical solution to remove the projective artifacts. We construct a binocular vision system between a hypothesized virtual LiDAR camera and the RGB camera, and remove the projective artifacts by determining epipolar occlusion with the proposed analytical solution. We show unanimous improvement in state-of-the-art (SoTA) monocular depth estimators and 3D object detectors when using the artifact-free depthmaps. Our code and the processed depthmaps of major AV datasets will be publicly available.
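For context on the artifact itself, a common heuristic cleanup is a z-buffer style filter over projected points, sketched below; this is not the paper's analytical epipolar solution, merely the kind of baseline it improves upon. Cell size, gap threshold, and data are assumptions.

```python
import numpy as np

def remove_occluded_lidar(uv, depth, grid=4, rel_gap=0.1):
    """Within each small pixel cell, drop projected LiDAR points that are much
    farther than the closest point in that cell, since they are likely
    background seen around a foreground occluder."""
    cell = (uv // grid).astype(np.int64)
    keys = cell[:, 1] * 100000 + cell[:, 0]          # one integer id per cell
    keep = np.ones(len(uv), dtype=bool)
    order = np.argsort(keys)
    groups = np.split(order, np.unique(keys[order], return_index=True)[1][1:])
    for idx in groups:
        zmin = depth[idx].min()
        keep[idx] = depth[idx] <= zmin * (1.0 + rel_gap)
    return keep

# Hypothetical usage: uv are (N, 2) pixel coordinates of projected LiDAR
# points and depth their camera-frame depths.
rng = np.random.default_rng(0)
uv = rng.uniform(0, 640, (5000, 2))
depth = rng.uniform(5.0, 40.0, 5000)
mask = remove_occluded_lidar(uv, depth)
print(mask.mean())
```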


# 328
Strong Double Blind
Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

ye junyan · Zhutao Lv · Li Weijia · Jinhua Yu · Haote Yang · Huaping Zhong · Conghui He

Cross-view geo-localization identifies the geographic location of street-view images by matching them against a geo-referenced satellite database. Significant challenges arise from the drastic appearance and geometry differences between the two views. In this paper, we propose a new approach for cross-view image geo-localization, the Panorama-BEV Co-Retrieval Network. Specifically, by exploiting the ground-plane assumption and geometric relations, we convert street-view panorama images into the BEV view, reducing the gap between street panoramas and satellite imagery. Alongside the existing retrieval between street-view panoramas and satellite images, we introduce a BEV-to-satellite retrieval branch for collaborative retrieval. By retaining the original street-view retrieval branch, we overcome the limited perception range of the BEV representation, so that our network perceives both the global layout and the local details around the street-view capture location. Additionally, we introduce CVGlobal, a global cross-view dataset that is closer to real-world scenarios: street-view directions are not aligned with the satellite images, and the dataset includes cross-regional, cross-temporal, and street-view-to-map retrieval tests, enabling a comprehensive evaluation of algorithm performance. Our method excels in multiple tests on common cross-view datasets such as CVUSA, CVACT, VIGOR, and our newly introduced CVGlobal, surpassing the current state-of-the-art approaches. The code and datasets can be found at \url{https://github.com/yejy53/EP-BEV}.
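To illustrate the ground-plane projection in general terms, the sketch below maps an equirectangular panorama onto a flat ground plane by computing, for every BEV cell, the azimuth and (below-horizon) elevation of the corresponding ground point. The coordinate conventions, grid size, and camera height are our assumptions; the paper's exact geometry may differ.

```python
import numpy as np

def panorama_to_bev(pano, cam_height=2.0, bev_size=201, meters_per_px=0.1):
    """Project an equirectangular panorama onto the ground plane (flat-world
    assumption) to obtain a BEV image by nearest-neighbour sampling."""
    H, W, _ = pano.shape
    half = (bev_size - 1) / 2.0
    xs = (np.arange(bev_size) - half) * meters_per_px     # right of the camera
    ys = (half - np.arange(bev_size)) * meters_per_px     # forward
    X, Y = np.meshgrid(xs, ys)
    r = np.sqrt(X ** 2 + Y ** 2) + 1e-9
    azimuth = np.arctan2(X, Y)                            # [-pi, pi], 0 = forward
    elevation = np.arctan2(-cam_height, r)                # below the horizon
    u = ((azimuth / (2 * np.pi) + 0.5) * (W - 1)).astype(np.int64)
    v = ((0.5 - elevation / np.pi) * (H - 1)).astype(np.int64)
    return pano[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)]

# Hypothetical usage with a random panorama.
bev = panorama_to_bev(np.random.rand(512, 1024, 3))
print(bev.shape)   # (201, 201, 3)
```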


# 302
Strong Double Blind
CountFormer: Multi-View Crowd Counting Transformer

Hong Mo · Xiong Zhang · Jianchao Tan · Cheng Yang · Qiong Gu · Bo Hang · Wenqi Ren

Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in situations characterized by heavy occlusion and severe perspective distortions. However, hand-crafted heuristic features and identical camera layout requirements in conventional MVC methods limit their applicability and scalability. In this work, we propose a concise 3D MVC framework called \textbf{CountFormer} that elevates multi-view image-level features to a scene-level volume representation and estimates the 3D density map based on the volume features. By incorporating a camera encoding strategy, CountFormer embeds camera parameters into the volume query and image-level features, enabling it to handle camera layouts with significant differences. Furthermore, we introduce a feature-lifting module that capitalizes on the attention mechanism to transform image-level features into a 3D volume representation for each camera view. The multi-view volume aggregation module then attentively aggregates the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across various widely used datasets, demonstrating its greater suitability for real-world deployment compared to conventional MVC frameworks.


# 86
Strong Double Blind
When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset

Yi Zhang · Wang Zeng · Sheng Jin · Chen Qian · Ping Luo · Wentao Liu

Recent years have witnessed increasing research attention towards pedestrian detection that takes advantage of different sensor modalities (\eg RGB, IR, Depth, LiDAR and Event). However, designing a unified generalist model that can effectively process diverse sensor modalities remains a challenge. This paper introduces MMPedestron, a novel generalist model for multimodal perception. Unlike previous specialist models that only process one or a pair of specific modality inputs, MMPedestron is able to process multiple modal inputs and their dynamic combinations. The proposed approach comprises a unified encoder for modal representation and fusion and a general head for pedestrian detection. We introduce two extra learnable tokens, \ie MAA and MAF, for adaptive multi-modal feature fusion. In addition, we construct the MMPD dataset, the first large-scale benchmark for multi-modal pedestrian detection. This benchmark incorporates existing public datasets and a newly collected dataset called EventPed, covering a wide range of sensor modalities including RGB, IR, Depth, LiDAR, and Event data. With multi-modal joint training, our model achieves state-of-the-art performance on a wide range of pedestrian detection benchmarks, surpassing leading models tailored for specific sensor modalities. For example, it achieves 71.1 AP on COCO-Persons and 72.6 AP on LLVIP. Notably, our model achieves comparable performance to the InternImage-H model on CrowdHuman with $30\times$ fewer parameters. Codes and data are available at \url{https://github.com/BubblyYi/MMPedestron}.


# 317
Strong Double Blind
MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation

Xiaoshuai Hao · Ruikai Li · Hui Zhang · Rong Yin · Dingzhe Li · Sangil Jung · Seung-In Park · ByungIn Yoo · Haimei Zhao · Jing Zhang

Online high-definition (HD) map construction is an important and challenging task in autonomous driving. Recently, there has been growing interest in cost-effective multi-view camera-based methods that do not rely on other sensors such as LiDAR. However, these methods suffer from a lack of explicit depth information, necessitating large models to achieve satisfactory performance. To address this, we employ the Knowledge Distillation (KD) idea for efficient HD map construction for the first time and introduce a novel KD-based approach called MapDistill to transfer knowledge from a high-performance camera-LiDAR fusion model to a lightweight camera-only model. Specifically, we adopt the teacher-student architecture, i.e., a camera-LiDAR fusion model as the teacher and a lightweight camera model as the student, and devise a dual BEV transform module to facilitate cross-modal knowledge distillation while maintaining cost-effective camera-only deployment. Additionally, we present a comprehensive distillation scheme encompassing cross-modal relation distillation, dual-level feature distillation, and map-head distillation. This approach alleviates knowledge transfer challenges between modalities, enabling the student model to learn improved feature representations for HD map construction. Experimental results on the challenging nuScenes dataset demonstrate the effectiveness of MapDistill, which surpasses existing competitors by over 7.7 mAP while offering a 4.5x speedup. The source code will be released.
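The feature-level part of such cross-modal distillation can be pictured with the generic loss below: project student BEV features to the teacher's channel width and penalize the gap against detached teacher features. This is a common KD building block, not MapDistill's full scheme (which also includes relation- and head-level terms); all sizes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """L2 distillation between student and (detached) teacher BEV features."""
    def __init__(self, c_student, c_teacher):
        super().__init__()
        self.proj = nn.Conv2d(c_student, c_teacher, kernel_size=1)

    def forward(self, student_bev, teacher_bev):
        # student_bev: (B, Cs, H, W); teacher_bev: (B, Ct, H, W).
        return F.mse_loss(self.proj(student_bev), teacher_bev.detach())

# Hypothetical usage with camera-only (student) and fusion (teacher) BEV maps.
loss_fn = FeatureDistillLoss(c_student=64, c_teacher=256)
loss = loss_fn(torch.randn(2, 64, 100, 100), torch.randn(2, 256, 100, 100))
```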


# 321
Strong Double Blind
4D Contrastive Superflows are Dense 3D Representation Learners

Xiang Xu · Lingdong Kong · Hui Shuai · Wenwei Zhang · Liang Pan · Kai Chen · Ziwei Liu · Qingshan Liu

In the realm of autonomous driving, accurate 3D perception is the foundation. However, developing such models relies on extensive human annotations -- a process that is both costly and labor-intensive. To address this challenge from a data representation learning perspective, we introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing spatiotemporal pretraining objectives. SuperFlow stands out by integrating two key designs: 1) a dense-to-sparse consistency regularization, which promotes insensitivity to point cloud density variations during feature learning, and 2) a flow-based contrastive learning module, carefully crafted to extract meaningful temporal cues from readily available sensor calibrations. To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances the alignment of the knowledge distilled from camera views. Extensive comparative and ablation studies across 11 heterogeneous LiDAR datasets validate our effectiveness and superiority. Additionally, we observe several interesting emerging properties by scaling up the 2D and 3D backbones during pretraining, shedding light on the future research of 3D foundation models for LiDAR-based perception. Our code is attached to the supplementary material and will be made publicly available.


# 318
Strong Double Blind
TCC-Det: Temporarily consistent cues for weakly-supervised 3D detection

Jan Skvrna · Lukas Neumann

Accurate object detection in LiDAR point clouds is a key prerequisite of robust and safe autonomous driving and robotics applications. Training 3D object detectors currently requires manually annotating vast amounts of training data, which is very time-consuming and costly. As a result, the amount of readily available annotated training data is limited, and these annotated datasets likely do not contain edge-case or otherwise rare instances, simply because the probability of them occurring in such a small dataset is low. In this paper, we propose a method to train a 3D object detector without any manual annotations, by exploiting existing off-the-shelf vision components and by using the consistency of the world around us. The method can therefore be used to train a 3D detector by only collecting sensor recordings in the real world, which is extremely cheap and allows training with orders of magnitude more data than traditional fully supervised methods. The method is evaluated on both the KITTI validation and test sets, where it outperforms all previous weakly supervised methods and narrows the gap to methods using human 3D labels.


# 327
Strong Double Blind
CARB-Net: Camera-Assisted Radar-Based Network for Vulnerable Road User Detection

Wei-Yu Lee · Martin Dimitrievski · David Van Hamme · Jan Aelterman · Ljubomir Jovanov · Wilfried Philips

Ensuring a reliable perception of vulnerable road users is crucial for safe Autonomous Driving. Radar stands out as an appealing sensor choice due to its resilience in adverse weather, cost-effectiveness, depth sensing capabilities, and established role in adaptive cruise control. Nevertheless, radar's limited angular resolution poses challenges in object recognition, especially in distinguishing targets in close proximity. To tackle this limitation, we present the Camera-Assisted Radar-Based Network (CARB-Net), a novel and efficient framework that merges the angular accuracy of a camera with the robustness and depth sensing capabilities of radar. We integrate camera detection information through a ground plane feed-forward array, entangling it with the early stages of a radar-based detection network. Furthermore, we introduce a unique context learning approach to ensure graceful degradation in situations of poor radar Doppler information or unfavorable camera viewing conditions. Experimental validations on two datasets, along with benchmark comparisons, showcase CARB-Net's superiority, boasting up to a 12% improvement in mAP performance. A series of ablation studies further emphasize the efficacy of the CARB-Net architecture.


# 324
Strong Double Blind
SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving

Qingwen Zhang · Yi Yang · Peizheng Li · Olov Andersson · Patric Jensfelt

Scene flow estimation predicts the 3D motion at each point in successive LiDAR scans. This detailed, point-level, information can help autonomous vehicles to accurately predict and understand dynamic changes in their surroundings. Current state-of-the-art methods require annotated data to train scene flow networks and the expense of labeling inherently limits their scalability. Self-supervised approaches can overcome the above limitations, yet face two principal challenges that hinder optimal performance: point distribution imbalance and disregard for object-level motion constraints. In this paper, we propose SeFlow, a self-supervised method that integrates efficient dynamic classification into a learning-based scene flow pipeline. We demonstrate that classifying static and dynamic points helps design targeted objective functions for different motion patterns. We also emphasize the importance of internal cluster consistency and correct object point association to refine the scene flow estimation, in particular on object details. Our real-time capable method achieves state-of-the-art performance on the self-supervised scene flow task on Argoverse 2 and Waymo datasets. The code is open-sourced at AnonymousforReview along with trained model weights.


# 332
Strong Double Blind
RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception

Shen Jianbing · Chunliang Li · Wencheng Han · Junbo Yin · Sanyuan Zhao

In the domain of autonomous driving, the concurrent processing of multiple 3D perception tasks within the same spatiotemporal scene poses a significant challenge, in particular due to the computational inefficiencies and feature competition between tasks when using traditional multi-task learning approaches. This paper addresses these issues by proposing a novel unified representation, RepVF, which harmonizes the representation of various perception tasks such as 3D object detection and 3D lane detection within a single framework. RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model that significantly reduces computational redundancy and feature competition. Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks by utilizing a hierarchical structure of queries that implicitly model the relationships both between and within tasks. This approach eliminates the need for task-specific heads and parameters, fundamentally reducing the conflicts inherent in traditional multi-task learning paradigms. We validate our approach by combining labels from the OpenLane dataset with the Waymo Open dataset. Our work presents a significant advancement in the efficiency and effectiveness of multi-task perception in autonomous driving, offering a new perspective on handling multiple 3D perception tasks synchronously and in parallel.


# 326
Strong Double Blind
TrafficNight : An Aerial Multimodal Benchmark For Nighttime Vehicle Surveillance

Guoxing Zhang · Yiming Liu · xiaoyu yang · Chao Huang · HUANG Hailong

In autonomous simulation and surveillance, realistic scenarios are crucial for advancing object detection algorithms. Existing aerial datasets suffer from sample class imbalance, especially in larger vehicles like trucks, and unrealistic lighting conditions. This hampers progress in driving behavior analysis and imitation. To address these limitations, we introduce a novel multimodal vehicle surveillance dataset, integrating aerial thermal infrared and sRGB imagery. It contributes: (1) A novel thermal infrared vehicle detection benchmark, ensuring robust object detection in nighttime lighting conditions. (2) Thermal infrared surveillance videos paired with corresponding HD-MAPs for improved multi-vehicle tracking. (3) Specialized annotations for semi-trailers, precisely documenting their movement trajectories and physical coordinates. TrafficNight significantly advances understanding of larger vehicles in traffic dynamics, serving as a benchmark for enhancing Autopilot systems and traffic surveillance in challenging environments.


# 314
RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes

Thang-Anh-Quan Nguyen · Luis G Roldao Jimenez · Nathan Piasco · Moussab Bennehar · Dzmitry Tsishkou

The task of separating dynamic objects from static environments using NeRFs has been widely studied in recent years. However, capturing large-scale scenes still poses a challenge due to their complex geometric structures and unconstrained dynamics. Without the help of 3D motion cues, previous methods often require simplified setups with slow camera motion and only a few or a single dynamic actor, leading to suboptimal solutions in most urban setups. To overcome such limitations, we present RoDUS, a pipeline for decomposing static and dynamic elements in urban scenes, with thoughtfully separated NeRF models for moving and non-moving components. Our approach utilizes a robust kernel-based initialization coupled with 4D semantic information to selectively guide the learning process. This strategy enables accurate capture of the scene dynamics, reducing the artifacts that NeRFs produce in background reconstruction, all with self-supervision. Notably, experimental evaluations on the KITTI-360 and Pandaset datasets demonstrate the effectiveness of our method in decomposing challenging urban scenes into precise static and dynamic components.


# 313
Strong Double Blind
Monocular Occupancy Prediction for Scalable Indoor Scenes

Hongxiao Yu · Yuqi Wang · Yuntao Chen · Zhaoxiang Zhang

Camera-based 3D occupancy prediction has recently garnered increasing attention in outdoor driving scenes. However, research on indoor scenes remains relatively unexplored. The core differences in indoor scenes lie in the complexity of scene scale and the variance in object size. In this paper, we propose a novel method, named ISO, for predicting indoor scene occupancy from monocular images. ISO harnesses the advantages of a pretrained depth model to achieve accurate depth predictions. Subsequently, it employs a Dual Feature Line of Sight Projection (D-FLoSP) module to facilitate the learning of 3D voxel features. Additionally, we introduce a large-scale occupancy benchmark for indoor scenes, titled Occ-ScanNet. With a dataset size 40 times larger than the NYUv2 dataset, it facilitates future scalable research in indoor scene analysis. Experimental results on both NYUv2 and Occ-ScanNet demonstrate that our method achieves state-of-the-art performance. The dataset and code will be made publicly available.


# 322
Strong Double Blind
nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding

Benjin Zhu · zhe wang · Hongsheng LI

Existing benchmarks for 3D semantic occupancy prediction in autonomous driving are limited by low resolution (up to [512×512×40] with 0.2m voxel size) and inaccurate annotations, hindering the unification of 3D scene understanding through the occupancy representation. Moreover, previous methods can only generate occupancy predictions at 0.4m resolution or lower, requiring post-upsampling to reach their full resolution (0.2m). The root of these limitations lies in the sparsity, noise, and even errors present in the raw data. In this paper, we overcome these challenges by introducing nuCraft, a high-resolution and accurate semantic occupancy dataset derived from nuScenes. nuCraft offers an 8× increase in resolution ([1024 × 1024 × 80] with voxel size of 0.1m) and more precise semantic annotations compared to previous benchmarks. To address the high memory cost of high-resolution occupancy prediction, we propose VQ-Occ, a novel method that encodes occupancy data into a compact latent feature space using a VQ-VAE. This approach simplifies semantic occupancy prediction into feature simulation in the VQ latent space, making it easier and more memory-efficient. Our method enables direct generation of semantic occupancy fields at high resolution without post-upsampling, facilitating a more unified approach to 3D scene understanding. We validate the superior quality of nuCraft and the effectiveness of VQ-Occ through extensive experiments, demonstrating significant advancements over existing benchmarks and methods.
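For readers unfamiliar with the VQ-VAE machinery the abstract refers to, the quantization step itself is standard and sketched below (nearest codebook entry plus the straight-through gradient trick); the occupancy encoder/decoder and how VQ-Occ uses the latent space are specific to the paper and not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-codebook-entry quantization with commitment loss and a
    straight-through estimator, as used in VQ-VAEs."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):
        # z: (N, dim) latent vectors from an encoder.
        d = torch.cdist(z, self.codebook.weight)   # (N, num_codes) distances
        idx = d.argmin(dim=1)
        z_q = self.codebook(idx)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()               # straight-through gradients
        return z_q, idx, loss

# Hypothetical usage on a batch of latent vectors.
vq = VectorQuantizer()
z_q, codes, vq_loss = vq(torch.randn(1024, 64))
```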


# 156
Strong Double Blind
Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks

Sehwan Choi · Jun Won Choi · JUNGHO KIM · Hongjae Shin

Predicting vectorized high-definition (HD) maps online is useful for autonomous driving, providing detailed geometric and semantic information on the surrounding road environment. In this paper, we introduce Mask2Map, a novel end-to-end online HD map construction method. Our approach identifies semantic components within a scene represented in the bird's eye view (BEV) domain and then generates a precise vectorized map topology based on this information. Mask2Map comprises two main components: an Instance-level Mask Prediction Network (IMPNet) and a Mask-Driven Map Prediction Network (MMPNet). IMPNet generates a mask-aware query capable of producing BEV segmentation masks, while MMPNet accurately constructs vectorized map components, leveraging the semantic geometric information provided by the mask-aware query. To enhance HD map predictions, we design innovative modules for MMPNet based on outputs from IMPNet. We present a Positional Feature Generator that generates instance-level positional features by utilizing the comprehensive spatial context from the semantic components of each instance. We also propose a Geometric Feature Extractor, which extracts point-level geometric features using sparse key points pooled from the segmentation masks. Furthermore, we present a denoising training strategy for inter-network consistency to boost the performance of map construction. Our evaluation on the nuScenes and Argoverse2 benchmarks demonstrates that Mask2Map achieves a remarkable improvement over previous state-of-the-art methods, by 10.1 mAP and 4.1 mAP, respectively. The code will be available soon.


# 325
CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting

Jiezhi Yang · Khushi P Desai · Charles Packer · Harshil bhatia · Nicholas Rhinehart · Rowan McAllister · Joseph E Gonzalez

We propose CARFF, Conditional Auto-encoded Radiance Field for 3D Scene Forecasting, a method for predicting future 3D scenes given past observations. Our method maps 2D ego-centric images to a distribution over plausible 3D latent scene configurations and predicts the evolution of hypothesized scenes through time. Our latents condition a global Neural Radiance Field (NeRF) to represent a 3D scene model, enabling explainable predictions and straightforward downstream planning. This approach models the world as a POMDP and considers complex scenarios of uncertainty in environmental states and dynamics. Specifically, we employ a two-stage training of Pose-Conditional-VAE and NeRF to learn 3D representations, and auto-regressively predict latent scene representations utilizing a mixture density network. We demonstrate the utility of our method in scenarios using the CARLA driving simulator, where CARFF enables efficient trajectory and contingency planning in complex multi-agent autonomous driving scenarios involving occlusions. Video and code are available at: www.carff.website.


# 331
Strong Double Blind
Neural Volumetric World Models for Autonomous Driving

Zanming Huang · Jimuyang Zhang · Eshed Ohn-Bar

Effectively navigating a dynamic 3D world requires a comprehensive understanding of the 3D geometry and motion of surrounding objects and layouts. However, existing methods for perception and planning in autonomous driving primarily rely on a 2D spatial representation based on a bird's-eye perspective of the scene, which is insufficient for modeling motion characteristics and decision-making in real-world 3D settings with occlusion, partial observability, subtle motions, and differing terrains. Motivated by this key insight, we present a novel framework for learning end-to-end autonomous driving based on volumetric representations. Our proposed neural volumetric world modeling approach, NeMo, can be trained in a self-supervised manner on image reconstruction and occupancy prediction tasks, benefiting scalable training and deployment paradigms such as imitation learning. Specifically, we demonstrate how higher-fidelity modeling of the 3D volumetric representation benefits vision-based motion planning. We further propose a motion flow module to model complex dynamic scenes, enabling additional robust spatial-temporal consistency supervision. Moreover, a temporal attention module is introduced to effectively integrate predicted future volumetric features for the planning task. Our proposed sensorimotor agent achieves state-of-the-art motion planning performance in open-loop evaluation on nuScenes, outperforming prior baseline methods by over 18% in L_2 error.


# 340
Strong Double Blind
Progressive Pretext Task Learning for Human Trajectory Prediction

Xiaotong Lin · Tianming Liang · Jian-Huang Lai · Jian-Fang Hu

Human trajectory prediction is a practical task of predicting the future positions of pedestrians on the road, which typically covers all temporal ranges from short-term to long-term within a trajectory. However, existing works attempt to address the entire trajectory prediction with a singular, uniform training paradigm, neglecting the distinction between short-term and long-term dynamics in human trajectories. To overcome this limitation, we introduce a novel Progressive Pretext Task learning (PPT) framework, which progressively enhances the model's capacity to capture short-term dynamics and long-term dependencies for the final entire-trajectory prediction. Specifically, we elaborately design three stages of training tasks in the PPT framework. In the first stage, the model learns to comprehend the short-term dynamics through a stepwise next-position prediction task. In the second stage, the model is further enhanced to understand long-term dependencies through a destination prediction task. In the final stage, the model addresses prediction of the entire future trajectory by taking full advantage of the knowledge from previous stages. To alleviate knowledge forgetting, we further apply cross-task knowledge distillation. Additionally, we design a Transformer-based trajectory predictor, which achieves high-efficiency reasoning by integrating a two-step reasoning strategy and a group of parallel position-specific prompt embeddings. We conduct extensive experiments on four popular benchmarks, and the results demonstrate that our approach achieves state-of-the-art performance with high efficiency.


# 335
Strong Double Blind
Risk-Aware Self-Consistent Imitation Learning for Trajectory Planning in Autonomous Driving

Yixuan Fan · Ya-Li Li · Shengjin Wang

Planning for the ego vehicle is the ultimate goal of autonomous driving. Although deep learning-based methods have been widely applied to predict future trajectories of other agents in traffic scenes, directly using them to plan for the ego vehicle is often unsatisfactory. This is due to misaligned objectives during training and deployment: a planner that only aims to imitate human driver trajectories is insufficient to accomplish driving tasks well. We argue that existing training processes may not endow models with an understanding of how the physical world evolves. To address this gap, we propose RaSc, which stands for Risk-aware Self-consistent imitation learning. RaSc not only imitates driving trajectories, but also learns the motivations behind human driver behaviors (to be risk-aware) and the consequences of its own actions (by being self-consistent). These two properties stem from our novel prediction branch and training objectives regarding Time-To-Collision (TTC). Moreover, we enable the model to better mine hard samples during training by checking its self-consistency. Our experiments on the large-scale real-world nuPlan dataset demonstrate that RaSc outperforms previous state-of-the-art learning-based methods, in both open-loop and, more importantly, closed-loop settings.
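Since the abstract centers its training objectives on Time-To-Collision, here is the standard constant-velocity TTC between two disc-shaped agents as a reference; how RaSc wires TTC into its prediction branch and losses is specific to the paper. The radius and the toy scenario below are assumptions.

```python
import numpy as np

def time_to_collision(p_ego, v_ego, p_other, v_other, radius=2.0):
    """Smallest t >= 0 at which two constant-velocity disc agents come within
    2 * radius of each other, or inf if they never do."""
    dp = np.asarray(p_other, dtype=float) - np.asarray(p_ego, dtype=float)
    dv = np.asarray(v_other, dtype=float) - np.asarray(v_ego, dtype=float)
    a = dv @ dv
    b = 2.0 * dp @ dv
    c = dp @ dp - (2.0 * radius) ** 2
    if c <= 0:
        return 0.0                      # already overlapping
    if a < 1e-12:
        return np.inf                   # no relative motion
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return np.inf                   # closest approach stays above 2 * radius
    t = (-b - np.sqrt(disc)) / (2.0 * a)
    return t if t >= 0 else np.inf

# Hypothetical usage: ego closing at 10 m/s on a stopped agent 50 m ahead.
print(time_to_collision([0, 0], [10, 0], [50, 0], [0, 0]))   # ~4.6 s
```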


# 337
Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries

WEI-JER Chang · Francesco Pittaluga · Masayoshi TOMIZUKA · Wei Zhan · Manmohan Chandraker

Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail safety-critical traffic scenarios. However, traditional methods for generating such scenarios often fall short in terms of controllability and realism, and neglect the dynamics of agent interactions. To mitigate these limitations, we introduce Safe-Sim, a novel diffusion-based controllable closed-loop safety-critical simulation framework. Our approach yields two distinct advantages: 1) the generation of realistic long-tail safety-critical scenarios that closely emulate real-world conditions, and 2) enhanced controllability, enabling more comprehensive and interactive evaluations. We develop a novel approach to simulate safety-critical scenarios through an adversarial term in the denoising process, which allows an adversarial agent to challenge a planner with plausible maneuvers, while all agents in the scene exhibit reactive and realistic behaviors. Furthermore, we propose novel guidance objectives and a partial diffusion process that enables a user to control key aspects of the generated scenarios such as the collision type and aggressiveness of the adversarial driver while maintaining the realism of the behavior. We validate our framework empirically using the NuScenes dataset, demonstrating improvements in both realism and controllability. These findings affirm that diffusion models provide a robust and versatile foundation for safety-critical, interactive traffic simulation, extending their utility across the broader landscape of autonomous driving.


# 68
Strong Double Blind
Towards Dual Transparent Liquid Level Estimation in Biomedical Lab: Dataset, Methods and Practice

Xiayu Wang · Ke Ma · Ruiyun Zhong · Xinggang Wang · Yi Fang · Yang Xiao · Tian Xia

“Dual Transparent Liquid” refers to a liquid and its container, both being transparent. Accurately estimating the level of such a liquid from arbitrary viewpoints is fundamental and crucial, especially in AI-guided autonomous biomedical laboratories for tasks like liquid dispensing, aspiration, and mixing. However, current methods for estimating liquid level focus on scenarios with a single instance captured from a fixed view. We propose a new paradigm for dual transparent liquid level estimation, including a dataset, methods, and practices. The dual transparent liquid dataset, named DTLD, comprises 27,458 images with four object instances captured from multiple views across three biomedical lab scenes. Based on DTLD, we propose an end-to-end learning method for detecting the liquid contact line, followed by an approach to estimate the liquid level. To enhance contact line detection, a color rectification module is proposed to stabilize the color distribution in the local region of the air-liquid interface. Our method surpasses the current best approach, reducing the mean absolute percentage error by 43.4%. The dataset and code will be available at https://github.com/dualtransparency/TCLD.


# 193
TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Yufu Wang · Ziyun Wang · Lingjie Liu · Kostas Daniilidis

We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by 60% from prior work.


# 291
Strong Double Blind
Temporally Consistent Stereo Matching

Jiaxi Zeng · Chengtang Yao · Yuwei Wu · Yunde Jia

Stereo matching provides depth estimation from binocular images for downstream applications. These applications mostly take video streams as input and require temporally consistent depth maps. However, existing methods mainly focus on estimation at the single-frame level. This commonly leads to temporally inconsistent results, especially in ill-posed regions. In this paper, we aim to exploit temporal information to improve the temporal consistency and accuracy of stereo matching. To this end, we build a temporally consistent stereo matching network, which includes two stages. In the first stage, we leverage temporal information to obtain a well-initialized disparity. In the second stage, we iteratively refine the disparity based on the temporal initialization. Specifically, we propose a temporal disparity completion module, which completes a semi-dense disparity map transformed from the previous moment. Then, we use a temporal state fusion module to fuse the state of the completion module and the hidden state of the refinement from the previous frame, providing a coherent state for further refinement. Based on this coherent state, we introduce a dual-space refinement module to iteratively refine the initialized result in both the disparity space and the disparity gradient space, improving the estimations in ill-posed regions. Extensive experiments demonstrate that our method effectively alleviates temporal inconsistency and enhances accuracy and efficiency. At present, our method ranks second on the KITTI 2015 benchmark, while achieving superior efficiency compared to other state-of-the-art methods.


# 255
Retrieval Robust to Object Motion Blur

Rong Zou · Marc Pollefeys · Denys Rozumnyi

Moving objects are frequently seen in daily life and usually appear blurred in images due to their motion. While general object retrieval is a widely explored area in computer vision, it primarily focuses on sharp and static objects, and retrieval of motion-blurred objects in large image collections remains unexplored. We propose a method for object retrieval in images that are affected by motion blur. The proposed method learns a robust representation capable of matching blurred objects to their deblurred versions and vice versa. To evaluate our approach, we present the first large-scale datasets for blurred object retrieval, featuring images with objects exhibiting varying degrees of blur in various poses and scales. We conducted extensive experiments, showing that our method outperforms state-of-the-art retrieval methods on the new blur-retrieval datasets, which validates the effectiveness of the proposed approach.


# 248
Strong Double Blind
Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions

Weng Fei Low · Gim Hee Lee

The stark contrast in the design philosophy of an event camera makes it particularly well suited to operating under high-speed, high-dynamic-range and low-light conditions, where standard cameras underperform. Nonetheless, contrary to common belief, event cameras still suffer from some amount of motion blur, especially under these challenging conditions. This is attributed to the limited bandwidth of the event sensor pixel, which is mostly proportional to the light intensity. Thus, to ensure that event cameras can truly excel in the conditions where they have an edge over standard cameras, it is crucial to account for event motion blur in downstream applications, especially reconstruction. However, none of the recent works on reconstructing Neural Radiance Fields (NeRFs) from events, nor event simulators, have considered the full effects of event motion blur. To this end, we propose Deblur e-NeRF, a novel method to directly and effectively reconstruct blur-minimal NeRFs from motion-blurred events generated under high-speed motion or low-light conditions. The core component of this work is a physically accurate pixel bandwidth model that accounts for event motion blur under arbitrary speed and lighting conditions. We also introduce a novel threshold-normalized total variation loss to improve the regularization of large textureless patches. Experiments on real and novel, realistically simulated sequences verify the effectiveness of our method. Our code, event simulator and synthetic event dataset will be open-sourced.


# 242
Strong Double Blind
CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring

Taewoo Kim · Hoonhee Cho · Kuk-Jin Yoon

Video deblurring aims to enhance the quality of restored results in motion-blurred videos by effectively gathering information from adjacent video frames to compensate for the insufficient data in a single blurred frame. However, when faced with consecutively severe motion blur situations, frame-based video deblurring methods often fail to find accurate temporal correspondence among neighboring video frames, leading to diminished performance. To address this limitation, we aim to solve the video deblurring task by leveraging an event camera with micro-second temporal resolution. To fully exploit the dense temporal resolution of the event camera, we propose two modules: 1) intra-frame feature enhancement, which operates within the exposure time of a single blurred frame, iteratively enhancing cross-modality features in a recurrent manner to better utilize the rich temporal information of events; 2) inter-frame temporal feature alignment, which gathers valuable long-range temporal information for target frames, aggregating sharp features by leveraging the advantages of events. In addition, we present a novel dataset composed of real-world blurred RGB videos, corresponding sharp videos, and event data. This dataset serves as a valuable resource for evaluating event-guided deblurring methods. We demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods through extensive experiments conducted on both synthetic and real-world deblurring datasets. The code and dataset will be made publicly available upon acceptance.


# 276
Strong Double Blind
Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework

Shengqi Xu · Run Sun · Yi Chang · Shuning Cao · Xueyao Xiao · Luxin Yan

Long-range imaging inevitably suffers from atmospheric turbulence, which introduces severe geometric distortions due to the random refraction of light. The further the distance, the more severe the disturbance. Although existing research has made great progress in tackling short-range turbulence, less attention has been paid to long-range turbulence with significant distortions. To address this dilemma and advance the field, we construct a large-scale real long-range atmospheric turbulence dataset (RLR-AT), including 1500 turbulence sequences spanning distances from 1 km to 13 km. Compared to existing datasets, RLR-AT offers turbulence over longer distances and with higher diversity, and scenes with greater variety and larger scale. Moreover, most existing work adopts either registration-based or decomposition-based methods to address distortions through one-step mitigation. However, they fail to effectively handle long-range turbulence due to its significant pixel displacements. In this work, we propose a coarse-to-fine framework that couples dynamic turbulence and static background priors (CDSP) to handle severe distortions. On the one hand, we discover the pixel motion statistical prior of turbulence and propose a frequency-aware reference frame for better large-scale distortion registration, greatly reducing the burden of refinement. On the other hand, we take advantage of the static prior of the background and propose a subspace-based low-rank tensor refinement model to eliminate the misalignments inevitably left by registration while preserving details. The dynamic and static priors complement each other, allowing us to progressively mitigate long-range turbulence with severe distortions. Extensive experiments demonstrate that the proposed method outperforms SOTA methods on different datasets.


# 178
Diffusion Reward: Learning Rewards via Conditional Video Diffusion

Tao Huang · Guangqi Jiang · Yanjie Ze · Huazhe Xu

Learning rewards from expert videos offers an affordable and effective solution to specify the intended behaviors for reinforcement learning (RL) tasks. In this work, we propose Diffusion Reward, a novel framework that learns rewards from expert videos via conditional video diffusion models for solving complex visual RL problems. Our key insight is that lower generative diversity is exhibited when conditioning diffusion on expert trajectories. Diffusion Reward is accordingly formalized as the negative of the conditional entropy, which encourages productive exploration of expert behaviors. We show the efficacy of our method over 10 robotic manipulation tasks from MetaWorld and Adroit with visual input and sparse reward. Moreover, Diffusion Reward can even solve unseen tasks successfully and effectively, largely surpassing baseline methods.


# 188
Strong Double Blind
HUMOS: Human Motion Model Conditioned on Body Shape

Shashank Tripathi · Omid Taheri · Christoph Lassner · Michael J. Black · Daniel Holden · Carsten Stoll

Generating realistic human motion is an important task in many computer vision and graphics applications. The rich diversity of human body shapes and sizes significantly influences how people move. However, existing motion models typically ignore these differences and use a normalized, average body size. This leads to a homogenization of motion across human bodies that limits diversity and that may not align with their physical attributes. We propose a novel approach to learn a generative motion model conditioned on body shape. We demonstrate that it is possible to learn such a model from unpaired training data using cycle consistency and intuitive physics and stability constraints that model the correlation between identity and movement. The resulting model produces diverse, physically plausible, dynamically stable human motions that are quantitatively and qualitatively more realistic than the existing state of the art.


# 187
Strong Double Blind
PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-based Motion Capture

Zhuojun Li · Chun Yu · Chen Liang · Yuanchun Shi

The data scarcity problem is a crucial factor that hampers the model performance of IMU-based human motion capture. However, effective data augmentation for IMU-based motion capture is challenging, since it has to capture the physical relations and constraints of the human body, while maintaining the data distribution and quality. We propose PoseAugment, a novel pipeline incorporating VAE-based pose generation and physical optimization. Given a pose sequence, the VAE module generates an unlimited number of poses with both high fidelity and diversity, while preserving the data distribution. The physical module optimizes poses to satisfy physical constraints with minimal motion restrictions. High-quality IMU data are then synthesized from the augmented poses for training motion capture models. Experiments show that PoseAugment outperforms previous data augmentation and pose generation methods in terms of motion capture accuracy, revealing a strong potential of our method to alleviate the data collection burden for IMU-based motion capture and related tasks driven by human poses.


# 186
Large Motion Model for Unified Multi-Modal Motion Generation

Mingyuan Zhang · Daisheng Jin · Chenyang Gu · Fangzhou Hong · Zhongang Cai · Jingfang Huang · Chongzhi Zhang · Xinying Guo · Lei Yang · Ying He · Ziwei Liu

Human motion generation, a cornerstone technique in animation and video production, has widespread applications in various tasks like text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored for each task without scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Architecture: We design an articulated attention mechanism, ArtAttention, that incorporates body part-aware modeling into the Diffusion Transformer backbone. 3) Pre-Training: We propose a novel pre-training strategy for LMM, which employs variable frame rates and masking forms, to better exploit knowledge from diverse training data. Extensive experiments demonstrate that our generalist LMM achieves competitive performance across various standard motion generation tasks over state-of-the-art specialist models. Notably, LMM exhibits strong generalization capabilities and emerging properties across many unseen tasks. Additionally, our ablation studies reveal valuable insights about training and scaling up large motion models for future research.


# 192
Realistic Human Motion Generation with Cross-Diffusion Models

Zeping Ren · Shaoli Huang · Xiu Li

We introduce the Cross Human Motion Diffusion Model (CrossDiff), a novel approach for generating high-quality human motion based on textual descriptions. Our method integrates 3D and 2D information using a shared transformer network within the training of the diffusion model, unifying motion noise into a single feature space. This enables cross-decoding of features into both 3D and 2D motion representations, regardless of their original dimension. The primary advantage of CrossDiff is its cross-diffusion mechanism, which allows the model to reverse either 2D or 3D noise into clean motion during training. This capability leverages the complementary information in both motion representations, capturing intricate human movement details often missed by models relying solely on 3D information. Consequently, CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences. In our experiments, our model demonstrates competitive state-of-the-art performance on text-to-motion benchmarks. Moreover, our method consistently provides enhanced motion generation quality, capturing complex full-body movement intricacies. Additionally, with a pretrained model, our approach accommodates the use of in-the-wild 2D motion data without 3D motion ground truth during training to generate 3D motion, highlighting its potential for broader applications and efficient use of available data resources.


# 190
Strong Double Blind
Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions

Yijun Qian · Jack Urbanek · Alexander Hauptmann · Jungdam Won

The field of 3D human motion generation from natural language descriptions, known as Text2Motion, has gained significant attention for its potential application in industries such as film, gaming, and AR/VR. To tackle a key challenge in Text2Motion, the deficiency of 3D human motions and their corresponding textual descriptions, we built a novel large-scale 3D human motion dataset, LaViMo, extracted from in-the-wild web videos and action recognition datasets. LaViMo is approximately 3.3 times larger and encompasses a much broader range of actions than the largest available 3D motion dataset. We then introduce a novel multi-task framework, TMT (Text Motion Translator), aimed at generating faithful 3D human motions from natural language descriptions, especially focusing on complicated actions and those absent from the training set. In contrast to prior works, TMT is uniquely regularized by multiple tasks, including Text2Motion, Motion2Text, Text2Text, and Motion2Motion. This multi-task regularization significantly bolsters the model's robustness and enhances its motion modeling and semantic understanding abilities. Additionally, we devised an augmentation method for the textual descriptions using Large Language Models. This augmentation significantly enhances the model's capability to interpret open-vocabulary descriptions while generating motions. The results demonstrate substantial improvements over existing state-of-the-art methods, particularly in handling diverse and novel motion descriptions, laying a strong foundation for future research in the field.


# 199
Generating Human Interaction Motions in Scenes with Text Control

Hongwei Yi · Justus Thies · Michael J. Black · Xue Bin Peng · Davis Rempe

We present TeSMo, a framework for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate this process, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interaction, as well as the realism and variety of the generated motions. Code will be released upon publication of this work.


# 167
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Bolin Lai · Fiona Ryan · Wenqi Jia · Miao Liu · James Rehg

Egocentric gaze anticipation serves as a key building block for the emerging capability of Augmented Reality. Notably, gaze behavior is driven by both visual cues and audio signals during daily activities. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that adopts two modules to separately capture audio-visual correlations in spatial and temporal dimensions, and applies a contrastive loss on the re-weighted audio-visual features from the fusion modules for representation learning. We conduct extensive ablation studies and thorough analysis using two egocentric video datasets, Ego4D and Aria, to validate our model design. We demonstrate that audio improves performance by +2.5% and +2.4% on the two datasets. Our model also outperforms the prior state-of-the-art methods by at least +1.9% and +1.6%. Moreover, we provide visualizations to show the gaze anticipation results and provide additional insights into audio-visual representation learning.


# 220
Strong Double Blind
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Santiago Pascual · Chunghsin YEH · Ioannis Tsiamas · Joan Serrà

Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have explored the progression from conditioning sound generators on still images to conditioning on video features, focusing on quality and semantic matching while ignoring synchronization, or sacrificing some amount of quality to focus on improving synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results while remaining competitive with the state of the art of non-codec generative audio models.


# 217
PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control

Yong Zhong · Min Zhao · Zebin You · Xiaofeng Yu · Changwang Zhang · Chongxuan Li

In this paper, we introduce PoseCrafter, a one-shot method for personalized video generation following the control of flexible poses. Built upon Stable Diffusion and ControlNet, we carefully design an inference process to produce high-quality videos without the corresponding ground-truth frames. First, we select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation. Then, we insert the corresponding training pose into the target pose sequences to enhance faithfulness through a trained temporal attention module. Furthermore, to alleviate the face and hand degradation resulting from discrepancies between poses of training videos and inference poses, we implement simple latent editing through an affine transformation matrix involving facial and hand landmarks. Extensive experiments on several datasets demonstrate that PoseCrafter achieves superior results to baselines pre-trained on a vast collection of videos under 8 commonly used metrics. Besides, PoseCrafter can follow poses from different individuals or artificial edits and simultaneously retain the human identity in an open-domain training video.


# 216
MoVideo: Motion-Aware Video Generation with Diffusion Models

Jingyun Liang · Yuchen Fan · Kai Zhang · Radu Timofte · Luc Van Gool · Rakesh Ranjan

While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the latter describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that either exists or is generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, the optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.


# 226
FreeInit: Bridging Initialization Gap in Video Diffusion Models

Tianxing Wu · Chenyang Si · Yuming Jiang · Ziqi Huang · Ziwei Liu

Though video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that contributes to the inference quality drop. Our key findings are: 1) the spatial-temporal frequency distribution of the initial latent's signal-to-noise ratio (SNR) at inference is intrinsically different from that at training, and 2) the denoising process is significantly influenced by the low-frequency component of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves the temporal consistency of videos generated by diffusion models. By iteratively refining the spatial-temporal low-frequency component of the initial latent during inference, FreeInit is able to compensate for the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation results of various text-to-video generation models without additional training.
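
A rough sketch of the low-frequency noise re-initialization idea, assuming video latents shaped (B, C, T, H, W) and using a simple box low-pass mask; the `cutoff` fraction, mask shape, and function name are assumptions of this sketch, not the paper's exact filter.

```python
import torch

def reinit_low_freq(latent, fresh_noise, cutoff=0.25):
    """Keep the low spatio-temporal frequencies of the current latent and
    replace its high frequencies with freshly sampled Gaussian noise."""
    dims = (-3, -2, -1)  # (frames, height, width)
    freq_lat = torch.fft.fftshift(torch.fft.fftn(latent, dim=dims), dim=dims)
    freq_noise = torch.fft.fftshift(torch.fft.fftn(fresh_noise, dim=dims), dim=dims)

    T, H, W = latent.shape[-3:]
    mask = torch.zeros_like(freq_lat.real)
    t0, h0, w0 = (max(1, int(T * cutoff) // 2),
                  max(1, int(H * cutoff) // 2),
                  max(1, int(W * cutoff) // 2))
    # Centered box mask marking the "low frequency" band after fftshift.
    mask[..., T // 2 - t0:T // 2 + t0,
              H // 2 - h0:H // 2 + h0,
              W // 2 - w0:W // 2 + w0] = 1.0

    mixed = freq_lat * mask + freq_noise * (1.0 - mask)
    mixed = torch.fft.ifftshift(mixed, dim=dims)
    return torch.fft.ifftn(mixed, dim=dims).real
```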


# 219
DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing

Hyeonho Jeong · Jinho Chang · GEON YEONG PARK · Jong Chul Ye

Text-driven diffusion-based video editing presents a unique challenge not encountered in image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match space-time self-similarities of the original video and the edited video during the score distillation. Thanks to the use of score distillation, our approach is model-agnostic, which can be applied for both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion. Codes and data will be released upon publication. (Project page: https://anony12anony34.github.io/ )
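
A sketch of a space-time self-similarity matching term of the kind described above, assuming flattened space-time feature tokens for the original and edited videos; the function names and shapes are assumptions, and the term would be added to the score-distillation objective.

```python
import torch
import torch.nn.functional as F

def self_similarity(feats):
    """feats: (N, D) space-time tokens (e.g. flattened T*H*W features).
    Returns the cosine self-similarity matrix of shape (N, N)."""
    f = F.normalize(feats, dim=-1)
    return f @ f.t()

def ssim_matching_loss(feats_original, feats_edited):
    """Penalize deviation between the self-similarity maps of the original
    and edited videos, acting as a structure/motion preservation term."""
    target = self_similarity(feats_original).detach()  # original video is fixed
    return F.mse_loss(self_similarity(feats_edited), target)
```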


# 223
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

Xiang Fan · Anand Bhattad · Ranjay Krishna

We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantically, spatially, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc., with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.


# 227
ReNoise: Real Image Inversion Through Iterative Noising

Daniel Garibi · Or Patashnik · Andrey Voynov · Hadar Averbuch-Elor · Danny Cohen-Or

Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities. However, applying these methods to real images necessitates the inversion of the images into the domain of the pretrained diffusion model. Achieving faithful inversion remains a challenge, particularly for more recent models trained to generate images with a small number of denoising steps. In this work, we introduce an inversion method with a high quality-to-operation ratio, enhancing reconstruction accuracy without increasing the number of operations. Building on reversing the diffusion sampling process, our method employs an iterative renoising mechanism at each inversion sampling step. This mechanism refines the approximation of a predicted point along the forward diffusion trajectory, by iteratively applying the pretrained diffusion model, and averaging these predictions. We evaluate the performance of our ReNoise technique using various sampling algorithms and models, including recent accelerated diffusion models. Through comprehensive evaluations and comparisons, we show its effectiveness in terms of both accuracy and speed. Furthermore, we confirm that our method preserves editability by demonstrating text-driven image editing on real images.
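
A sketch of one inversion step with iterative renoising, assuming a deterministic DDIM-style update and a hypothetical `eps_model(z, t)` noise predictor; the averaging schedule here is simplified relative to whatever the paper actually uses.

```python
import torch

@torch.no_grad()
def renoise_inversion_step(eps_model, z_t, t, t_next, alphas_cumprod, num_renoise=3):
    """Invert from timestep t to the noisier timestep t_next, refining the
    noise estimate by re-evaluating the model at the current approximation
    of z_{t_next} and averaging the predictions."""
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]

    def invert(eps):
        # Deterministic DDIM inversion from z_t using a given noise estimate.
        z0 = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        return a_next.sqrt() * z0 + (1 - a_next).sqrt() * eps

    eps = eps_model(z_t, t)          # initial estimate at the current point
    z_next = invert(eps)
    eps_avg = torch.zeros_like(eps)
    for _ in range(num_renoise):     # renoise: re-predict at the forward point
        eps = eps_model(z_next, t_next)
        eps_avg += eps / num_renoise
        z_next = invert(eps)
    return invert(eps_avg)           # final step uses the averaged estimate
```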


# 117
Strong Double Blind
Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting

Yu Liu · Fatimah binti Khalid · Lei Wang · Youxi Zhang · Cunrui Wang

Electronic writing tools, while enhancing convenience, sacrifice the readability and efficiency of handwritten content. Balancing high efficiency with readable handwriting poses a challenging research task. In this paper, we propose a sequence-based method to beautify users' handwritten traces. Unlike most existing methods that treat Chinese handwriting as images and cannot reflect the human writing process, we capture individual writing characteristics from a small amount of user handwriting trajectories and beautify the user's traces by mimicking their writing style and process. We fully consider the style of radicals and components between the content and reference glyphs, assigning appropriate fine-grained styles to strokes in the content glyphs through a cross-attention module. Additionally, we find that many style features contribute minimally to the final stylized results. Therefore, we decompose the style features into the Cartesian product of single-dimensional variable sets, effectively removing redundant features with limited impact on the stylization effect while preserving key style information. Qualitative and quantitative experiments both demonstrate the superiority of our approach.


# 118
Strong Double Blind
One-Shot Diffusion Mimicker for Handwritten Text Generation

Gang Dai · Yifan Zhang · Quhui Ke · Qiangya Guo · Shuangping Huang

Existing handwritten text generation methods often require more than ten handwriting samples as style references. However, in practical applications, users tend to prefer a handwriting generation model that operates with just a single reference sample for its convenience and efficiency. This approach, known as ''one-shot generation'', significantly simplifies the process but poses a significant challenge due to the difficulty of accurately capturing a writer's style from a single sample, especially when extracting fine details from the characters' edges amidst sparse foreground and undesired background noise. To address this problem, we propose a One-shot Diffusion Mimicker (One-DM) to generate handwritten text that can mimic any calligraphic style with only one reference sample. Inspired by the fact that high-frequency information of the individual sample often contains distinct style patterns (e.g., character slant and letter joining), we develop a novel style-enhanced module to improve the style extraction by incorporating high-frequency components from a single sample. We then fuse the style features with the text content as a merged condition for guiding the diffusion model to produce high-quality handwritten text images. Extensive experiments demonstrate that our method can successfully generate handwriting scripts with just one sample reference in multiple languages, even outperforming previous methods using over ten samples. Our source code will be made publicly available.
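
A minimal sketch of extracting a high-frequency component from the single style sample, with a plain Laplacian filter standing in for whatever high-pass operator the paper actually uses; input shape and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def high_frequency(img):
    """Laplacian high-pass over a style reference image (B, C, H, W),
    exposing stroke edges such as slant and letter joins."""
    k = torch.tensor([[0., -1., 0.],
                      [-1., 4., -1.],
                      [0., -1., 0.]], dtype=img.dtype, device=img.device)
    k = k.view(1, 1, 3, 3).repeat(img.shape[1], 1, 1, 1)  # one filter per channel
    return F.conv2d(img, k, padding=1, groups=img.shape[1])
```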


# 111
Investigating Style Similarity in Diffusion Models

Gowthami Somepalli · Anubhav Anubhav · Kamal Gupta · Shramay Palta · Micah Goldblum · Jonas A. Geiping · Abhinav Shrivastava · Tom Goldstein

Generative models are now widely used by graphic designers and artists. Prior works have shown that these models tend to remember and often replicate content from the training data during generation. Hence, as their proliferation increases, it has become important to perform a database search, before a generated image is used for professional purposes, to determine whether its properties are attributable to specific training data. Existing tools for this purpose focus largely on retrieving images of similar semantic content. Meanwhile, many artists are concerned with the extent of style replication in text-to-image models. We present a framework to understand and extract style descriptors from images. Our framework comprises a new dataset curated using the insight that style is a subjective property of an image capturing complex yet meaningful interactions of factors including but not limited to colors, textures, shapes, etc. We also propose a method to extract style descriptors that can be used to attribute the style of a generated image to the images used in the training dataset of a text-to-image model. We show promising results in various style retrieval tasks. We also quantitatively and qualitatively analyze style attribution and matching in the Stable Diffusion model.


# 200
Strong Double Blind
DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation

Yi-Hao Peng · Faria Huq · Yue Jiang · Jason Wu · Xin Yue Li · Jeffrey Bigham · Amy Pavel

Enabling machines to understand structured visuals like slides and user interfaces is essential for making them accessible to people with disabilities. However, achieving such understanding computationally has required manual data collection and annotation, which is time-consuming and labor-intensive. To overcome this challenge, we present a method to generate synthetic, structured visuals with target labels using code generation. Our method allows people to create datasets with built-in labels and train models with a small number of human-annotated examples. We demonstrate performance improvements in three tasks for understanding slides and UIs: recognizing visual elements, describing visual content, and classifying visual content types.


# 128
Strong Double Blind
PartCraft: Crafting Creative Objects by Parts

Kam Woh Ng · Xiatian Zhu · Yi-Zhe Song · Tao Xiang

This paper propels creative control in generative visual AI by allowing users to "select". Departing from traditional text or sketch-based methods, we for the first time allow users to choose visual part concepts for their creative endeavors. The outcome is fine-grained generation that precisely captures selected visual part concepts, ensuring a holistically faithful and plausible result. To achieve this, we first parse objects into parts through unsupervised feature clustering. Then, we encode parts into text tokens and introduce an entropy-based normalized attention loss that operates on them. This loss design enables our model to learn generic prior topology knowledge about an object's part composition, and further generalize to novel part compositions to ensure the generation looks holistically faithful. Lastly, we employ a bottleneck encoder to project the part tokens; this not only enhances fidelity but also accelerates learning by leveraging shared knowledge and facilitating information exchange among instances. Visual results in the paper and supplementary material showcase the compelling power of PartCraft in crafting highly customized, innovative creations.


# 215
DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators

Hanyang Kong · Dongze Lian · Michael Bi Mi · Xinchao Wang

We introduce DreamDrone, a novel zero-shot and training-free pipeline for generating unbounded flythrough scenes from textual prompts. Different from other methods that focus on warping images frame by frame, we advocate explicitly warping the intermediate latent code of the pre-trained text-to-image diffusion model for high-quality image generation and generalization ability. To further enhance the fidelity of the generated images, we also propose a feature-correspondence-guidance diffusion process and a high-pass filtering strategy to promote geometric consistency and high-frequency detail consistency, respectively. Extensive experiments reveal that DreamDrone significantly surpasses existing methods, delivering highly authentic scene generation with exceptional visual quality, without training or fine-tuning on datasets or reconstructing 3D point clouds in advance.


# 116
Strong Double Blind
WAS: Dataset and Methods for Artistic Text Segmentation

Xudong Xie · Yuzhe Li · Yang Liu · Zhifei Zhang · Zhaowen Wang · Wei Xiong · Xiang Bai

Accurate text segmentation results are crucial for text-related generative tasks, such as text image generation, text editing, text removal, and text style transfer. Recently, some scene text segmentation methods have made significant progress in segmenting regular text. However, these methods perform poorly in scenarios containing artistic text. Therefore, this paper focuses on the more challenging task of artistic text segmentation and constructs a real artistic text segmentation dataset. One challenge of the task is that the local stroke shapes of artistic text are changeable with diversity and complexity. We propose a decoder with the layer-wise momentum query to prevent the model from ignoring stroke regions of special shapes. Another challenge is the complexity of the global topological structure. We further design a skeleton-assisted head to guide the model to focus on the global structure. Additionally, to enhance the generalization performance of the text segmentation model, we propose a strategy for training data synthesis, based on the large multi-modal model and the diffusion model. Experimental results show that our proposed method and synthetic dataset can significantly enhance the performance of artistic text segmentation and achieve state-of-the-art results on other public datasets. We will release our datasets and source code.


# 114
Strong Double Blind
GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections

Shiyue Zhang · Zheng Chong · Xujie Zhang · Hanhui Li · Yuhao Cheng · yiqiang yan · Xiaodan Liang

General text-to-image models bring revolutionary innovation to the fields of arts, design, and media. However, when applied to garment generation, even the state-of-the-art text-to-image models suffer from fine-grained semantic misalignment, particularly concerning the quantity, position, and interrelations of garment components. Addressing this, we propose GarmentAligner, a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections. To achieve semantic alignment at the component level, we introduce an automatic component extraction pipeline to obtain spatial and quantitative information of garment components from corresponding images and captions. Subsequently, to exploit component relationships within the garment images, we construct retrieval subsets for each garment by retrieval augmentation based on component-level similarity ranking and conduct contrastive learning to enhance the model perception of components from positive and negative samples. To further enhance the alignment of components across semantic, spatial, and quantitative granularities, we propose the utilization of multi-level correction losses that leverage detailed component information. The experimental findings demonstrate that GarmentAligner achieves superior fidelity and fine-grained semantic alignment when compared to existing competitors.


# 106
PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Junsong Chen · Chongjian GE · Enze Xie · Yue Wu · Lewei Yao · Xiaozhe Ren · Zhongdao Wang · Ping Luo · Huchuan Lu · ZHENGUO LI

In this paper, we introduce PixArt-Sigma, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Sigma represents a significant advancement over its predecessor, PixArt-Alpha, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Sigma is its training efficiency. Leveraging the foundational pre-training of PixArt-Alpha, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Sigma are twofold: (1) High-Quality Training Data: PixArt-Sigma incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Sigma achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Sigma's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
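
A sketch of the key/value token-compression idea inside a self-attention block, assuming a strided depthwise convolution as the compressor; the layer choice, stride, and class name are assumptions of this sketch, not the paper's exact module.

```python
import torch
import torch.nn as nn

class KVCompressedSelfAttention(nn.Module):
    """Self-attention whose keys and values are spatially downsampled
    before attention, reducing the quadratic cost at high resolution."""
    def __init__(self, dim, num_heads=8, kv_stride=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Strided depthwise conv that shrinks the key/value token grid.
        self.kv_down = nn.Conv2d(dim, dim, kernel_size=kv_stride,
                                 stride=kv_stride, groups=dim)

    def forward(self, x, hw):
        # x: (B, N, C) image tokens with N = H * W given by `hw`.
        B, N, C = x.shape
        H, W = hw
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.kv_down(kv).flatten(2).transpose(1, 2)  # (B, N / stride^2, C)
        out, _ = self.attn(query=x, key=kv, value=kv)     # queries stay full-resolution
        return out
```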


# 105
Strong Double Blind
HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Shanyan Guan · Yanhao Ge · Ying Tai · Jian Yang · Wei Li · Mingyu You

Subject-driven generation for text-to-image diffusion models aims to encode and invert specific textual prompts in order to generate personalized images with particular content. Previous studies have achieved this goal by using an optimization-based textual inversion or a direct-regression-based concept encoding strategy. However, challenges remain in realizing fast and effective prompt inversion while preserving the generalization of the original diffusion models. Motivated by the advantages of both optimization-based and direct-regression-based methods, we propose a novel hybrid prompt inversion framework, HybridBooth, for efficient subject-driven generation with text-to-image diffusion models. In detail, we address the limitations of current optimization-based and direct-regression-based methods by designing a hybrid prompt inversion framework and combining it with a mask-guided multi-word text encoding module to enable fast and robust prompt inversion. Additionally, we introduce a hybrid textual feature fusion module to enhance the representation of the textual features during learning. As a result, our framework can invert arbitrary visual concepts into a pre-trained diffusion model effectively and quickly, even when learning from a single image, while maintaining the general generation ability of the original model. Extensive experiments demonstrate the effectiveness of our method.


# 108
Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Reyhane Askari Hemmat · Melissa Hall · Alicia Yi Sun · Candace Ross · Michal Drozdzal · Adriana Romero-Soriano

With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference-time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample as compared to a "memory bank" of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst-performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world.
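
The guidance builds on the Vendi Score diversity measure; the sketch below shows only that underlying score under the usual cosine-kernel formulation (not the full c-VSG guidance term), with function and argument names chosen for this example.

```python
import torch
import torch.nn.functional as F

def vendi_score(features):
    """Vendi Score of a set of image embeddings (n, d): the exponential of
    the Shannon entropy of the eigenvalues of the trace-normalized
    similarity (kernel) matrix. Higher means more diverse."""
    x = F.normalize(features, dim=-1)
    K = (x @ x.t()) / x.shape[0]                 # cosine kernel, trace = 1
    lam = torch.linalg.eigvalsh(K).clamp_min(0)  # eigenvalues sum to 1
    entropy = -(lam * torch.log(lam.clamp_min(1e-12))).sum()
    return torch.exp(entropy)
```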


# 110
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm

Yi Wu · Ziqiang Li · Heliang Zheng · Chaoyue Wang · Bin Li

Drawing on recent advancements in diffusion models for text-to-image generation, identity-preserved personalization has made significant progress in accurately capturing specific identities with just a single reference image. However, existing methods primarily integrate reference images within the text embedding space, leading to a complex entanglement of image and text information, which poses challenges for preserving both identity fidelity and semantic consistency. To tackle this challenge, we propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization. Specifically, we introduce identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information while deactivating the original text cross-attention module of the diffusion model. This ensures that the image stream faithfully represents the identity provided by the reference image while mitigating interference from textual input. Additionally, we introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams. This mechanism not only enhances the fidelity of identity and semantic consistency but also enables convenient control over the styles of the generated images. Extensive experimental results on both raw photo generation and style image generation demonstrate the superior performance of our proposed method.


# 112
Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

Benjamin J Biggs · Arjun Seshadri · Yang Zou · Achin Jain · Aditya Golatkar · Yusheng Xie · Alessandro Achille · Ashwin Swaminathan · Stefano Soatto

We present Diffusion Soup, a compartmentalization method for Text-to-Image Generation that averages the weights of diffusion models trained on sharded data. By construction, our approach enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging. We show that a Diffusion Soup samples from a point in weight space that approximates the geometric mean of the distributions of constituent datasets, which reduces model memorization, offers copyright protection guarantees, and enables zero-shot style mixing. Empirically, Diffusion Soup outperforms a paragon model trained on the union of all data shards, achieving a 30% improvement in Image Reward (.34 → .44) on domain-sharded data and a 59% improvement in IR (.37 → .59) on aesthetic data. In both cases, souping also prevails in TIFA score (respectively, 85.5 → 86.5 and 85.6 → 86.8). We demonstrate robust unlearning (removing any individual domain shard only lowers performance by 1% in IR, .45 → .44) and validate our theoretical insights on copyright protection on real data. Finally, we showcase Diffusion Soup's ability to blend the distinct styles of models finetuned on different shards, resulting in the zero-shot generation of hybrid styles.
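
A minimal sketch of the weight-space averaging ("souping") step, assuming all shard models share the same architecture; the optional per-shard `weights` argument is an assumption of this example, not part of the paper's interface. Unlearning a shard amounts to re-running the same function without that shard's state dict.

```python
import torch

def soup(state_dicts, weights=None):
    """Average the parameters of models fine-tuned on different data shards.
    Adding or removing a shard only requires re-averaging, which is what
    enables training-free continual learning and unlearning."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    souped = {}
    for key in state_dicts[0]:
        souped[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return souped
```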


# 103
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

Jie Ren · Yaxin Li · Shenglai Zeng · Han Xu · Lingjuan Lyu · Yue Xing · Jiliang Tang

Recent advancements in text-to-image diffusion models have demonstrated their remarkable capability to generate high-quality images from textual prompts. However, increasing research indicates that these models memorize and replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks. In our study, we provide a novel perspective to understand this memorization phenomenon by examining its relationship with cross-attention mechanisms. We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens. The diffusion model is overfitted to these token embeddings, memorizing corresponding training images. To elucidate this phenomenon, we further identify and discuss various intrinsic findings of cross-attention that contribute to memorization. Building on these insights, we introduce an innovative approach to detect and mitigate memorization in diffusion models. The advantage of our proposed method is that it will not compromise the speed of either the training or the inference processes in these models while preserving the quality of generated images.


# 104
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

Chi-Pin Huang · Kai-Po Chang · Chung-Ting Tsai · Yung-Hsuan Lai · Fu-En Yang · Yu-Chiang Frank Wang

Concept erasure in text-to-image diffusion models aims to disable pre-trained diffusion models from generating images related to a target concept. To perform reliable concept erasure, the properties of robustness and locality are desirable. The former prevents the model from producing images associated with the target concept for any paraphrased or learned prompts, while the latter preserves its ability to generate images with non-target concepts. In this paper, we propose Reliable Concept Erasing via Lightweight Erasers (Receler). It learns a lightweight Eraser to perform concept erasing while satisfying the above desirable properties via the proposed concept-localized regularization and adversarial prompt learning schemes. Comprehensive experiments with various concepts verify the superiority of Receler over previous methods. Our code will be available upon acceptance.


# 213
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

Yue Han · Junwei Zhu · Keke He · Xu Chen · Yanhao Ge · Wei Li · Xiangtai Li · Jiangning Zhang · Chengjie Wang · Yong Liu

Previous techniques for face reenactment and swapping predominantly rely on GAN frameworks. However, recent research has shifted its focus towards leveraging large diffusion models for these tasks, owing to their superior generation capabilities. Nonetheless, training these models incurs significant computational costs, and the results have not yet attained satisfactory performance levels. To address this issue, we introduce Face-Adapter, an efficient and effective adapter designed for high-precision and high-fidelity face editing with pretrained diffusion models, which contains: 1) a Spatial Condition Generator that provides precise landmarks and background; 2) a Plug-and-play Identity Encoder that transfers face embeddings to the text space via a transformer decoder; and 3) an Attribute Controller that integrates spatial conditions and detailed attributes. Face-Adapter achieves comparable or even superior performance in terms of motion control precision, ID retention capability, and generation quality compared to fully fine-tuned models in face reenactment/swapping tasks. Additionally, Face-Adapter seamlessly integrates with popular pretrained diffusion models such as StableDiffusion. Full code will be made available.


# 99
Strong Double Blind
DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks

Sarah Jabbour · Gregory Kondas · Ella Kazerooni · Michael Sjoding · David Fouhey · Jenna Wiens

We propose a permutation-based explanation method for image classifiers. Current image model explanations like activation maps are limited to instance-based explanations in the pixel space, making it difficult to understand global model behavior. Permutation-based explanations for tabular data classifiers measure feature importance by comparing original model performance to model performance on data after permuting a feature. We propose an explanation method for image-based models that permutes interpretable concepts across dataset images. Given a dataset of images labeled with specific concepts like captions, we permute a concept across examples and then generate images via a text-conditioned diffusion model. Concept importance is then given by the change in classifier performance relative to unpermuted data. When applied to a set of concepts, the method generates a ranking of concept importance. We show that this approach recovers underlying model feature importance on synthetic and real-world image classification tasks.
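
A rough sketch of concept-level permutation importance along the lines of the recipe above, with hypothetical stand-ins for the text-conditioned generator and the classifier metric; all callables and names here are assumptions of this example.

```python
import numpy as np

def concept_importance(classifier_score, captions, concept_values, generate, rng=None):
    """Permutation importance of one concept for an image classifier.

    classifier_score: maps a list of images to a performance metric (stand-in).
    captions[i]:      caption of example i; concept_values[i] is the concept
                      phrase appearing in that caption.
    generate:         maps a caption to a synthetic image, e.g. a
                      text-conditioned diffusion model (stand-in).
    """
    rng = rng or np.random.default_rng(0)
    base = classifier_score([generate(c) for c in captions])

    permuted = rng.permutation(concept_values)      # shuffle the concept across examples
    shuffled_captions = [c.replace(old, new)
                         for c, old, new in zip(captions, concept_values, permuted)]
    perm = classifier_score([generate(c) for c in shuffled_captions])
    return base - perm   # a large drop suggests the classifier relies on this concept
```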


# 101
Do text-free diffusion models learn discriminative visual representations?

Soumik Mukhopadhyay · Matthew Gwilliam · Yosuke Yamaguchi · Vatsal Agarwal · Namitha Padmanabhan · Archana Swaminathan · Tianyi Zhou · Jun Ohya · Abhinav Shrivastava

Diffusion models have proven to be state-of-the-art methods for generative tasks. These models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. However, text-free diffusion models have typically not been explored for discriminative tasks. In this work, we take a pre-trained unconditional diffusion model and analyze its features post hoc. We find that the intermediate feature maps of the pre-trained U-Net are diverse and have hidden discriminative representation properties. To unleash the potential of these latent properties of diffusion models, we present novel aggregation schemes. Firstly, we propose a novel attention mechanism for pooling feature maps and further leverage this mechanism as DifFormer, a transformer feature fusion of different diffusion U-Net blocks and noise steps. Next, we also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art representation learning methods for discriminative tasks -- image classification with full and semi-supervision, transfer for fine-grained classification, object detection, and semantic segmentation.


# 87
Strong Double Blind
DataDream: Few-shot Guided Dataset Generation

Jae Myung Kim · Jessica Bader · Stephan Alaniz · Cordelia Schmid · Zeynep Akata

While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hindering the generalization of classification models trained on synthetic datasets. We propose DataDream, a framework for synthesizing classification datasets that more faithfully represents the real data distribution when guided by few-shot examples of the target classes. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets. We demonstrate the efficacy of DataDream through extensive experiments, surpassing state-of-the-art classification accuracy with few-shot data across 9 out of 10 datasets. Additionally, we provide insights into the impact of various factors, such as the number of real-shot and generated images as well as the fine-tuning compute on model performance.


# 203
DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation

Xiaobin Hu · Xu Peng · Donghao Luo · Xiaozhong Ji · Jinlong Peng · Zhengkai Jiang · Jiangning Zhang · Taisong Jin · Chengjie Wang · Rongrong Ji

Due to the difficulty and labor-intensive nature of obtaining highly accurate or matting annotations, only a limited amount of highly accurate labels is available to the public. To tackle this challenge, we propose DiffuMatting, which inherits the strong "everything" generation ability of diffusion models and endows them with the power of "matting anything". Our DiffuMatting can 1) act as an anything-matting factory with highly accurate annotations, and 2) be well-compatible with community LoRAs or various conditional control approaches to achieve community-friendly art design and controllable generation. Specifically, inspired by green-screen matting, we aim to teach the diffusion model to paint on a fixed green-screen canvas. To this end, a large-scale green-screen dataset (Green100K) is collected as a training dataset for DiffuMatting. Secondly, a green background control loss is proposed to keep the drawing board a pure green color to distinguish the foreground and background. To ensure the synthesized object has more edge details, a detail-enhancement transition boundary loss is proposed as a guideline to generate objects with more complicated edge structures. Aiming to simultaneously generate the object and its matting annotation, we build a matting head to perform green-color removal in the latent space of the VAE decoder. Our DiffuMatting shows several potential applications (e.g., matting-data generation, community-friendly art design, and controllable generation). As a matting-data generator, DiffuMatting synthesizes general object and portrait matting sets, effectively reducing the relative MSE error by 15.4% in General Object Matting and 11.4% in Portrait Matting tasks.


# 232
ZeST: Zero-Shot Material Transfer from a Single Image

Ta-Ying Cheng · Prafull Sharma · Andrew Markham · Niki Trigoni · Varun Jampani

We propose ZeST, a method for zero-shot material transfer to an object in the input image given a material exemplar image. ZeST leverages existing diffusion adapters to extract an implicit material representation from the exemplar image. This representation is used to transfer the material to the object in the input image using a pre-trained inpainting diffusion model, with depth estimates as geometry cues and grayscale object shading as illumination cues. The method works on real images without any training, resulting in a zero-shot approach. Both qualitative and quantitative results on real and synthetic datasets demonstrate that ZeST outputs photorealistic images with transferred materials. We also show the application of ZeST to perform multiple edits and robust material assignment under different illuminations.


# 229
Strong Double Blind
FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

Zhekai Chen · Wen Wang · Zhen Yang · Zeqing Yuan · Hao Chen · Chunhua Shen

We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image. Rather than concentrating on specific use cases such as appearance editing (image harmonization) or semantic editing (semantic image composition), we showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition applicable to both scenarios. We observe that the pre-trained diffusion models automatically identify simple copy-paste boundary areas as low-density regions during denoising. Building on this insight, we propose to optimize the composed image towards high-density regions guided by the diffusion prior. In addition, we introduce a novel mask-guided loss to further enable flexible semantic image composition. Extensive experiments validate the superiority of our approach in achieving generic zero-shot image composition. Additionally, our approach shows promising potential in various tasks, such as object removal and multi-concept customization.


# 238
Strong Double Blind
Learning Equilibrium Transformation for Gamut Expansion and Color Restoration

JUN XIAO · Changjian Shui · Zhi-Song Liu · Qian Ye · Kin-Man Lam

Existing imaging systems support wide-gamut images like ProPhoto RGB, but most images are typically encoded in a narrower gamut space (e.g., sRGB). Such images can therefore be enhanced by learning to recover the original color values beyond the sRGB gamut, i.e., the out-of-gamut values. Current methods incorporate metadata from the target wide-gamut images to expand the gamut while preventing distortion of in-gamut values. However, this metadata is hard to obtain in real-world scenarios. In this paper, we propose a novel method that requires no metadata. We formulate gamut expansion as a "root-finding" problem and learn an equilibrium transformation via a neural network. Specifically, our method defines a dynamic system that keeps in-gamut values stable to prevent color distortion and updates out-of-gamut values recurrently. We employ an implicit recurrent mechanism to iteratively extract features, which effectively mitigates the vanishing gradient problem and reduces GPU memory consumption to O(1) complexity. Experiments demonstrate the effectiveness and efficiency of our model for gamut expansion and color restoration, outperforming state-of-the-art models by 0.40 dB in PSNR with only 40K parameters.
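
Read as a fixed-point problem, the dynamic system repeatedly applies a learned update to out-of-gamut values while pinning in-gamut values to their observed colors, and stops once the iterate no longer changes. The toy PyTorch sketch below illustrates that iteration pattern only; the placeholder convolutional update, the tolerance, and the iteration cap are assumptions, not the paper's learned operator or its implicit-differentiation training.

```python
import torch
import torch.nn as nn

update = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(16, 3, 3, padding=1))   # placeholder update network

def equilibrium_expand(x, max_iters=50, tol=1e-4):
    """Iterate only on out-of-gamut pixels until an (approximate) fixed point."""
    in_gamut = ((x >= 0.0) & (x <= 1.0)).all(dim=1, keepdim=True)   # stable pixels
    z = x.clone()
    for _ in range(max_iters):
        z_next = torch.where(in_gamut, x, z + update(z))   # keep in-gamut values fixed
        if (z_next - z).abs().max() < tol:                 # simple convergence check
            return z_next
        z = z_next
    return z

x = torch.rand(1, 3, 32, 32) * 1.4 - 0.2    # toy image with some out-of-gamut values
with torch.no_grad():                       # iterate without storing the unrolled graph
    y = equilibrium_expand(x)
print(y.shape)
```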


# 230
Strong Double Blind
Timestep-Aware Correction for Quantized Diffusion Models

Yuzhe YAO · Feng Tian · Jun Chen · Haonan Lin · Guang Dai · Yong Liu · Jingdong Wang

Diffusion models have marked a significant breakthrough in the synthesis of semantically coherent images. However, their extensive noise estimation networks and the iterative generation process limit their wider application, particularly on resource-constrained platforms like mobile devices. Existing post-training quantization (PTQ) methods have managed to compress diffusion models to low precision. Nevertheless, due to the iterative nature of diffusion models, quantization errors tend to accumulate throughout the generation process. This accumulation of error becomes particularly problematic in low-precision scenarios, leading to significant distortions in the generated images. We attribute this accumulation issue to two main causes: error propagation and exposure bias. To address these problems, we propose a timestep-aware correction method for quantized diffusion models, which dynamically corrects the quantization error. By leveraging the proposed method in low-precision diffusion models, substantial enhancement of output quality can be achieved with only negligible computation overhead. Extensive experiments underscore our method's effectiveness and generalizability. By employing the proposed correction strategy, we achieve state-of-the-art (SOTA) results on low-precision models.


# 234
Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Zhuoyi Yang · Heyang Jiang · Wenyi Hong · Jiayan Teng · Wendi Zheng · Yuxiao Dong · Ming Ding · Jie Tang

Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory when generating ultra-high-resolution images (e.g., $4096\times4096$), the resolution of generated images is often limited to $1024\times1024$. In this work, we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than $5\times$ memory when generating $4096\times4096$ images.


# 241
Energy-Calibrated VAE with Test Time Free Lunch

Yihong Luo · Siya Qiu · Xingjian Tao · Yujun Cai · Jing Tang

In this paper, we propose a novel generative model that utilizes a conditional Energy-Based Model (EBM) to enhance a Variational Autoencoder (VAE), termed Energy-Calibrated VAE (EC-VAE). Specifically, VAEs often suffer from blurry generated samples due to the lack of tailored training on samples generated in the generative direction. On the other hand, EBMs can generate high-quality samples but require expensive Markov Chain Monte Carlo (MCMC) sampling. To address these issues, we introduce a conditional EBM for calibrating the generative direction of the VAE during training, without requiring it for generation at test time. In particular, we train EC-VAE on both the input data and the calibrated samples with adaptive weights to enhance efficacy while avoiding MCMC sampling at test time. Furthermore, we extend the calibration idea of EC-VAE to variational learning and normalizing flows, and apply EC-VAE to an additional application of zero-shot image restoration via a neural transport prior and range-null space theory. We evaluate the proposed method on two applications, image generation and zero-shot image restoration, and the experimental results show that our method achieves competitive performance in single-step non-adversarial generation.


# 231
Strong Double Blind
Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Qinyu Yang · Haoxin Chen · Yong Zhang · Menghan Xia · Xiaodong Cun · Zhixun Su · Ying Shan

To improve the quality of synthesized videos, one predominant approach is to retrain an expert diffusion model and then apply a noising-denoising process for refinement. Despite the significant training costs, maintaining content consistency between the original and enhanced videos remains a major challenge. To tackle this challenge, we propose a novel formulation that considers both visual quality and content consistency. Content consistency is ensured by a proposed loss function that maintains the structure of the input, while visual quality is improved by utilizing the denoising process of pretrained diffusion models. To address the formulated optimization problem, we have developed a plug-and-play noise optimization strategy, which we refer to as Noise Calibration. By simply refining the initial random noise through a few iterations, the original video content can be largely preserved, and the enhancement effect is significantly improved. Extensive experiments have demonstrated the effectiveness of the proposed method.


# 259
Prompt-Based Test-Time Real Image Dehazing: A Novel Pipeline

Zixuan Chen · Zewei He · Ziqian Lu · Xuecheng Sun · Zheming Lu

Existing methods attempt to improve models' generalization ability on real-world hazy images by exploring well-designed training schemes (e.g., CycleGAN, prior loss). However, most of them need very complicated training procedures to achieve satisfactory results. For the first time, we present a novel pipeline called Prompt-based Test-Time Dehazing (PTTD) to help generate visually pleasing results for real-captured hazy images during the inference phase. We experimentally observe that, given a dehazing model trained on synthetic data, fine-tuning the statistics (i.e., mean and standard deviation) of encoding features is able to narrow the domain gap, boosting the performance of real image dehazing. Accordingly, we first apply a prompt generation module (PGM) to generate a visual prompt, which provides the reference for appropriate statistical perturbations of the mean and standard deviation. Then, we insert a feature adaptation module (FAM) into existing dehazing models to adjust the original statistics with the guidance of the generated prompt. PTTD is model-agnostic and can be equipped with various state-of-the-art dehazing models trained on synthetic hazy-clean pairs to tackle the real image dehazing task. Extensive experimental results demonstrate that our PTTD is effective, achieving superior performance against state-of-the-art dehazing methods in real-world scenarios.
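
The adaptation step itself is lightweight: normalize each channel of the encoder features and re-scale them toward statistics derived from the visual prompt, in the spirit of AdaIN. A minimal sketch of such a feature-adaptation step is given below; the blend weight `alpha` and the way the prompt statistics are obtained are illustrative assumptions rather than the paper's PGM/FAM design.

```python
import torch

def adapt_features(feat, prompt_mean, prompt_std, alpha=0.5, eps=1e-5):
    """Shift per-channel feature statistics toward prompt-derived statistics."""
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True) + eps
    target_mean = (1 - alpha) * mean + alpha * prompt_mean   # blended target statistics
    target_std = (1 - alpha) * std + alpha * prompt_std
    return (feat - mean) / std * target_std + target_mean

feat = torch.randn(2, 64, 32, 32)        # encoder features of a hazy input
prompt_mean = torch.zeros(1, 64, 1, 1)   # stand-in statistics from the visual prompt
prompt_std = torch.ones(1, 64, 1, 1)
adapted = adapt_features(feat, prompt_mean, prompt_std)
print(adapted.shape)
```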


# 256
Strong Double Blind
Asymmetric Mask Scheme for Self-Supervised Real Image Denoising

Xiangyu Liao · Tianheng Zheng · Jiayu Zhong · Pingping Zhang · Chao Ren

In recent years, self-supervised denoising methods have achieved significant success and become critically important in the field of image restoration. Among them, blind-spot-network-based methods are the most typical type and have attracted the attention of many researchers. Although the introduction of blind-spot operations can prevent identity mapping from noise to noise, it imposes stringent requirements on the receptive fields in the network design, thereby limiting overall performance. To address this challenge, we propose a single-mask scheme for self-supervised denoising training, which eliminates the need for blind-spot operations and thereby removes constraints on the network structure design. Furthermore, to achieve denoising across the entire image during inference, we propose a multi-mask scheme. Our method, featuring an asymmetric mask scheme in training and inference, achieves state-of-the-art performance on existing real noisy image datasets. All the source code will be made available to the public.


# 269
Strong Double Blind
GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity

Shuo Cao · Yihao Liu · Wenlong Zhang · Yu Qiao · Chao Dong

Traditional single-task image restoration methods excel in handling specific degradation types but struggle with multiple degradations. To address this limitation, we propose Grouped Restoration with Image Degradation Similarity (GRIDS), a novel approach that harmonizes the competing objectives inherent in multiple-degradation restoration. We first introduce a quantitative method for assessing relationships between image degradations using statistical modeling of deep degradation representations. This analysis facilitates the strategic grouping of similar tasks, enhancing both the efficiency and effectiveness of the restoration process. Based on the degradation similarity, GRIDS divides restoration tasks into optimal groups, where tasks within the same group are highly correlated. For instance, GRIDS effectively groups 11 degradation types into 4 cohesive groups. Trained models within each group show significant improvements, with an average improvement of 0.09dB over single-task upper bound models and 2.24dB over the mix-training baseline model. GRIDS incorporates an adaptive model selection mechanism for inference, automatically selecting the appropriate grouped-training model based on the input degradation. This mechanism is particularly useful for real-world scenarios with unknown degradations as it does not rely on explicit degradation classification modules. Furthermore, our method can predict model generalization ability without the need for network inference, providing valuable insights for practitioners.
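
Given a pairwise similarity (or distance) between degradation representations, the grouping step reduces to standard clustering of the degradation types. The SciPy sketch below shows one way such grouping could be realized on stand-in representation vectors; the random features, cosine distance, and choice of three groups are assumptions for illustration, not the paper's learned degradation statistics or its reported 4-group split.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
degradations = ["noise", "blur", "jpeg", "rain", "haze", "low-light"]
reps = rng.normal(size=(len(degradations), 128))     # stand-in deep degradation representations

dist = pdist(reps, metric="cosine")                  # pairwise dissimilarity between degradations
tree = linkage(dist, method="average")               # agglomerative clustering
labels = fcluster(tree, t=3, criterion="maxclust")   # e.g., split into 3 groups

for name, group in zip(degradations, labels):
    print(f"{name}: group {group}")
```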


# 263
Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Zhiheng Li · Muheng Li · Jixuan Fan · Lei Chen · Yansong Tang · Jiwen Lu · Jie Zhou

Scale-arbitrary super-resolution based on implicit image functions has gained increasing popularity since it can better represent the visual world in a continuous manner. However, existing scale-arbitrary works are trained and evaluated on simulated datasets, where low-resolution images are generated from their ground truths by the simplest bicubic downsampling. These models exhibit limited generalization to real-world scenarios due to the greater complexity of real-world degradations. To address this issue, we build the RealArbiSR dataset, a new real-world super-resolution benchmark with both integer and non-integer scaling factors for the training and evaluation of real-world scale-arbitrary super-resolution. Moreover, we propose a Dual-level Deformable Implicit Representation (DDIR) to solve real-world scale-arbitrary super-resolution. Specifically, we design an appearance embedding and a deformation field to handle both image-level and pixel-level deformations caused by real-world degradations. The appearance embedding models the characteristics of low-resolution inputs to deal with photometric variations at different scales, and the pixel-based deformation field learns RGB differences which result from the deviations between real-world and simulated degradations at arbitrary coordinates. Extensive experiments show our trained model achieves state-of-the-art performance on the RealArbiSR and RealSR benchmarks for real-world scale-arbitrary super-resolution. The dataset and code are available at https://github.com/nonozhizhiovo/RealArbiSR.


# 258
Strong Double Blind
A New Dataset and Framework for Real-World Blurred Images Super-Resolution

Rui Qin · Ming Sun · Chao Zhou · Bin Wang

Recent Blind Image Super-Resolution (BSR) methods have shown proficiency on general images. However, we find that the efficacy of recent methods diminishes markedly when they are applied to image data with blur, even though images with intentional blur constitute a substantial proportion of general data. To further investigate and address this issue, we develop a new super-resolution dataset specifically tailored for blurred images, named the Real-world Blur-kept Super-Resolution (ReBlurSR) dataset, which consists of nearly 3000 defocus and motion-blur image samples with diverse blur sizes and varying blur intensities. Furthermore, we propose a new BSR framework for blurred images called Perceptual-Blur-adaptive Super-Resolution (PBaSR), which comprises two main modules: the Cross Disentanglement Module (CDM) and the Cross Fusion Module (CFM). The CDM utilizes dual-branch parallelism to isolate conflicting blur and general data during optimization. The CFM fuses the well-optimized priors from these distinct domains cost-effectively and efficiently based on model interpolation. By integrating these two modules, PBaSR achieves commendable performance on both general and blurred data without any additional inference or deployment cost and is generalizable across multiple model architectures. Rich experiments show that PBaSR achieves state-of-the-art (SOTA) performance across various quantitative metrics without incurring extra inference costs. Under the commonly adopted Learned Perceptual Image Patch Similarity (LPIPS) metric, PBaSR achieves an improvement of approximately 0.02-0.09 with diverse anchor methods and blur types, across both ReBlurSR and multiple widely used general BSR benchmarks.


# 253
Strong Double Blind
Blind Image Deblurring with Noise-Robust Kernel Estimation

Chanseok Lee · Jeongsol Kim · Seungmin Lee · Jaehwang Jung · Yunje Cho · Taejoong Kim · Taeyong Jo · Myungjun Lee · Jang Mooseok

Blind deblurring is an ill-posed inverse problem involving the retrieval of a clear image and a blur kernel from a single blurry image. The challenge grows considerably when strong noise of unknown level is introduced. Existing blind deblurring approaches heavily depend on hand-designed priors for natural images and blur kernels. However, these methods are highly sensitive to noise due to the disturbance of the solution space. Here, we propose a noise-robust blind deblurring framework based on a novel kernel estimation function and deep image prior (DIP). Specifically, the proposed kernel estimation function mitigates noise and recovers the blur kernel, leveraging the capability of DIP to capture the priors of natural images. Additionally, a multiple-kernel estimation scheme enables the deblurring task to succeed even when the noise level is unknown. Extensive experimental studies, including simulated images and real-world examples, demonstrate the superior deblurring performance of the proposed method.


# 243
Strong Double Blind
SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution

mingjun zheng · Long Sun · Jiangxin Dong · Jinshan Pan

Transformer-based restoration methods achieve strong performance, as the self-attention (SA) of the Transformer can explore non-local information for better high-resolution image reconstruction. However, the key dot-product SA requires substantial computational resources, which limits its application on low-power devices. Moreover, the low-pass nature of the SA mechanism limits its capacity for capturing local details, consequently leading to overly smooth reconstruction results. To address these issues, we propose a self-modulation feature aggregation (SMFA) module to collaboratively exploit both local and non-local feature interactions for more accurate reconstruction. Specifically, the SMFA module employs an efficient approximation of self-attention (EASA) branch to model non-local information and uses a local detail estimation (LDE) branch to capture local details. Additionally, we introduce a partial convolution-based feed-forward network (PCFN) to refine the representative features derived from the SMFA. Extensive experiments show that the proposed SMFANet family achieves a better trade-off between reconstruction performance and computational efficiency on public benchmark datasets. In particular, compared to the $\times$4 SwinIR-light, SMFANet+ achieves 0.14 dB higher performance on average over five public test sets and about $10\times$ faster runtime, with only about 43% of the model complexity (e.g., FLOPs).


# 272
MambaIR: A Simple Baseline for Image Restoration with State-Space Model

Hang Guo · Jinmin Li · Tao Dai · Zhihao Ouyang · Xudong Ren · Shu-Tao Xia

Recent years have seen significant advancements in image restoration, largely attributed to the development of modern deep neural networks, such as CNNs and Transformers. However, existing restoration backbones often face the dilemma between global receptive fields and efficient computation, hindering their application in practice. Recently, the Selective Structured State Space Model, especially the improved version Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers a way to resolve the above dilemma. However, the standard Mamba still faces certain challenges in low-level vision such as local pixel forgetting and channel redundancy. In this work, we introduce a simple but effective baseline, named MambaIR, which introduces both local enhancement and channel attention to improve the vanilla Mamba. In this way, our MambaIR takes advantage of the local pixel similarity and reduces the channel redundancy. Extensive experiments demonstrate the superiority of our method, for example, MambaIR outperforms SwinIR by up to 0.45dB on image SR, using similar computational cost but with a global receptive field.
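
The stated recipe, augmenting a state-space mixer with local enhancement and channel attention inside a residual block, can be sketched independently of the actual selective-scan kernel. Below is a toy PyTorch block in that spirit; the 1x1 convolution standing in for the Mamba mixer, the depthwise convolution as local enhancement, and the SE-style attention are illustrative assumptions, not the paper's exact modules.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim // reduction), nn.ReLU(),
                                nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x):                         # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # per-channel weights, (B, C)
        return x * w[:, :, None, None]

class RestorationBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.mixer = nn.Conv2d(dim, dim, 1)       # placeholder for the Mamba/selective-scan mixer
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # local enhancement branch
        self.ca = ChannelAttention(dim)

    def forward(self, x):
        return x + self.ca(self.mixer(x) + self.local(x))            # residual block

x = torch.randn(1, 64, 48, 48)
print(RestorationBlock()(x).shape)
```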


# 249
BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering

Xinmin Qiu · Congying Han · Zicheng Zhang · Bonan Li · Tiande Guo · Pingyu Wang · Xuecheng Nie

Developing blind video deflickering (BVD) algorithms to enhance video temporal consistency is gaining importance amid the flourishing of image processing and video generation. However, the intricate nature of video data complicates the training of deep learning methods, leading to high resource consumption and instability, notably under severe lighting flicker. This underscores the critical need for a compact representation beyond pixel values to advance BVD research and applications. Inspired by classic scale-time equalization (STE), our work introduces a histogram-assisted solution, called BlazeBVD, for high-fidelity and rapid BVD. Compared with STE, which directly corrects pixel values by temporally smoothing color histograms, BlazeBVD leverages smoothed illumination histograms within STE filtering to ease the challenge of learning temporal data with neural networks. Technically, BlazeBVD begins by condensing pixel values into illumination histograms that precisely capture flickering and local exposure variations. These histograms are then smoothed to produce a singular frame set, filtered illumination maps, and exposure maps. Using these deflickering priors, BlazeBVD employs a 2D network to restore faithful and consistent texture impacted by lighting changes or localized exposure issues. BlazeBVD also incorporates a lightweight 3D network to amend slight temporal inconsistencies while avoiding heavy resource consumption. Comprehensive experiments on synthetic, real-world, and generated videos showcase the superior qualitative and quantitative results of BlazeBVD, which achieves inference speeds up to 10x faster than the state of the art.
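
The STE-inspired preprocessing amounts to computing a per-frame illumination histogram and smoothing it along the time axis before any network is involved. A NumPy sketch of that step follows; the luminance weights, bin count, and Gaussian smoothing width are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def illumination_histograms(video, bins=64):
    """video: (T, H, W, 3) uint8 -> per-frame luminance histograms of shape (T, bins)."""
    luma = video.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    return np.stack([np.histogram(frame, bins=bins, range=(0, 255), density=True)[0]
                     for frame in luma])

video = np.random.randint(0, 256, size=(30, 64, 64, 3), dtype=np.uint8)  # toy flickering clip
hists = illumination_histograms(video)
smoothed = gaussian_filter1d(hists, sigma=2.0, axis=0)   # temporal smoothing across frames
print(hists.shape, smoothed.shape)
```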


# 252
Strong Double Blind
Towards Robust Full Low-bit Quantization of Super Resolution Networks

Denis Makhov · Irina Zhelavskaya · Ruslan Ostapets · Dehua Song · Kirill Solodskikh

Quantization is among the most common strategies to accelerate neural networks (NNs) on terminal devices. We are interested in increasing the robustness of Super-Resolution (SR) networks to low-bit quantization by considering a mathematical model of natural images. Natural images contain piecewise smooth areas with edges between them, and the number of pixels corresponding to edges is significantly smaller than the overall number of pixels. Since the SR task can be viewed as ill-posed restoration of edges and texture, we propose to explicitly focus quantized CNNs on the high-frequency part of the input image, thus hiding quantization error in edges and texture and providing visually appealing results. We extract edges and texture using well-known edge detectors based on finite-difference approximations of differential operators. To perform the inverse transformation, we propose to use a solver for partial differential equations with a regularization term that significantly increases the solution's robustness to errors in the operator domain. The proposed approach significantly outperforms its regular quantization counterpart in the case of full 4-bit quantization; for example, we achieve +3.76 dB for EDSR x2 and +3.67 dB for RFDN x2 on the test part of DIV2K.
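
The forward transform that routes high-frequency content to the quantized network starts from finite-difference approximations of the image derivatives. The short NumPy sketch below shows such a forward-difference extraction; the specific stencil and boundary handling are assumptions, and the regularized PDE-based inverse transform is not reproduced here.

```python
import numpy as np

def finite_difference_hf(img):
    """Forward-difference approximations of the image gradient, i.e., the
    high-frequency input that a quantized SR network would be focused on."""
    gx = np.diff(img, axis=1, append=img[:, -1:])   # horizontal derivative (replicated border)
    gy = np.diff(img, axis=0, append=img[-1:, :])   # vertical derivative (replicated border)
    return gx, gy

img = np.random.rand(32, 32).astype(np.float32)     # toy low-resolution input
gx, gy = finite_difference_hf(img)
print(gx.shape, gy.shape)
```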


# 251
Strong Double Blind
Solving the Inverse Problem of Microscopy Deconvolution with a Residual Beylkin-Coifman-Rokhlin Neural Network

Rui Li · Mikhail Kudryashev · Artur Yakimovich

Optical deconvolution in light microscopy (LM) refers to recovering object details from images, revealing the ground truth of samples. Traditional explicit methods in LM rely on the point spread function (PSF) used during image acquisition. Yet, these approaches often fall short due to inaccurate PSF models and noise artifacts, hampering the overall restoration quality. In this paper, we approach optical deconvolution as an inverse problem. Motivated by the nonstandard-form compression scheme introduced by Beylkin, Coifman, and Rokhlin (BCR), we propose an innovative physics-informed neural network, the Multi-Stage Residual-BCR Net (m-rBCR), to approximate optical deconvolution. We validate the m-rBCR model on four microscopy datasets: two simulated microscopy datasets from ImageNet and BioSR, real dSTORM microscopy images, and real widefield microscopy images. In contrast to explicit deconvolution methods (e.g., Richardson-Lucy) and other state-of-the-art NN models (U-Net, DDPM, CARE, DnCNN, ESRGAN, RCAN, Noise2Noise, MPRNet, and MIMO-U-Net), the m-rBCR model demonstrates superior performance in terms of PSNR and SSIM on the two real microscopy datasets and the simulated BioSR dataset. On the simulated ImageNet dataset, m-rBCR ranks second (right after MIMO-U-Net). With a backbone grounded in optical physics, m-rBCR uses far fewer trainable parameters while achieving better performance (from ~30 times fewer than the benchmark MIMO-U-Net to ~210 times fewer than ESRGAN). This enables m-rBCR to achieve a shorter runtime (from ~3 times faster than MIMO-U-Net to ~300 times faster than DDPM). In summary, by leveraging physics constraints, our model significantly reduces the potentially redundant parameters of expertise-oriented NN candidates and achieves high efficiency with superior performance.


# 283
Strong Double Blind
SAH-SCI: Self-Supervised Adapter for Efficient Hyperspectral Snapshot Compressive Imaging

Haijin Zeng · Yuxi Liu · Yongyong Chen · Youfa Liu · Chong Peng · Jingyong Su

Hyperspectral image (HSI) reconstruction is vital for recovering spatial-spectral information from compressed measurements in coded aperture snapshot spectral imaging (CASSI) systems. Despite the effectiveness of end-to-end and deep unfolding methods, their reliance on substantial training data poses challenges, notably the scarcity of labeled HSIs. Existing approaches often train on limited datasets, such as KAIST and CAVE, leading to biased models with poor generalization capabilities. To address these challenges, we propose a universal Self-Supervised Adapter for Hyperspectral Snapshot Compressive Imaging (SAH-SCI). Unlike full fine-tuning or linear probing, SAH-SCI enhances model generalization by training a lightweight adapter while preserving the original model's parameters. We propose a novel approach that combines spectral and spatial adaptation to enhance an image model's capacity for spatial-spectral reasoning. Additionally, we introduce a customized adapter self-supervised loss function that captures the consistency, group invariance, and image uncertainty of CASSI imaging. This approach effectively reduces the solution space for ill-posed HSI reconstruction. Experimental results demonstrate SAH-SCI's superiority over previous methods with fewer parameters, offering simplicity and adaptability to any end-to-end or unfolding method. Our approach paves the way for leveraging more robust image foundation models in future hyperspectral imaging tasks.


# 267
Strong Double Blind
Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling

Noam Elata · Tomer Michaeli · Michael Elad

Compressed Sensing (CS) facilitates rapid image acquisition by selecting a small subset of measurements sufficient for high-fidelity reconstruction. Adaptive CS seeks to further enhance this process by dynamically choosing future measurements based on information gleaned from data that is already acquired. However, many existing frameworks are often tailored to specific tasks and require intricate training procedures. We propose AdaSense, a novel Adaptive CS approach that leverages zero-shot posterior sampling with pre-trained diffusion models. By sequentially sampling from the posterior distribution, we can quantify the uncertainty of each possible future linear measurement throughout the acquisition process. AdaSense eliminates the need for additional training and boasts seamless adaptation to diverse domains with minimal tuning requirements. Our experiments demonstrate the effectiveness of AdaSense in reconstructing facial images from a small number of measurements. Furthermore, we apply AdaSense for active acquisition of medical images in the domains of magnetic resonance imaging (MRI) and computed tomography (CT), highlighting its potential for tangible real-world acceleration.
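
One concrete way to read "choose the next measurement where the posterior is most uncertain" is to take the direction of largest variance among the current posterior samples as the next sensing vector. The NumPy sketch below implements that selection rule on stand-in samples; it is a simplified illustration of the idea, not the paper's exact acquisition procedure.

```python
import numpy as np

def next_measurement(posterior_samples):
    """posterior_samples: (S, D) samples of the unknown signal.
    Returns the unit-norm direction of maximum posterior variance."""
    centered = posterior_samples - posterior_samples.mean(axis=0, keepdims=True)
    # leading right singular vector = top eigenvector of the sample covariance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
samples = rng.normal(size=(32, 256))   # stand-in diffusion posterior samples
a = next_measurement(samples)          # next sensing vector
x_true = rng.normal(size=256)          # unknown signal (toy)
y = a @ x_true                         # acquire the new linear measurement
print(a.shape, float(y))
```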


# 266
Strong Double Blind
DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays

Baochang Zhang · Zhi Qiao · Runkun Liu · Hong Li · Xiantong Zhen · Zhen Qian · Juan Zhang

Computed tomography (CT) is widely utilized in clinical settings because it delivers detailed 3D images of the human body. However, performing CT scans is not always feasible due to radiation exposure and limitations in certain surgical environments. As an alternative, reconstructing CT images from ultra-sparse X-rays offers a valuable solution and has gained significant interest in scientific research and medical applications. However, it presents great challenges as it is inherently an ill-posed problem, often compromised by artifacts resulting from overlapping structures in X-ray images. In this paper, we propose DiffuX2CT, which models CT reconstruction from orthogonal biplanar X-rays as a conditional diffusion process. DiffuX2CT is built on a 3D globally coherent denoising model with a new implicit conditioning mechanism, which we realize with a newly designed tri-plane decoupling generator and an implicit neural decoder. By doing so, DiffuX2CT achieves structure-controllable reconstruction, which enables 3D structural information to be recovered from 2D X-rays, therefore producing faithful textures in CT images. As an extra contribution, we collect a real-world lumbar CT dataset, called LumbarV, as a new benchmark to verify the clinical significance and performance of CT reconstruction from X-rays. Extensive experiments on this dataset and three more publicly available datasets demonstrate the effectiveness of our proposal.


# 139
Strong Double Blind
BaSIC: BayesNet Structure Learning for Computational Scalable Neural Image Compression

Yufeng Zhang · Hang Yu · Shizhan Liu · Wenrui Dai · Weiyao Lin

Despite superior rate-distortion performance over traditional codecs, Neural Image Compression (NIC) is limited by its computational scalability in practical deployment. Prevailing research focuses on accelerating specific NIC modules but is restricted in controlling overall computational complexity. To this end, this work introduces BaSIC (BayesNet structure learning for computational Scalable neural Image Compression), a comprehensive, computationally scalable framework that affords full control over NIC processes. We learn the Bayesian network (BayesNet) structure of NIC for controlling both neural network backbones and autoregressive units. The learning of BayesNet is achieved by solving two sub-problems, i.e., learning a heterogeneous bipartite BayesNet for the inter-node structure to regulate backbone complexity, and a multipartite BayesNet for the intra-node structure to optimize parallel computation in autoregressive units. Experiments demonstrate that our method not only facilitates full computational scalability with more accurate complexity control but also maintains competitive compression performance compared to other computation scalable frameworks under equivalent computational constraints. Code will be available after acceptance.


# 237
Strong Double Blind
SNeRV: Spectra-preserving Neural Representation for Video

Jina Kim · Jihoo Lee · Jewon Kang

Neural representation for video (NeRV), which employs a neural network to parameterize video signals, introduces a novel methodology for video representation. However, existing NeRV-based methods have difficulty capturing fine spatial details and motion patterns due to spectral bias, whereby a neural network learns high-frequency (HF) components at a slower rate than low-frequency (LF) components. In this paper, we propose spectra-preserving NeRV (SNeRV) as a novel approach to enhance implicit video representations by efficiently handling various frequency components. SNeRV uses a 2D discrete wavelet transform (DWT) to decompose video into LF and HF features, preserving spatial structures and directly addressing the spectral bias issue. To balance compactness, we encode only the LF components, while HF components that include fine textures are generated by a decoder. Specialized modules, including a multi-resolution fusion unit (MFU) and a high-frequency restorer (HFR), are integrated into the backbone to facilitate the representation. Furthermore, we extend SNeRV to effectively capture temporal correlations between adjacent video frames by casting the extension as an additional frequency decomposition in the temporal domain. This approach allows us to embed spatio-temporal LF features into the network using temporally extended up-sampling blocks (TUBs). Experimental results demonstrate that SNeRV outperforms existing NeRV models in capturing fine details and achieves enhanced reconstruction, making it a promising approach in the field of implicit video representations. The codes will be available at https://github.com/qwertja/SNeRV.
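
The frequency split at the core of SNeRV is an ordinary single-level 2D DWT: the LL band holds the low-frequency structure that gets encoded, while the LH/HL/HH bands hold the high-frequency detail left for the decoder. A minimal single-level Haar decomposition in NumPy is sketched below as an illustration; it assumes even image dimensions and is not the paper's wavelet configuration.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2D Haar DWT. img: (H, W) with even H and W.
    Returns the LL (low-frequency) and (LH, HL, HH) high-frequency sub-bands."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0     # row-wise average
    d = (img[0::2, :] - img[1::2, :]) / 2.0     # row-wise detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

frame = np.random.rand(64, 64).astype(np.float32)   # toy grayscale video frame
ll, (lh, hl, hh) = haar_dwt2(frame)
print(ll.shape, lh.shape)                            # each sub-band is (32, 32)
```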


# 135
Strong Double Blind
Multiscale Graph Texture Network

Ravishankar Evani · Deepu Rajan · Shangbo Mao

Texture recognition has predominantly relied on methods based on handcrafted features and, more recently, on Convolutional Neural Network (CNN)-based methods. However, many of these approaches do not capture the underlying directional relationships between visual vocabularies, attributes, and features. In this study, we introduce a graph-based deep learning framework for texture and material recognition, called the Graph Texture Network (GTN), that models the underlying directional associations among latent texture attributes, which are hierarchically related to visual texture attributes, facilitating information exchange among them and consequently improving discrimination among different texture and material categories. Operating on a non-Euclidean graph structure, GTN provides the flexibility to learn complex underlying relationships among latent texture attributes via a learnable masked adjacency matrix. To ensure the robustness of GTN to noise, especially on graphs with fewer vertices, we re-calibrate self-loop edge weights to preserve salient texture information within each vertex. We then utilize residual message passing to enrich the representations of latent texture attributes. Furthermore, GTN facilitates interaction across multiple graphs, representing texture information across a range of scales. Finally, GTN can be easily incorporated into a variety of CNN architectures for end-to-end training and does not require fine-tuning of pre-trained CNN backbones. Experimental results demonstrate that GTN achieves state-of-the-art performance on several benchmark texture and material datasets.


# 174
Strong Double Blind
DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration

Meng-Cheng Shih · Tsai-Ling Huang · Yu-Heng Shih · Hong-Han Shuai · Hsuan-Tung Liu · Yi-Ren Yeh · Ching-Chun Huang

Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail-Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.


# 1
Strong Double Blind
Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors

Tao Lin · lijia Yu · Gaojie Jin · Renjue Li · Peng Wu · Lijun Zhang

In recent years, the study of adversarial robustness in object detection systems, particularly those based on deep neural networks (DNNs), has become a pivotal area of research. Traditional physical attacks targeting object detectors, such as adversarial patches and texture manipulations, directly manipulate the surface of the object. While these methods are effective, their overt manipulation of objects may draw attention in real-world applications. To address this, this paper introduces a more subtle approach: an inconspicuous adversarial trigger that operates outside the bounding boxes, rendering the object undetectable to the model. We further enhance this approach by proposing the Feature Guidance (FG) technique and the Universal Auto-PGD (UAPGD) optimization strategy for crafting high-quality triggers. The effectiveness of our method is validated through extensive empirical testing, demonstrating its high performance in both digital and physical environments.


# 2
Strong Double Blind
Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection

Yuzhen Lin · Wentang Song · Bin Li · Yuezun Li · Jiangqun Ni · Han Chen · Qiushi Li

Previous studies in deepfake detection have shown promising results when the test face forgeries come from the same dataset as the training data. However, the problem remains challenging when one tries to generalize the detector to forgeries from unseen datasets created by unseen methods. Forgery augmentation is one powerful line of work for improving generalization performance. In this work, we present a novel general deepfake detection method, called Curricular Dynamic Forgery Augmentation (CDFA), which jointly trains a deepfake detector with a forgery augmentation policy network. Unlike previous works, we propose to progressively apply forgery augmentations following a monotonic curriculum during training. We further propose a dynamic forgery searching strategy to select one suitable forgery augmentation operation for each image, varying across training stages, producing a forgery augmentation policy optimized for better generalization. In addition, we propose a novel forgery augmentation named self-shifted blending image to simply imitate the temporal inconsistency of deepfake generation. Comprehensive experiments show that CDFA can significantly improve both cross-dataset and cross-manipulation performance of various naive deepfake detectors in a plug-and-play way, enabling them to attain superior performance over existing methods on several benchmark datasets. To facilitate reproducible research, we will release our code upon acceptance of the paper.


# 4
Strong Double Blind
AdversariaLeak: External Information Leakage Attack Using Adversarial Samples on Face Recognition Systems

Roye Katzav · Amit Giloni · Edita Grolman · Hiroo Saito · Tomoyuki Shibata · Tsukasa Omino · Misaki Komatsu · Yoshikazu Hanatani · Yuval Elovici · Asaf Shabtai

Face recognition (FR) systems are vulnerable to external information leakage (EIL) attacks, which can reveal sensitive information about the training data, thus compromising the confidentiality of the company's proprietary data and the privacy of the individuals concerned. Existing EIL attacks mainly rely on unrealistic assumptions, such as a high query budget for the attacker and massive computational power, resulting in impractical EIL attacks. We present AdversariaLeak, a novel and practical query-based EIL attack that targets the face verification model of FR systems by using carefully selected adversarial samples. AdversariaLeak uses substitute models to craft adversarial samples, which are then handpicked to infer sensitive information. Our extensive evaluation on the MAAD-Face and CelebA datasets, which includes over 200 different target models, shows that AdversariaLeak outperforms state-of-the-art EIL attacks in inferring the property that best characterizes the FR model's training set while maintaining a small query budget and practical attacker assumptions.


# 245
Strong Double Blind
Continual Learning for Remote Physiological Measurement: Minimize Forgetting and Simplify Inference

Qian Liang · Yan Chen · Yang Hu

Remote photoplethysmography (rPPG) has gained increasing attention in recent years for its ability to extract physiological signals from facial videos. Existing rPPG measurement methods have shown satisfactory performance in intra-dataset and cross-dataset scenarios. However, they often neglect the incremental learning scenario, in which training data is presented sequentially, resulting in the issue of catastrophic forgetting. Meanwhile, mainstream class incremental learning algorithms suffer performance degradation or even fail to transfer effectively to rPPG measurement. In this paper, we present a novel and practical method to tackle continual learning for rPPG measurement. We first employ adapter fine-tuning to adapt to new tasks efficiently while enhancing the model's stability. To alleviate catastrophic forgetting without storing previous samples, we design a prototype-based augmentation strategy to reproduce the domain factors of previous tasks. Additionally, drawing inspiration from humans' problem-solving manner, an inference simplification strategy is devised to convert potentially forgotten tasks into familiar ones for the model. To evaluate our method and enable fair comparisons, we create the first continual learning protocol for rPPG measurement. Extensive experiments demonstrate that our approach significantly surpasses the state-of-the-art methods.


# 109
NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation

Jingyang Huo · Yikai Wang · Yanwei Fu · Xuelin Qian · Chong Li · Yun Wang · Jianfeng Feng

Recent fMRI-to-image approaches mainly focus on associating fMRI signals with specific conditions of pre-trained diffusion models. These approaches, while producing high-quality images, capture only a limited aspect of the complex information in fMRI signals and offer little detailed control over image creation. In contrast, this paper proposes to directly modulate the generation process of diffusion models using fMRI signals. Our approach, NeuroPictor, divides the fMRI-to-image process into three steps: i) fMRI calibrated-encoding, which tackles multi-individual pre-training for a shared latent space to minimize individual differences and enable the subsequent cross-subject training; ii) fMRI-to-image cross-subject pre-training, which perceptually learns to guide the diffusion model with high- and low-level conditions across different individuals; and iii) fMRI-to-image single-subject refining, similar to step ii but focused on adapting to a particular individual. NeuroPictor extracts high-level semantic features from fMRI signals that characterize the visual stimulus and incrementally fine-tunes the diffusion model with a low-level manipulation network to provide precise structural instructions. By training with over 60,000 fMRI-image pairs from various individuals, our model enjoys superior fMRI-to-image decoding capacity, particularly in the within-subject setting, as evidenced on benchmark datasets.


# 62
Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning

Meixuan Li · Tianyu Li · Guoqing Wang · Peng Wang · Yang Yang · Jie Zou

In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of Segment Anything Model (SAM) sheds light on addressing local alignment challenges by providing free and high-quality solutions for region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal involves modeling region-wise representations using Gaussian Distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This innovative approach significantly enhances our ability to effectively capture cross-task relationships, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
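
If each SAM-detected region is summarized by a Gaussian over its pixel features, aligning two tasks on that region becomes a distance between two Gaussians, which is closed-form under a diagonal-covariance simplification (the squared 2-Wasserstein distance). The PyTorch sketch below computes such an alignment term for one region; the diagonal-covariance assumption and the specific loss form are illustrative, not necessarily the paper's exact objective.

```python
import torch

def gaussian_w2(feat_a, feat_b, eps=1e-6):
    """Squared 2-Wasserstein distance between diagonal Gaussians fitted to
    per-region features. feat_*: (N_pixels, C)."""
    mu_a, mu_b = feat_a.mean(0), feat_b.mean(0)
    std_a = feat_a.var(0, unbiased=False).clamp_min(eps).sqrt()
    std_b = feat_b.var(0, unbiased=False).clamp_min(eps).sqrt()
    return ((mu_a - mu_b) ** 2).sum() + ((std_a - std_b) ** 2).sum()

# Features of the same SAM-detected region from two task branches (e.g., seg and depth).
region_seg = torch.randn(500, 64)
region_depth = torch.randn(500, 64)
align_loss = gaussian_w2(region_seg, region_depth)   # alignment term added to the training loss
print(align_loss.item())
```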


# 66
Large-Scale Multi-Hypotheses Cell Tracking Using Ultrametric Contours Maps

Jordao Bragantini · Merlin Lange · Loïc A Royer

In this work, we describe a method for large-scale 3D cell tracking through a segmentation selection approach. The proposed method is effective at tracking cells across large microscopy datasets on two fronts: (i) it can solve problems containing millions of segmentation instances in terabyte-scale 3D+t datasets; (ii) it achieves competitive results with or without deep learning, bypassing the requirement for 3D annotated data, which is scarce in the fluorescence microscopy field. The proposed method computes cell tracks and segments using a hierarchy of segmentation hypotheses and selects disjoint segments by maximizing the overlap between adjacent frames. We show that this method is the first to achieve state-of-the-art results in both nuclei- and membrane-based cell tracking, evaluating it on the 2D epithelial cell benchmark and 3D images from the cell tracking challenge. Furthermore, it has a faster integer linear programming formulation, and the framework is flexible, supporting segmentations from individual off-the-shelf cell segmentation models or their combination as an ensemble. The code is available as supplementary material.


# 339
Strong Double Blind
SemTrack: A Large-scale Dataset for Semantic Tracking in the Wild

Pengfei Wang · Xiaofei Hui · Jing Wu · Zile Yang · Kian Eng Ong · Xinge Zhao · Beijia Lu · Dezhao Huang · Evan Ling · Weiling Chen · Keng Teck · Minhoe Hur · Jun Liu

Knowing merely where the target is located is not sufficient for many real-life scenarios. In contrast, capturing rich details about the tracked target via its semantic trajectory, i.e. who/what this target is interacting with and when, where, and how they are interacting over time, is especially crucial and beneficial for various applications (e.g., customer analytics, public safety). We term such tracking as Semantic Tracking and define it as tracking the target based on the user's input and then, most importantly, capturing the semantic trajectory of this target. Acquiring such information can have significant impacts on sales, public safety, etc. However, currently there is no dataset for such comprehensive tracking of the target. To address this gap, we create SemTrack, a large and comprehensive dataset containing annotations of the target's semantic trajectory. The dataset contains 6.7 million frames from 6961 videos, covering a wide range of 52 different interaction classes with 115 different object classes spanning 10 different supercategories in 12 types of different scenes, including both indoor and outdoor environments. We also propose SemTracker, a simple and effective method, and incorporate a meta-learning approach to better handle the challenges of this task. Our dataset and code can be found at \url{https://sutdcv.github.io/SemTrack}.


# 173
Strong Double Blind
DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

Qi Wang · Zhou Xu · Yuming Lin · Jingtao Ye · Hongsheng Li · Guangming Zhu · Syed Afaq Ali Shah · Mohammed Bennamoun · Liang Zhang

Neuromorphic sensors, specifically event cameras, revolutionize visual data acquisition by capturing pixel intensity changes with exceptional dynamic range, minimal latency, and energy efficiency, setting them apart from conventional frame-based cameras. The distinctive capabilities of event cameras have ignited significant interest in the domain of event-based action recognition, recognizing their vast potential for advancement. However, development in this field is currently slowed by the lack of comprehensive, large-scale datasets, which are critical for developing robust recognition frameworks. To bridge this gap, we introduce DailyDVS-200, a meticulously curated benchmark dataset tailored for the event-based action recognition community. DailyDVS-200 is extensive, covering 200 action categories across real-world scenarios, recorded by 47 participants, and comprises more than 22,000 event sequences. This dataset is designed to reflect a broad spectrum of action types, scene complexities, and data acquisition diversity. Each sequence in the dataset is annotated with 14 attributes, ensuring a detailed characterization of the recorded actions. Moreover, DailyDVS-200 is structured to facilitate a wide range of research paths, offering a solid foundation for both validating existing approaches and inspiring novel methodologies. By setting a new benchmark in the field, we challenge the current limitations of neuromorphic data processing and invite a surge of new approaches in event-based action recognition techniques, paving the way for future explorations in neuromorphic computing and beyond.


# 170
CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Tingbing Yan · Wenzheng Zeng · Yang Xiao · Xingyu Tong · Bo Tan · Zhiwen Fang · Zhiguo Cao · Joey Tianyi Zhou

Most existing one-shot skeleton-based action recognition methods focus on raw low-level information (e.g., joint locations) and may suffer from local information loss and low generalization ability. To alleviate these issues, we propose to leverage text descriptions generated by large language models (LLMs), which contain high-level human knowledge, to guide feature learning in a global-local-global way. In particular, during training, we design two prompts to obtain global and local text descriptions of each action from an LLM. We first utilize the global text description to guide the skeleton encoder to focus on informative joints (i.e., global-to-local). Then we build non-local interaction between local text and joint features to form the final global representation (i.e., local-to-global). To mitigate the asymmetry issue between the training and inference phases, we further design a dual-branch architecture that allows the model to perform novel-class inference without any text input, also making the additional inference cost negligible compared with the base skeleton encoder. Extensive experiments on three different benchmarks show that CrossGLG consistently outperforms existing SOTA methods by large margins, while its inference cost (model size) is only 2.8% of that of the previous SOTA. The source code will be released upon acceptance.


# 166
Towards More Practical Group Activity Detection: A New Benchmark and Model

Dongkeun Kim · Youngkil Song · Minsu Cho · Suha Kwak

Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both dataset and methodology due to their limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Café. Unlike existing datasets, Café is constructed primarily for GAD and presents more practical scenarios and metrics, as well as being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Café, where it outperformed previous work in terms of both accuracy and inference speed.


# 162
Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

Camillo Quattrocchi · Antonino Furnari · Daniele Di Mauro · Mario Valerio Giuffrida · Giovanni Maria Farinella

We consider the problem of transferring a temporal action segmentation system initially designed for exocentric (fixed) cameras to an egocentric scenario, where wearable cameras capture video data. The conventional supervised approach requires the collection and labeling of a new set of egocentric videos to adapt the model, which is costly and time-consuming. Instead, we propose a novel methodology which performs the adaptation leveraging existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, for which temporal action segmentation annotations do not need to be collected. We implement the proposed methodology with an approach based on knowledge distillation, which we investigate both at the feature and model level. To evaluate our approach, we introduce a new benchmark based on the Assembly101 dataset. Results demonstrate the feasibility and effectiveness of the proposed method against classic unsupervised domain adaptation and temporal sequence alignment approaches. Remarkably, without bells and whistles, our best model performs on par with supervised approaches trained on labeled egocentric data, without ever seeing a single egocentric label, achieving a +15.99% (28.59% vs 12.60%) improvement in the edit score on the Assembly101 dataset compared to a baseline model trained solely on exocentric data.
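
At the feature level, the adaptation can be viewed as distillation over synchronized pairs: a frozen model trained on labeled exocentric video provides targets that the egocentric branch regresses at matching timestamps, so no egocentric labels are required. The sketch below shows one such loss with stand-in linear feature extractors; the projection head and MSE objective are assumptions about a plausible instantiation, not the paper's exact distillation design.

```python
import torch
import torch.nn as nn

teacher = nn.Linear(1024, 256)          # stand-in for the frozen exocentric feature extractor
student = nn.Linear(1024, 256)          # egocentric model being adapted
proj = nn.Linear(256, 256)              # projection head on the student side
for p in teacher.parameters():
    p.requires_grad_(False)

exo_feats = torch.randn(8, 1024)        # features of synchronized exocentric frames
ego_feats = torch.randn(8, 1024)        # features of the paired egocentric frames

with torch.no_grad():
    target = teacher(exo_feats)          # teacher targets, no gradients
pred = proj(student(ego_feats))
distill_loss = nn.functional.mse_loss(pred, target)
distill_loss.backward()                  # updates only the student and the projection head
print(float(distill_loss))
```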


# 329
Strong Double Blind
Online Temporal Action Localization with Memory-Augmented Transformer

Youngkil Song · Dongkeun Kim · Minsu Cho · Suha Kwak

Online temporal action localization (On-TAL) is the task of identifying multiple action instances given a streaming video. Since existing methods take as input only a video segment of fixed size per iteration, they are limited in considering long-term context and require careful tuning of the segment size. To overcome these limitations, we propose the memory-augmented transformer (MATR). MATR utilizes a memory queue that selectively preserves past segment features, allowing it to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the current action and accesses the memory queue to estimate the start time of the action. Our method outperforms existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
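
The memory queue can be thought of as a bounded FIFO over past segment features, with a gate deciding which segments are worth preserving for later start-time estimation. A small PyTorch sketch of such a queue follows; the scalar keep-score threshold and capacity are placeholder assumptions rather than MATR's actual selection rule.

```python
import torch

class SegmentMemory:
    """Bounded queue of past segment features, providing long-term context."""
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.slots = []

    def push(self, seg_feat, keep_score):
        # Selectively preserve informative segments (placeholder threshold).
        if keep_score > 0.5:
            self.slots.append(seg_feat.detach())
            if len(self.slots) > self.capacity:   # drop the oldest entry
                self.slots.pop(0)

    def read(self):
        if not self.slots:
            return None
        return torch.stack(self.slots)            # (M, C) memory for cross-attention

memory = SegmentMemory(capacity=4)
for t in range(10):                               # streaming segments
    feat = torch.randn(256)
    memory.push(feat, keep_score=torch.rand(1).item())
ctx = memory.read()
print(None if ctx is None else ctx.shape)
```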


# 176
EgoLifter: Open-world 3D Segmentation for Egocentric Perception

Qiao Gu · Zhaoyang Lv · Duncan Frost · Simon Green · Julian Straub · Chris Sweeney

In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects and uses segmentation masks from the Segment Anything Model (SAM) as weak supervision to learn flexible and promptable definitions of object instances free of any specific object taxonomy. To handle the challenge of dynamic objects in ego-centric videos, we design a transient prediction module that learns to filter out dynamic objects in the 3D reconstruction. The result is a fully automatic pipeline that is able to reconstruct 3D object instances as collections of 3D Gaussians that collectively compose the entire scene. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates its state-of-the-art performance in open-world 3D segmentation from natural egocentric input. We run EgoLifter on various egocentric activity datasets which shows the promise of the method for 3D egocentric perception at scale.


# 207
Strong Double Blind
MeshSegmenter: Zero-Shot Mesh Segmentation via Texture Synthesis

ziming zhong · Yanyu Xu · Jing Li · Jiale Xu · Zhengxin Li · Chaohui Yu · Shenghua Gao

We present MeshSegmenter, a pioneering framework designed for 3D Zero-shot semantic segmentation. This innovative model successfully extends the powerful capabilities of 2D segmentation models to 3D meshes, delivering accurate 3D segmentation across diverse meshes and segment descriptions. Our contributions are threefold. Firstly, we introduce the MeshSegmenter framework, which consistently produces precise 3D segmentation results. Secondly, we propose the generation of textures based on object descriptions to augment 2D segmentation models with additional texture information, thereby improving their accuracy. By harnessing latent texture information unearthed from generative models based on 3D meshes, our model can perform accurate 3D segmentation in geometrically non-prominent areas, such as segmenting a car door within a car mesh. Lastly, we develop a multi-view revoting module that integrates 2D detection results and confidence scores from various views onto the 3D mesh, ensuring the 3D consistency of segmentation results and eliminating inaccuracies from specific perspectives. Through these innovations, MeshSegmenter offers stable and reliable 3D segmentation results, highlighting its potential as a transformative tool in the field of 3D zero-shot segmentation.


# 320
Spatial-Temporal Multi-level Association for Video Object Segmentation

Deshui Miao · Xin Li · Zhenyu He · Huchuan Lu · Ming-Hsuan Yang

Existing semi-supervised video object segmentation methods either focus on temporal feature matching or spatial-temporal feature modeling. However, they do not address the issues of sufficient target interaction and efficient parallel processing simultaneously, thereby constraining the learning of dynamic, target-aware features. To tackle these limitations, this paper proposes a spatial-temporal multi-level association framework, which jointly associates reference frame, test frame, and object features to achieve sufficient interaction and parallel target ID association with a spatial-temporal memory bank for efficient video object segmentation. Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features, which formulates feature extraction and interaction as the efficient operations of object self-attention, reference object enhancement, and test reference correlation. In addition, we propose a spatial-temporal memory to assist feature association and temporal ID assignment and correlation. We evaluate the proposed method by conducting extensive experiments on numerous video object segmentation datasets, including DAVIS 2016/2017 val, DAVIS 2017 test-dev, and YouTube-VOS 2018/2019 val. The favorable performance against the state-of-the-art methods demonstrates the effectiveness of our approach.


# 330
Strong Double Blind
Gated Temporal Diffusion for Stochastic Long-term Dense Anticipation

Olga Zatsarynna · Emad Bahrami · Yazan Abu Farha · Gianpiero Francesca · Juergen Gall

Long-term action anticipation has become an important task for many applications such as autonomous driving and human-robot interaction. Unlike short-term anticipation, predicting more actions into the future imposes a real challenge with the increasing uncertainty in longer horizons. While there has been significant progress in predicting more actions into the future, most of the proposed methods address the task in a deterministic setup and ignore the underlying uncertainty. In this paper, we propose a novel temporal diffusion network that models the uncertainty of both the observation and the future predictions. As the generator, we introduce a Gated Anticipation Network (GTAN) to model both observed and unobserved frames of a video in a mutual representation. On the one hand, using a mutual representation for past and future allows us to jointly model ambiguities in the observation and the future; on the other hand, by design, GTAN can intrinsically treat the observed and unobserved parts differently and steer the information flow between them. Our model achieves state-of-the-art results on the Breakfast, Assembly101 and 50Salads datasets in both stochastic and deterministic settings.


# 89
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

Jefferson Hernandez · Ruben Villegas · Vicente Ordonez

We propose ViC-MAE, a model that combines both Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global feature obtained by pooling the local representations learned under an MAE reconstruction loss and leveraging this representation under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to video and image classification tasks. Particularly, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE by achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when training on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks by obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When training on videos and images from diverse datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming in as a close second to the best supervised method. Source code and model checkpoints will be released with this paper.
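
The combination of a reconstruction loss with a contrastive objective over pooled global features can be sketched as follows. The encoder, temperature, pooling choice, and loss weight are illustrative assumptions, not the paper's recipe.

```python
# Minimal sketch: MAE-style reconstruction loss plus a contrastive objective
# over pooled (global) features from two views, in the spirit of ViC-MAE.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Contrastive loss between two batches of pooled global features."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                     # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def vic_mae_loss(local_tokens_a, local_tokens_b, recon_loss, lam=1.0):
    """local_tokens_*: (B, N, D) patch features from two views (e.g., two frames)."""
    g_a = local_tokens_a.mean(dim=1)               # pool local MAE features to a global vector
    g_b = local_tokens_b.mean(dim=1)
    return recon_loss + lam * info_nce(g_a, g_b)

# Usage with random stand-ins for encoder outputs and an MAE reconstruction loss.
tok_a, tok_b = torch.randn(8, 196, 768), torch.randn(8, 196, 768)
print(vic_mae_loss(tok_a, tok_b, recon_loss=torch.tensor(0.42)))
```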


# 300
Strong Double Blind
Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment

Huangbiao Xu · Xiao Ke · Yuezhou Li · Rui Xu · Huanqi Wu · Xiaofeng Lin · Wenzhong Guo

Action quality assessment (AQA) is a challenging vision task that requires discerning and quantifying subtle differences in actions from the same class. While recent research has made strides in creating fine-grained annotations for more precise analysis, existing methods primarily focus on coarse action segmentation, leading to limited identification of discriminative action frames. To address this issue, we propose a Vision-Language Action Knowledge Learning approach for action quality assessment, along with a multi-grained alignment framework to understand different levels of action knowledge. In our framework, prior knowledge, such as specialized terminology, is embedded into video-level, stage-level, and frame-level representations via CLIP. We further propose a new semantic-aware collaborative attention module to prevent confusing interactions and preserve textual knowledge in cross-modal and cross-semantic spaces. Specifically, we leverage the powerful cross-modal knowledge of CLIP to embed textual semantics into image features, which then guide action spatial-temporal representations. Our approach can be plugged into existing AQA methods, with or without frame-wise annotations. Extensive experiments and ablation studies show that our approach achieves state-of-the-art results on four public short- and long-term AQA benchmarks: FineDiving, MTL-AQA, JIGSAWS, and Fis-V.


# 298
VideoMamba: State Space Model for Efficient Video Understanding

Kunchang Li · Xinhao Li · Yi Wang · Yinan He · Yali Wang · Limin Wang · Yu Qiao

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution long video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) Sensitivity for recognizing short-term actions even with fine-grained motion differences; (3) Superiority in long-term video understanding, showcasing significant advancements over traditional feature-based models; and (4) Compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark for video understanding, offering a scalable and efficient solution for comprehensive video understanding. All the code and models will be released.


# 287
Text-Conditioned Resampler For Long Form Video Understanding

Bruno Korbar · Yongqin Xian · Alessio Tonioni · Andrew ZISSERMAN · Federico Tombari

Videos are a highly redundant data source, and it is often enough to identify a few key moments to solve any given task. In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time, allowing the model to use much longer chunks of video than earlier works. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we empirically validate its efficacy on a wide variety of evaluation tasks, and set a new state-of-the-art on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we determine tasks which require longer video contexts and that can thus be used effectively for further evaluation of long-range video models.
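
A rough sketch of the resampling idea follows: a small set of learnable queries cross-attends to frozen frame features together with text tokens, and the resulting fixed-size token set would be handed to the LLM. The dimensions, number of queries, and conditioning scheme are assumptions, not the paper's exact architecture.

```python
# Sketch of a text-conditioned resampler that compresses many frames into a
# fixed number of tokens via cross-attention. All sizes are illustrative.
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    def __init__(self, dim=768, num_queries=64, heads=8, layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList([
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(layers)])

    def forward(self, frame_tokens, text_tokens):
        # frame_tokens: (B, T*P, dim) frozen visual features from many frames
        # text_tokens:  (B, L, dim) embedded task/text condition
        B = frame_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        kv = torch.cat([text_tokens, frame_tokens], dim=1)   # condition on text + video
        for attn, norm in zip(self.blocks, self.norms):
            out, _ = attn(q, kv, kv)
            q = norm(q + out)
        return q                                             # (B, num_queries, dim) for the LLM

# Usage: 128 frames x 16 patches each are compressed to 64 tokens.
resampler = TextConditionedResampler()
vis = torch.randn(2, 128 * 16, 768)
txt = torch.randn(2, 12, 768)
print(resampler(vis, txt).shape)   # torch.Size([2, 64, 768])
```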


# 64
Strong Double Blind
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Zixu Cheng · Yujiang Pu · Shaogang Gong · Parisa Kordjamshidi · Yu Kong

Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables the localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries through a decompose-reconstruct manner, utilizing contrastive learning to achieve compositional generalization. However, these approaches have significant drawbacks in constructing negative queries: (1) Only dominant verbs and nouns are considered, overlooking the impact of other primitives such as prepositions and adverbs. (2) These negative samples are formed through random sampling and recombination, resulting in numerous semantically implausible combinations and forcing the model to learn unrealistic semantic differences. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks, Charades-CG and ActivityNet-CG, validate the effectiveness and generalizability of our proposed method. Our code will be available soon.


# 257
Vamos: Versatile Action Models for Video Understanding

Shijie Wang · Qi Zhao · Minh Quan · Nakul Agarwal · Kwonjoon Lee · Chen Sun

What makes good representations for video understanding, such as anticipating future activities, or answering video-conditioned questions? While earlier approaches focus on end-to-end learning directly from video pixels, we propose to revisit text-based representations, such as general-purpose video captions, which are interpretable and can be directly consumed by large language models (LLMs). Intuitively, different video understanding tasks may require representations that are complementary and at different granularity. To this end, we propose versatile action models (Vamos), a learning framework powered by a large language model as the ``reasoner'' that can flexibly leverage visual embeddings and free-form text descriptions as its input. To interpret the important text evidence for question answering, we generalize the concept bottleneck model to work with tokens and nonlinear models, using hard attention to select a small subset of tokens from the free-form text as inputs to the LLM reasoner. We evaluate Vamos on four complementary video understanding benchmarks, Ego4D, Next-QA, IntentQA, and EgoSchema, on its capability to model temporal dynamics, encode visual history, and perform reasoning. Surprisingly, we observe that text-based representations consistently achieve competitive performance on all benchmarks, and that visual embeddings provide marginal or no performance improvement, demonstrating the effectiveness of text-based video representation in the LLM era. We also qualitatively demonstrate that our token bottleneck model can select relevant evidence from free-form text, support test-time intervention, and achieve a 5x inference speedup while maintaining question answering performance. Code and models will be publicly released.


# 270
Strong Double Blind
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Kirolos Ataallah · Xiaoqian Shen · Eslam mohamed abdelrahman · Essam Sleiman · Mingchen Zhuge · Jian Ding · Deyao Zhu · Jürgen Schmidhuber · Mohamed Elhoseiny

Most current LLM-based models for video understanding can process videos only a few minutes in length. However, they struggle with lengthy videos due to challenges such as "noise and redundancy", as well as "memory and computation" constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content. Goldfish approaches these challenges with an efficient retrieval mechanism that initially gathers the top-k video clips relevant to the instruction before proceeding to provide the desired response. This design of the retrieval mechanism enables Goldfish to efficiently process arbitrarily long video sequences, facilitating its application in contexts such as movies or television series. To facilitate the retrieval process, we developed MiniGPT4-Video, which generates detailed descriptions for the video clips. In addressing the scarcity of benchmarks for long video evaluation, we adapted the TVQA short video benchmark for extended content analysis by aggregating questions from entire episodes, thereby shifting the evaluation from partial to full episode comprehension. We attained a 41.78% accuracy rate on the TVQA-long benchmark, surpassing previous methods by 14.94%. Our MiniGPT4-Video also shows exceptional performance in short video comprehension, exceeding existing state-of-the-art methods by 3.23%, 2.03%, 16.5% and 23.59% on the MSVD, MSRVTT, TGIF, and TVQA short video benchmarks, respectively. These results indicate that our models have significant improvements in both long- and short-video understanding.
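
The retrieve-then-answer step reduces to scoring per-clip description embeddings against the instruction and keeping the top-k clips. A minimal sketch with stand-in embeddings (the embedding model and the value of k are assumptions):

```python
# Minimal sketch of top-k clip retrieval before answering a question about a long video.
import torch
import torch.nn.functional as F

def retrieve_top_k(clip_embeddings, query_embedding, k=3):
    """clip_embeddings: (N, D), one vector per clip description; query_embedding: (D,)."""
    sims = F.cosine_similarity(clip_embeddings, query_embedding.unsqueeze(0), dim=-1)
    topk = torch.topk(sims, k=min(k, clip_embeddings.size(0)))
    return topk.indices.tolist(), topk.values.tolist()

# Usage with random stand-in embeddings for a long video split into 200 clips.
clips = torch.randn(200, 512)
query = torch.randn(512)
idx, scores = retrieve_top_k(clips, query, k=3)
print(idx, [round(s, 3) for s in scores])
# Only the selected clips (and their descriptions) would then be passed to the
# answering model instead of the full-length video.
```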


# 142
Strong Double Blind
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

Thanh Thong Nguyen · Yi Bin · Xiaobao Wu · Xinshuai Dong · Zhiyuan Hu · Khoi M Le · Cong-Duy Nguyen · See Kiong Ng · Anh Tuan Luu

Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering the downstream performance across unpopular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights, enabling dynamic adjustment of the model's focus throughout training. With the training guided by a small amount of unbiased meta-data and augmented by video-text data generated by a large vision-language model, we improve video-language representations and achieve superior performance on commonly used video question answering and text-video retrieval datasets.
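
One common way to realize an angular margin inside a contrastive loss is to score the positive pair at cos(theta + m) instead of cos(theta), which tempers the push toward perfect similarity. The sketch below shows that construction with assumed margin and temperature values; it is an illustration, not the paper's exact objective.

```python
# Sketch of a contrastive loss with an angular margin applied to positive pairs.
import torch
import torch.nn.functional as F

def angular_margin_contrastive(video_emb, text_emb, margin=0.1, tau=0.05):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / tau                             # (B, B) cosine similarities / tau
    # Replace the diagonal (positive) scores with cos(theta + m) <= cos(theta).
    cos_pos = (v * t).sum(dim=-1).clamp(-1 + 1e-6, 1 - 1e-6)
    theta = torch.acos(cos_pos)
    cos_pos_m = torch.cos(theta + margin)
    logits = logits.clone()
    idx = torch.arange(v.size(0))
    logits[idx, idx] = cos_pos_m / tau
    return 0.5 * (F.cross_entropy(logits, idx) + F.cross_entropy(logits.t(), idx))

# Usage with random embeddings for a batch of 16 video-text pairs.
print(angular_margin_contrastive(torch.randn(16, 256), torch.randn(16, 256)))
```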


# 143
Multi-Sentence Grounding for Long-term Instructional Video

Zeqian Li · QIRUI CHEN · Tengda Han · Ya Zhang · Yanfeng Wang · Weidi Xie

In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-scale instructional dataset and construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep. We make the following contributions: (i) improving the quality of sentences in the dataset by upgrading ASR systems to reduce errors from speech recognition and prompting a large language model to transform noisy ASR transcripts into descriptive steps; (ii) proposing a Transformer-based architecture with all texts as queries, iteratively attending to the visual features, to temporally align the generated steps to corresponding video segments. To measure the quality of our curated dataset, we train models for the task of multi-sentence grounding on it, i.e., given a long-term video and multiple associated sentences, the goal is to determine their corresponding timestamps in the video simultaneously. As a result, the model demonstrates superior performance on a series of multi-sentence grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks, namely 9.0% on HT-Step, 5.1% on HTM-Align and 1.9% on CrossTask. All codes, models, and the resulting dataset will be publicly released to the research community.


# 141
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Qilang Ye · Zitong Yu · Rui Shao · Xinyu Xie · Philip Torr · Xiaochun Cao

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) besides straightforwardly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required for large language models; 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct, to further enhance the capacity of CAT to model cross-semantic correlations; 3) we propose AI-assisted ambiguity-aware direct preference optimization, a strategy specialized in retraining the model to favor non-ambiguous responses and improve the ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially in Audio-Visual Question Answering (AVQA) tasks. The codes and the collected instructions will be released soon.


# 65
Strong Double Blind
CPM: Class-conditional Prompting Machine for Audio-visual Segmentation

Yuanhong Chen · Chong Wang · Yuyuan Liu · Hu Wang · Gustavo Carneiro

Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging transformer-based segmentation architecture due to its inherent ability to capture long-range dependencies and flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy combining class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is upgraded with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy. Code will be available.


# 189
SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

Zhengdi Yu · Shaoli Huang · yongkang cheng · Tolga Birdal

We present SignAvatars, the first large-scale, multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for Deaf and hard-of-hearing individuals. While there has been an exponentially growing body of research on digital communication, the majority of existing communication technologies primarily cater to spoken or written languages, instead of SL, the essential communication method for Deaf and hard-of-hearing communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D, as annotating 3D models and avatars for SL is usually an entirely manual and labor-intensive process conducted by SL experts, often resulting in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically-valid poses of body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline operating on our large corpus of SL videos. SignAvatars facilitates various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. Hence, to evaluate the potential of SignAvatars, we further propose a unified benchmark of 3D SL holistic motion production. We believe that this work is a significant step towards bringing the digital world to the Deaf and hard-of-hearing communities as well as people interacting with them.


# 334
Strong Double Blind
CityGuessr: City-Level Video Geo-Localization on a Global Scale

Parth Parag Kulkarni · Gaurav Kumar Nayak · Shah Mubarak

Video geolocalization is a crucial problem in current times. Given just a video, ascertaining where it is coming from can have a plethora of advantages. The problem of worldwide geolocalization has been tackled before, but only using the image modality. Its video counterpart remains relatively unexplored. Meanwhile, video geolocalization has also garnered some attention in the recent past, but the existing methods are all restricted to specific regions. This motivates us to explore the problem of video geolocalization at a global scale. Hence, we propose a novel problem of worldwide video geolocalization with the objective of hierarchically predicting the correct city, state/province, country, and continent, given a video. However, no large-scale video datasets with extensive worldwide coverage exist to train models for solving this problem. To this end, we introduce a new dataset, "CityGuessr68k", comprising 68,269 videos from 166 cities all over the world. We also propose a novel baseline approach to this problem, by designing a transformer-based architecture comprising an elegant "Self-Cross Attention" module for incorporating scenes as well as a "TextLabel Alignment" strategy for distilling knowledge from text labels in feature space. To further enhance our location prediction, we also utilize soft-scene labels. Finally, we demonstrate the performance of our method on our new dataset as well as the Mapillary (MSLS) dataset. Our complete dataset with code and models will be made publicly available for future use upon publication.


# 84
Strong Double Blind
WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification

Yonggan Wu · Ling-Chao Meng · Yuan Zichao · Sixian Chan · Hong-Qiang Wang

For the visible-infrared person re-identification (VI-ReID) task, one of the primary challenges lies in the significant cross-modality discrepancy. Existing methods struggle to conduct modality-invariant information mining. They often focus solely on mining singular dimensions like spatial or channel, and overlook the extraction of specific-modality multi-dimension information. To fully mine modality-invariant information across a wide range, we introduce the Wide-Ranging Information Mining Network (WRIM-Net), which mainly comprises a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach. Empowered by the proposed Global Region Interaction (GRI), MIIM comprehensively mines non-local spatial and channel information through intra-dimension interaction. Moreover, thanks to its low-computational-complexity design, separate MIIM modules can be positioned in shallow layers, enabling the network to better mine specific-modality multi-dimension information. AICL, by introducing the novel Cross-Modality Key-Instance Contrastive (CMKIC) loss, effectively guides the network in extracting modality-invariant information. We conduct extensive experiments not only on the well-known SYSU-MM01 and RegDB datasets but also on the latest large-scale cross-modality LLCM dataset. The results demonstrate WRIM-Net's superiority over state-of-the-art methods.


# 74
Strong Double Blind
Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

Andy V Huynh · Lauren Gillespie · Jael Lopez-Saucedo · Claire Tang · Rohan Sikand · Moisés Expósito-Alonso

Multimodal image-text contrastive learning has shown that joint representations can be learned across modalities. Here, we show how leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent. We propose ContRastive Image-remote Sensing Pre-training (CRISP)—a new pre-training task for ground-level and aerial image representation learning of the natural world—and introduce Nature Multi-View (NMV), a dataset of natural world imagery including >3 million ground-level and aerial image pairs for over 6,000 plant taxa across the ecologically diverse state of California. The NMV dataset and accompanying material are available at hf.co/datasets/andyvhuynh/NatureMultiView.


# 122
Strong Double Blind
AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

Shixiong Xu · Chenghao Zhang · Lubin Fan · Gaofeng Meng · SHIMING XIANG · Jieping Ye

In this study, we introduce a new problem raised by social media and photojournalism, named Image Address Localization (IAL), which aims to predict the readable textual address where an image was taken. Existing two-stage approaches involve predicting geographical coordinates and converting them into human-readable addresses, which can lead to ambiguity and be resource-intensive. In contrast, we propose an end-to-end framework named AddressCLIP to solve the problem with more semantics, consisting of two key ingredients: i) image-text alignment to align images with addresses and scene captions by contrastive learning, and ii) image-geography matching to constrain image features with the spatial distance in terms of manifold learning. Additionally, we have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem. Experiments demonstrate that our approach achieves compelling performance on the proposed datasets and outperforms representative transfer learning methods for vision-language models. Furthermore, extensive ablations and visualizations exhibit the effectiveness of the proposed method. The datasets and source code are available at https://github.com/xsx1001/AddressCLIP.
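
The image-geography matching ingredient can be pictured as a constraint that makes feature-space distances track spatial distances within a batch. The sketch below is only one plausible realization of such a term under stated assumptions (normalized pairwise distance matrices, an MSE penalty), not the paper's exact loss.

```python
# Hedged sketch of an image-geography matching term: pairwise feature distances
# are encouraged to follow pairwise spatial (GPS) distances within a batch.
import torch
import torch.nn.functional as F

def geography_matching_loss(img_feats, coords):
    """img_feats: (B, D) image embeddings; coords: (B, 2) planar coordinates in meters."""
    f = F.normalize(img_feats, dim=-1)
    feat_dist = torch.cdist(f, f)                  # (B, B) feature distances
    geo_dist = torch.cdist(coords, coords)         # (B, B) spatial distances
    # Normalize both distance matrices to [0, 1] so their scales are comparable.
    feat_dist = feat_dist / (feat_dist.max() + 1e-6)
    geo_dist = geo_dist / (geo_dist.max() + 1e-6)
    return F.mse_loss(feat_dist, geo_dist)

# Usage with random features and coordinates for a batch of 32 street-view images.
print(geography_matching_loss(torch.randn(32, 512), torch.rand(32, 2) * 1000))
```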


# 343
LingoQA: Video Question Answering for Autonomous Driving

Ana-Maria Marcu · Long Chen · Jan Hünermann · Alice Karnsund · Benoit Hanotte · Prajwal Chidananda · Saurabh Nair · Vijay Badrinarayanan · Alex Kendall · Jamie Shotton · Elahe Arani · Oleg Sinavski

We introduce LingoQA, a novel dataset and benchmark for video question answering in autonomous driving. The dataset contains 28K unique short video scenarios, and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 56.67% of the questions compared to 93.4% for humans. For evaluation, in addition to conducting a human study, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.


# 341
Dolphins: Multimodal Language Model for Driving

Yingzi Ma · Yulong Cao · Jiachen Sun · Marco Pavone · Chaowei Xiao

The quest for fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios demands human-like understanding and responsiveness. In this paper, we introduce Dolphins, a novel vision-language model architected to imbibe human-like abilities as a conversational driving assistant. Dolphins is adept at processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instructions. Building upon the open-sourced pretrained Vision-Language Model, OpenFlamingo, we first enhance Dolphins' reasoning capabilities through an innovative Grounded Chain of Thought (GCoT) process in the general domain. Then we tailor Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Through the utilization of the BDD-X dataset, we designed and consolidated four distinct AV tasks into Dolphins to foster a holistic understanding of intricate driving scenarios. As a result, the distinctive features of Dolphins are characterized into two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities including gradient-free instant adaptation via in-context learning and error recovery via reflection.


# 342
Strong Double Blind
PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation

Renjie Lu · Jing-Ke Meng · WEISHI ZHENG

Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction. Recent methods predict sub-goals on a constructed topology map at each step to enable long-term action planning. However, they suffer from high computational cost when attempting to support such high-level predictions with GCN-like models. In this work, we propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories, which refers to a path from the initial node to the candidate locations on a directed graph without detours. This planning strategy leads to an efficient model while achieving strong performance. Specifically, we introduce a directed graph to illustrate the explored area of the environment, emphasizing directionality. Then, we first define the trajectory representation as a sequence of directed edge features, which are extracted from the panorama based on the corresponding orientation. Ultimately, we assess and compare the alignment between the instruction and different trajectories during navigation to determine the next navigation target. Our method outperforms the previous SOTA method BEVBert on the RxR dataset and is comparable on the R2R dataset with only 9% of the computational cost.


# 344
Strong Double Blind
LLM as Copilot for Coarse-grained Vision-and-Language Navigation

Yanyuan Qiao · Qianyi Liu · Jiajun Liu · Jing Liu · Qi Wu

Vision-and-Language Navigation (VLN) involves guiding an agent through indoor environments using human-provided textual instructions. Coarse-grained VLN, with short and high-level instructions, has gained popularity as it closely mirrors real-world scenarios. However, a significant challenge is that these instructions are often too concise for agents to comprehend and act upon. Previous studies have explored allowing agents to seek assistance during navigation, but typically offer rigid support from pre-existing datasets or simulators. The advent of Large Language Models (LLMs) presents a novel avenue for aiding VLN agents. This paper introduces VLN-Copilot, a framework enabling agents to actively seek assistance when encountering confusion, with the LLM serving as a copilot to facilitate navigation. Our approach includes the introduction of a confusion score, quantifying the level of uncertainty in an agent's action decisions, while the LLM offers real-time detailed guidance for navigation. Experimental results on two coarse-grained VLN datasets demonstrate the efficacy of our method.


# 123
Strong Double Blind
Visual Grounding for Object-Level Generalization in Reinforcement Learning

Haobin Jiang · Zongqing Lu

Generalization is a pivotal challenge for agents following natural language instructions. To approach this goal, we leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning (RL) for object-centric tasks, which makes the agent capable of zero-shot generalization to unseen objects and instructions. By visual grounding, we obtain an object-grounded confidence map for the target object indicated in the instruction. Based on this map, we introduce two routes to transfer VLM knowledge into RL. Firstly, we propose an object-grounded intrinsic reward function derived from the confidence map to more effectively guide the agent towards the target object. Secondly, the confidence map offers a more unified, accessible task representation for the agent's policy, compared to language embeddings. This enables the agent to process unseen objects and instructions through comprehensible visual confidence maps, facilitating zero-shot object-level generalization. Single-task experiments prove that our intrinsic reward significantly improves performance on challenging skill learning. In multi-task experiments, through testing on tasks beyond the training set, we show that the agent, when provided with the confidence map as the task representation, possesses better generalization capabilities than language-based conditioning.
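
One way to picture an object-grounded intrinsic reward is to reward the agent when the target object becomes more salient in the grounding confidence map. The shaping below (peak plus coverage, rewarded as progress) is an illustrative assumption, not the paper's exact formulation.

```python
# Minimal sketch of turning a visual-grounding confidence map into an intrinsic reward.
import torch

def grounding_intrinsic_reward(conf_map, prev_score=0.0, scale=1.0):
    """conf_map: (H, W) per-pixel confidence that the target object is present."""
    # Combine peak confidence with coverage so both "found it" and "got closer" count.
    score = 0.5 * conf_map.max() + 0.5 * conf_map.mean()
    reward = scale * (score - prev_score)          # reward progress, not absolute saliency
    return reward.item(), score.item()

# Usage: confidence maps from two consecutive observations.
prev = torch.zeros(84, 84); prev[40:45, 40:45] = 0.3
cur = torch.zeros(84, 84);  cur[30:60, 30:60] = 0.8
r, s = grounding_intrinsic_reward(cur, prev_score=0.5 * prev.max() + 0.5 * prev.mean())
print(round(r, 4))   # positive: the target occupies more of the view with higher confidence
```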


# 179
m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Zixian Ma · Weikai Huang · Jieyu Zhang · Tanmay Gupta · Ranjay Krishna

Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m’s: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1565 task plans that are human-verified and correctly executable. With m&m’s, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments and provide practical recommendations for designing planners for m&m’s tasks.


# 53
Recursive Visual Programming

Jiaxin Ge · Sanjay Subramanian · Baifeng Shi · Roei Herzig · Trevor Darrell

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods show advancement in leveraging Large Language Models (LLMs) for complex problem-solving. Despite their potential, existing VP methods generate all code in a single function, which does not fully utilize the LLM's reasoning capacity and the modular adaptability of code. This results in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which better harnesses the reasoning capacity of LLMs, provides modular code structure between code pieces, and assigns different return types for the sub-problems elegantly. RVP tackles VQA tasks through top-down recursive code generation, allowing the decomposition of complicated problems into smaller parts. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks.


# 138
Any2Point: Empowering Any-modality Transformers for Efficient 3D Understanding

YIWEN TANG · Renrui Zhang · Jiaming Liu · Zoey Guo · Bin Zhao · Zhigang Wang · Dong Wang · Peng Gao · Hongsheng LI · Xuelong Li

Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in widespread scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models, lacking a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token with a positional encoding paired with the pre-trained model, which avoids 3D geometry loss caused by the true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaption of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method.
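
The "virtual projection" idea can be sketched as follows: each 3D point is projected onto a 2D plane only to index the source model's pre-trained 2D positional embeddings, without rendering any image. The grid size, projection plane, and embedding table below are illustrative assumptions, not the paper's exact mechanism.

```python
# Hedged sketch: assign each 3D token the 2D positional embedding of its virtual projection.
import torch
import torch.nn as nn

def virtual_project_xy(points, grid=14):
    """points: (B, N, 3) in [-1, 1]. Returns flat indices into a grid x grid pos-emb table."""
    xy = points[..., :2]                              # drop z: project onto the XY plane
    ij = ((xy + 1) / 2 * (grid - 1)).round().long().clamp(0, grid - 1)
    return ij[..., 1] * grid + ij[..., 0]             # (B, N) flat grid index per point

class PosEmbFrom2D(nn.Module):
    """Injects 2D positional priors into 3D point tokens via a virtual projection."""
    def __init__(self, grid=14, dim=768):
        super().__init__()
        # Stand-in for a frozen, pre-trained 2D positional-embedding table.
        self.pos_table = nn.Embedding(grid * grid, dim)
        self.grid = grid
    def forward(self, point_tokens, points):          # tokens: (B, N, dim), points: (B, N, 3)
        idx = virtual_project_xy(points, self.grid)
        return point_tokens + self.pos_table(idx)     # add the looked-up 2D positional prior

# Usage on a dummy point cloud of 1024 points.
pe = PosEmbFrom2D()
tokens, pts = torch.randn(2, 1024, 768), torch.rand(2, 1024, 3) * 2 - 1
print(pe(tokens, pts).shape)
```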


# 127
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Zhiyuan You · Zheyuan Li · Jinjin Gu · Zhenfei Yin · Tianfan Xue · Chao Dong

We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods. DepictQA allows for detailed, language-based, human-like evaluation of image quality by leveraging Multi-modal Large Language Models (MLLMs). Unlike conventional Image Quality Assessment (IQA) methods relying on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with humans' reasoning process. To build the DepictQA model, we establish a hierarchical task framework, and collect a multi-modal IQA training dataset. To tackle the challenges of limited training data and multi-image processing, we propose to use multi-source training data and specialized image tags. These designs result in a better performance of DepictQA than score-based approaches on multiple benchmarks. Moreover, compared with general MLLMs, DepictQA can generate more accurate reasoning descriptive languages. Our work shows the research potential of multi-modal IQA tasks. Codes and datasets are available at https://depictqa.github.io.


# 134
Strong Double Blind
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Zhecan Wang · Garrett Bingham · Adams Wei Yu · Quoc V. Le · Thang Luong · Golnaz Ghiasi

Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality, in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results show that benchmark performance on generated images is highly correlated (r=0.97) with that on real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.


# 140
Strong Double Blind
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Agneet Chatterjee · Yiran Luo · Tejas Gokhale · Yezhou Yang · Chitta R Baral

The rapid progression of Text-to-Image (T2I) and Multimodal Large Language Models (MLLMs) has resulted in their widespread adoption across multiple computer vision and natural language processing tasks. However, a common mode of failure that persists across both classes of models is their inability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework which improves and evaluates spatial fidelity in vision-language models. REVISION is a 3D rendering based pipeline that generates spatially accurate synthetic images, given a textual prompt. REVISION is an extendable framework, which currently supports 101 3D assets, 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also introduce the REVISION benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that utilizing rendering-based frameworks is an efficient approach for developing reasoning-aware generative models.


# 100
Strong Double Blind
ViG-Bias: Visually Grounded Bias Discovery and Mitigation

Badr-Eddine Marani · Mohamed HANINI · Nihitha Malayarukil · Stergios Christodoulidis · Maria Vakalopoulou · Enzo Ferrante

The proliferation of machine learning models in critical decision making processes has underscored the need for bias discovery and mitigation strategies. Identifying the reasons behind a biased system is not straightforward, since in many cases they are associated with hidden spurious correlations which are not easy to spot. Standard approaches rely on bias audits performed by analyzing model performance in pre-defined subgroups of data samples, usually characterized by common attributes like gender or ethnicity when it comes to people, or other specific attributes defining semantically coherent groups of images. However, it is not always possible to know a-priori the specific attributes defining the failure modes of visual recognition systems. Recent approaches propose to discover these groups by leveraging large vision language models, which enable the extraction of cross-modal embeddings and the generation of textual descriptions to characterize the subgroups where a certain model is underperforming. In this work, we argue that incorporating visual explanations (e.g. heatmaps generated via GradCAM or other approaches) can boost the performance of such bias discovery and mitigation frameworks. To this end, we introduce Visually Grounded Bias Discovery and Mitigation (ViG-Bias), a simple yet effective technique which can be integrated into a variety of existing frameworks to improve both discovery and mitigation performance. Our comprehensive evaluation shows that incorporating visual explanations enhances existing techniques like DOMINO, FACTS and Bias-to-Text, across several challenging datasets, including CelebA, Waterbirds, and NICO++.


# 295
GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator

Hengyuan Zhao · Pan Zhou · Mike Zheng Shou

Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but few research studies aim to gauge their ability to generate visual instruction tuning data. This paper proposes to explore the potential of empowering MLLMs to generate data independently without relying on GPT-4. We introduce Genixer, a comprehensive data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. Additionally, we outline two modes of data generation: task-agnostic and task-specific, enabling controllable output. We demonstrate that training LLaVA1.5 with a synthetic VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks. Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations.


# 132
Adversarial Prompt Tuning for Vision-Language Models

Jiaming Zhang · Xingjun Ma · Xin Wang · Lingyu Qiu · Jiaqi Wang · Yu-Gang Jiang · Jitao Sang

With the rapid advancement of multimodal learning, pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capacities in bridging the gap between visual and language modalities. However, these models remain vulnerable to adversarial attacks, particularly in the image modality, presenting considerable security risks. This paper introduces Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial robustness of image encoders in VLMs. AdvPT innovatively leverages learnable text prompts and aligns them with adversarial image embeddings, to address the vulnerabilities inherent in VLMs without the need for extensive parameter training or modification of the model architecture. We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques, further boosting defensive capabilities. Comprehensive experimental analyses provide insights into adversarial prompt tuning, a novel paradigm devoted to improving resistance to adversarial images through textual input modifications, paving the way for future robust multimodal learning research. These findings open up new possibilities for enhancing the security of VLMs. Our code will be available upon publication of the paper.
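
The core loop of adversarial prompt tuning, as described above, is to keep the image encoder frozen, craft adversarial images against it, and optimize learnable text context vectors so that class text embeddings align with the adversarial image embeddings. The toy sketch below uses stand-in encoders so it runs self-contained; the PGD settings, prompt length, and encoder shapes are assumptions, and the real method operates on CLIP.

```python
# Toy sketch of adversarial prompt tuning with stand-in encoders (not CLIP itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D, C, CTX = 128, 10, 4
image_encoder = nn.Linear(3 * 32 * 32, D)            # frozen stand-in for the image encoder
class_emb = nn.Parameter(torch.randn(C, D), requires_grad=False)   # fixed class-name embeddings
ctx = nn.Parameter(torch.zeros(CTX, D))              # learnable prompt context ("a photo of a ...")
for p in image_encoder.parameters():
    p.requires_grad_(False)

def text_features():
    # Stand-in text encoder: mix the learnable context with each class embedding.
    return F.normalize(class_emb + ctx.mean(dim=0), dim=-1)        # (C, D)

def pgd_attack(x, y, steps=3, eps=8 / 255, alpha=2 / 255):
    """Generate adversarial images against the frozen encoder + current prompts."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        feats = F.normalize(image_encoder((x + delta).flatten(1)), dim=-1)
        loss = F.cross_entropy(feats @ text_features().t() / 0.07, y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).detach()

opt = torch.optim.Adam([ctx], lr=1e-3)
x, y = torch.rand(16, 3, 32, 32), torch.randint(0, C, (16,))
for _ in range(5):
    x_adv = pgd_attack(x, y)
    feats = F.normalize(image_encoder(x_adv.flatten(1)), dim=-1)
    loss = F.cross_entropy(feats @ text_features().t() / 0.07, y)   # align prompts to adv images
    opt.zero_grad(); loss.backward(); opt.step()                    # only the prompts are updated
print(loss.item())
```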


# 136
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Brandon McKinzie · Zhe Gan · Jean-Philippe Fauconnier · Samuel Dodge · Bowen Zhang · Philipp Dufter · Dhruti Shah · Futang Peng · Anton Belyi · Max A Schwarzer · Hongyu Hè · Xianzhi Du · Haotian Zhang · Karanjeet Singh · Doug Kang · Tom Gunter · Xiang Kong · Aonan Zhang · Jianyu Wang · Chong Wang · Nan Du · Tao Lei · Sam Wiseman · Mark Lee · Zirui Wang · Ruoming Pang · Peter Grasch · Alexander Toshev · Yinfei Yang

In this work we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify performant components as well as design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.


# 126
Strong Double Blind
Synergy of Sight and Semantics: Visual Intention Understanding with CLIP

Qu Yang · Mang Ye · Dacheng Tao

Multi-label Intention Understanding (MIU) for images is a critical yet challenging domain, primarily due to the ambiguity of intentions leading to a resource-intensive annotation process. Current leading approaches are held back by the limited amount of labeled data. To mitigate the scarcity of annotated data, we leverage the Contrastive Language-Image Pre-training (CLIP) model, renowned for its proficiency in aligning textual and visual modalities. We introduce a novel framework, Intention Understanding with CLIP (IntCLIP), which utilizes a dual-branch approach. This framework exploits the 'Sight'-oriented knowledge inherent in CLIP to augment 'Semantic'-centric MIU tasks. Additionally, we propose Hierarchical Class Integration to effectively manage the complex layered label structure, aligning it with CLIP's nuanced sentence feature extraction capabilities. Our Sight-assisted Aggregation further refines this model by infusing the semantic feature map with essential visual cues, thereby enhancing the intention understanding ability. Through extensive experiments conducted on the standard MIU benchmark and other subjective tasks such as Image Emotion Recognition, IntCLIP clearly demonstrates superiority over current state-of-the-art techniques. The code and models will be released.


# 133
Strong Double Blind
FlexAttention for Efficient High-Resolution Vision-Language Models

Junyan Li · Delin Chen · Tianle Cai · Peihao Chen · Yining Hong · Zhenfang Chen · Yikang Shen · Chuang Gan

Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively take all these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded both as high-resolution tokens and low-resolution tokens, where only the low-resolution tokens and a few selected high-resolution tokens are utilized to calculate the attention map, which greatly shrinks the computational cost. The high-resolution tokens are selected via a high-resolution selection module which could retrieve tokens of relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated to the low-resolution tokens and text tokens, and input to a hierarchical self-attention layer which produces an attention map that could be used for the next-step high-resolution token selection. The hierarchical self-attention process and high-resolution token selection process are performed iteratively for each attention layer. Experiments on multimodal benchmarks prove that our FlexAttention outperforms existing high-resolution VLMs (e.g., by a relative ~9% on V* Bench and ~7% on TextVQA), while also significantly reducing the computational cost by nearly 40%.
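
The selection step can be sketched in isolation: use an attention map over low-resolution tokens to pick the corresponding high-resolution tokens, then attend over low-res, selected high-res, and text tokens only. The low-res-to-high-res index layout and the selection budget below are illustrative assumptions, not the paper's exact module.

```python
# Simplified sketch of attention-guided high-resolution token selection.
import torch

def select_highres_tokens(attn_to_lowres, highres_tokens, ratio=4, k=64):
    """
    attn_to_lowres: (B, N_low) attention mass received by each low-res token
    highres_tokens: (B, N_low * ratio**2, D), children of each low-res cell stored contiguously
    """
    k_cells = max(1, k // ratio ** 2)
    top_cells = attn_to_lowres.topk(k_cells, dim=-1).indices          # (B, k_cells)
    # Expand each selected low-res cell to its ratio**2 high-res children.
    child = torch.arange(ratio ** 2, device=highres_tokens.device)
    idx = (top_cells.unsqueeze(-1) * ratio ** 2 + child).flatten(1)   # (B, k_cells * ratio**2)
    gathered = torch.gather(
        highres_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, highres_tokens.size(-1)))
    return gathered                                                    # (B, ~k, D)

# Usage: 576 low-res tokens, 16x as many high-res tokens, keep ~64 high-res tokens.
attn = torch.rand(2, 576)
hi = torch.randn(2, 576 * 16, 1024)
selected = select_highres_tokens(attn, hi, ratio=4, k=64)
low, text = torch.randn(2, 576, 1024), torch.randn(2, 32, 1024)
fused_input = torch.cat([low, selected, text], dim=1)   # what the next attention layer sees
print(selected.shape, fused_input.shape)
```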


# 137
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

Xiangxiang Chu · Jianlin Su · Bo Zhang · Chunhua Shen

Large language models are built on top of a transformer-based architecture to process textual inputs. For example, the LLaMA family of models stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. We extensively evaluate its effectiveness using typical pre-training paradigms across a broad range of downstream tasks in image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision understanding and generation.


# 97
Strong Double Blind
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

Kwanyong Park · Kuniaki Saito · Donghyun Kim

Vision-language (VL) models have been shown to be very effective in a variety of object detection tasks by utilizing weakly supervised image-text pairs from the web. However, these models exhibit a limited understanding of complex compositions of visual objects (e.g., attributes, shapes, and their relations), resulting in a significant performance drop given complex and diverse language queries. While conventional methods try to enhance VL models through the use of hard negative synthetic augmentation on the text domain, their effectiveness remains restricted without densely paired image-text augmentation. In this paper, we introduce a structured synthetic data generation approach to improve the compositional understanding of VL models for language-based object detection, which generates densely paired positive and negative triplets (object, text descriptions, bounding boxes) in both image and text domains. In addition, in order to train VL models effectively, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost on the Omnilabel benchmark by up to +5 AP and on the D3 benchmark by +6.9 AP over existing baselines.


# 131
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Brian Gordon · Yonatan Bitton · Yonatan Shafir · Roopal Garg · Xi Chen · Dani Lischinski · Danny Cohen-Or · Idan Szpektor

While existing image-text alignment models achieve high-quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanations of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image along with corresponding textual explanations and visual indicators. We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision-language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines on both the binary alignment classification and the explanation generation tasks.


# 120
Strong Double Blind
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Sara Sarto · Marcella Cornia · Lorenzo Baraldi · Rita Cucchiara

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.


# 121
Strong Double Blind
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

Shunqi Mao · Chaoyi Zhang · Hang Su · Hwanjun Song · Igor Shalyminov · Weidong Cai

Contextualized Image Captioning (CIC) evolves traditional image captioning into a more complex domain, necessitating multimodal reasoning. It aims to generate image captions given specific contextual information. This paper further introduces a novel domain of Controllable Contextualized Image Captioning (Ctrl-CIC). Unlike CIC, which solely relies on broad context, Ctrl-CIC accentuates a user-defined highlight, compelling the model to tailor captions that resonate with the highlighted aspects of the context. We present two approaches, Prompting-based Controller (P-Ctrl) and Recalibration-based Controller (R-Ctrl), to generate focused captions. P-Ctrl conditions generation on the highlight by prepending captions with highlight-driven prefixes, whereas R-Ctrl tunes the model to selectively recalibrate the encoder embeddings for highlighted tokens. Additionally, we design a GPT-4V-empowered evaluator to assess the quality of the controlled captions alongside standard assessment methods. Extensive experimental results demonstrate the efficient and effective controllability of our method, charting a new direction in achieving user-adaptive image captioning. Code will be released upon acceptance.
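As a toy illustration of the prompting-based controller idea (conditioning generation on a user highlight via a prepended prefix), here is a hypothetical prompt builder; the template and the downstream `generator.generate` call are assumptions, not the paper's prompts.

```python
def build_pctrl_prompt(context: str, highlight: str) -> str:
    """Prepend a highlight-driven prefix so the caption focuses on the highlight."""
    prefix = f"Focus on: {highlight}."
    return f"{prefix}\nContext: {context}\nCaption:"


prompt = build_pctrl_prompt(
    context="A travel blog page about hiking in the Alps, with a photo of a mountain lake.",
    highlight="the turquoise color of the lake",
)
print(prompt)
# The prompt would then be paired with the image and fed to a contextualized
# captioning model, e.g. generator.generate(image=img, text=prompt).
```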


# 124
CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

Yichao Cai · Yuhang Liu · Zhen Zhang · Javen Qinfeng Shi

Contrastive vision-language models, such as CLIP, have garnered considerable attention for various downstream tasks, mainly due to the remarkable generalization ability of the learned features. However, the features they learn often blend content and style information, which somewhat limits their generalization capabilities under distribution shifts. To address this limitation, we adopt a causal generative perspective for multimodal data and propose contrastive learning with data augmentation to disentangle content features from the original representations. To achieve this, we begin by exploring image augmentation techniques and develop a method to seamlessly integrate them into pre-trained CLIP-like models to extract pure content features. Taking a step further, recognizing the inherent semantic richness and logical structure of text data, we explore the use of text augmentation to isolate latent content from style features. This enables the encoders of CLIP-like models to concentrate on latent content information, refining the representations learned by pre-trained CLIP-like models. Our extensive experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks, alongside enhanced robustness to various perturbations. These results underscore the effectiveness of our proposed methods in refining vision-language representations and advancing the state-of-the-art in multimodal learning.


# 130
Strong Double Blind
Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning

Mainak Singha · Ankit Jha · Divyam Gupta · Pranav Singla · Biplab Banerjee

We address the challenges inherent in sketch-based image retrieval (SBIR) across various settings, including zero-shot SBIR, generalized zero-shot SBIR, and fine-grained zero-shot SBIR, by leveraging the vision-language foundation model CLIP. While recent endeavors have employed CLIP to enhance SBIR, these approaches predominantly follow uni-modal prompt processing and fail to fully exploit CLIP's integrated visual and textual capabilities. To bridge this gap, we introduce SpLIP, a novel multi-modal prompt learning scheme designed to operate effectively with frozen CLIP backbones. We diverge from existing multi-modal prompting methods that either treat visual and textual prompts independently or integrate them in a limited fashion, leading to suboptimal generalization. SpLIP implements a bi-directional prompt-sharing strategy that enables mutual knowledge exchange between CLIP's visual and textual encoders, fostering a more cohesive and synergistic prompt processing mechanism that significantly reduces the semantic gap between the sketch and photo embeddings. In addition to pioneering multi-modal prompt learning, we propose two innovative strategies for further refining the embedding space. The first is an adaptive margin generation for the sketch-photo triplet loss, regulated by CLIP's class textual embeddings. The second introduces a novel task, termed conditional cross-modal jigsaw, aimed at enhancing fine-grained sketch-photo alignment by implicitly modelling the viable patch arrangement of sketches using knowledge of unshuffled photos. Our comprehensive experimental evaluations across multiple benchmarks demonstrate the superior performance of SpLIP in all three SBIR scenarios.


# 113
Strong Double Blind
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

Xianyu Chen · Ming Jiang · Qi Zhao

While exploring visual scenes, humans' scanpaths are driven by their underlying attention processes. Understanding visual scanpaths is essential for various applications. Traditional scanpath models predict the where and when of gaze shifts without providing explanations, creating a gap in understanding the rationale behind fixations. To bridge this gap, we introduce GazeXplain, a novel study of visual scanpath prediction and explanation. This involves annotating natural-language explanations for fixations across eye-tracking datasets and proposing a general model with an attention-language decoder that jointly predicts scanpaths and generates explanations. It integrates a unique semantic alignment mechanism to enhance the consistency between fixations and explanations, alongside a cross-dataset co-training approach for generalization. These novelties present a comprehensive and adaptable solution for explainable human visual scanpath prediction. Extensive experiments on diverse eye-tracking datasets demonstrate the effectiveness of GazeXplain in both scanpath prediction and explanation, offering valuable insights into human visual attention and cognitive processes.


# 80
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery

Haiyang Zheng · Pu Nan · Wenjing Li · Niculae Sebe · Zhun Zhong

In this paper, we study the problem of Generalized Category Discovery (GCD), which aims to cluster unlabeled data from both known and unknown categories using the knowledge of labeled data from known categories. Current GCD methods rely solely on visual cues, neglecting the multi-modality perceptive nature of human cognitive processes in discovering novel visual categories. To address this, we propose a two-phase TextGCD framework to accomplish multi-modality GCD by exploiting powerful Visual-Language Models. TextGCD mainly includes a retrieval-based text generation (RTG) phase and a cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon using category tags from diverse datasets and attributes from Large Language Models, generating descriptive texts for images in a retrieval manner. Second, CCT leverages disparities between textual and visual modalities to foster mutual learning, thereby enhancing visual GCD. In addition, we design an adaptive class aligning strategy to ensure the alignment of category perceptions between modalities, as well as a soft-voting mechanism to integrate multi-modality cues. Experiments on eight datasets show the large superiority of our approach over state-of-the-art methods. Notably, our approach outperforms the best competitor by 7.7% and 10.8% in All accuracy on ImageNet-1k and CUB, respectively.


# 102
Strong Double Blind
Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers

Zhengbo Zhang · Li Xu · Duo Peng · Hossein Rahmani · Jun Liu

We introduce Diff-Tracker, a novel approach for the challenging unsupervised visual tracking task leveraging a pre-trained text-to-image diffusion model. Our main idea is to leverage the rich knowledge encapsulated within the pre-trained diffusion model, such as the understanding of image semantics and structural information, to address unsupervised visual tracking. To this end, we design an initial prompt learner to enable the diffusion model to recognize the tracking target by learning a prompt representing the target. Furthermore, to facilitate dynamic adaptation of the prompt to the target's movements, we propose an online prompt updater. Extensive experiments on five benchmark datasets demonstrate the effectiveness of our proposed method, which also achieves state-of-the-art performance.


# 323
Trackastra: Transformer-based cell tracking for live-cell microscopy

Benjamin Gallusser · Weigert Martin

Cell tracking is an omnipresent image analysis task in live-cell microscopy. It is similar to multiple object tracking (MOT); however, each frame contains hundreds of similar-looking objects that can divide, making it a challenging problem. Current state-of-the-art approaches follow the tracking-by-detection paradigm, i.e., first all cells are detected per frame and then linked in a second step to form biologically consistent cell tracks. Linking is commonly solved via discrete optimisation methods, which require manual tuning of hyperparameters for each dataset and are therefore cumbersome to use in practice. Here we propose Trackastra, a general-purpose cell tracking approach that uses a simple transformer architecture to directly learn pairwise associations of cells within a temporal window from annotated data. Importantly, unlike existing transformer-based MOT pipelines, our learning architecture also accounts for dividing objects such as cells and allows for accurate tracking even with simple greedy linking, thus removing the requirement for a complex linking step. The proposed architecture operates on the full spatio-temporal context of detections within a time window while avoiding the computational burden of processing dense images. We show that our tracking approach performs on par with or better than highly tuned state-of-the-art cell tracking algorithms for various biological datasets, such as bacteria, cell cultures and fluorescent particles. We provide code at https://github.com/*.
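To make the "simple greedy linking" step concrete, here is a generic sketch that links detections between consecutive frames from a matrix of pairwise association scores. The random scores and threshold are placeholders, and cell divisions (one parent, two children) are not handled in this toy version.

```python
import numpy as np


def greedy_link(scores: np.ndarray, min_score: float = 0.5):
    """Greedily match detections of frame t (rows) to frame t+1 (cols).

    Returns (row, col) links, highest score first, each detection used at most once.
    """
    links, used_rows, used_cols = [], set(), set()
    order = np.dstack(np.unravel_index(np.argsort(-scores, axis=None), scores.shape))[0]
    for r, c in order:
        if scores[r, c] < min_score:
            break                                  # remaining pairs are too unlikely
        if r in used_rows or c in used_cols:
            continue                               # detection already linked
        links.append((int(r), int(c)))
        used_rows.add(r)
        used_cols.add(c)
    return links


scores = np.random.rand(5, 6)   # 5 cells in frame t, 6 candidate cells in frame t+1
print(greedy_link(scores, min_score=0.6))
```

In the method itself the score matrix would come from the learned transformer associations rather than random numbers.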


# 164
Strong Double Blind
Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking

Lorenzo Vaquero · Yihong XU · Xavier Alameda-Pineda · Victor M. Brea · Manuel Mucientes

Multi-object tracking (MOT) endeavors to precisely estimate the positions and identities of multiple objects over time. The prevailing approach, tracking-by-detection (TbD), first detects objects and then links detections, resulting in a simple yet effective method. However, contemporary detectors may occasionally miss some objects in certain frames, causing trackers to cease tracking prematurely. To tackle this issue, we propose BUSCA, meaning 'to search', a versatile framework compatible with any online TbD system, enhancing its ability to persistently track those objects missed by the detector, primarily due to occlusions. Remarkably, this is accomplished without modifying past tracking results or accessing future frames, i.e., in a fully online manner. BUSCA generates proposals based on neighboring tracks, motion, and learned tokens. Utilizing a decision Transformer that integrates multimodal visual and spatiotemporal information, it addresses the object-proposal association as a multi-choice question-answering task. BUSCA is trained independently of the underlying tracker, solely on synthetic data, without requiring fine-tuning. Through BUSCA, we showcase consistent performance enhancements across five different trackers and establish a new state-of-the-art baseline across three different benchmarks. We will publicly release the code to facilitate its integration into future MOT methods.


# 163
Strong Double Blind
Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Object Appearance Graphs

Mattia Segu · Luigi Piccinelli · Siyuan Li · Luc Van Gool · Fisher Yu · Bernt Schiele

The supervision of state-of-the-art multiple object tracking (MOT) methods requires enormous annotation efforts to provide bounding boxes for all frames of all videos, and instance IDs to associate them through time. To this end, we introduce Walker, the first self-supervised tracker that learns from videos with sparse bounding box annotations, and no tracking labels. First, we design a quasi-dense temporal object appearance graph, and propose a novel multi-positive contrastive objective to optimize random walks on the graph and learn instance similarities. Then, we introduce an algorithm to enforce mutually-exclusive connective properties across instances in the graph, optimizing the learned topology for MOT. At inference time, we propose to associate detected instances to tracklets based on the max-likelihood transition state under motion-constrained bi-directional walks. Walker is the first self-supervised tracker to achieve competitive performance on MOT17, DanceTrack, and BDD100K. Remarkably, our proposal outperforms the previous self-supervised trackers even when drastically reducing the annotation requirements by up to 400x.


# 172
Strong Double Blind
E3V-K5: An Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation

Shengxuming Zhang · Lei Jin · Yifan Wang · Xinyu Wang · Xu Wen · Zunlei Feng · Mingli Song

Accurately estimating energy expenditure (EE) is crucial for optimizing athletic training, monitoring daily activity levels, and preventing sports-related injuries. Estimating energy expenditure based on video (E3V) is an appealing research direction. This paper introduces E3V-K5, an authentic dataset of sports videos that significantly enhances the accuracy of EE estimation. The dataset comprises 16,526 video clips from various categories and intensities of sports with continuous calorie readings obtained from the COSMED K5 indirect calorimeter, recognized as the most reliable standard in sports research. Augmented with the heart rate and physical attributes of each subject, the volume, diversity, and authenticity of E3V-K5 surpass all previous video datasets in E3V, making E3V-K5 a cornerstone in this field and facilitating future research. Furthermore, we propose E3SFormer, a novel approach specifically designed for the E3V-K5 dataset, focusing on EE estimation using human skeleton data. E3SFormer consists of two Transformer branches for simultaneous action recognition and EE regression. The attention over joints from the action recognition branch is utilized to assist the EE regression branch. Extensive experimentation validates E3SFormer's effectiveness, demonstrating its superior performance over existing skeleton-based action recognition models. Our dataset and code will be publicly accessible.


# 171
Strong Double Blind
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

Mingfang Zhang · Yifei Huang · Ruicong Liu · Yoichi Sato

Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable for egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video. Due to the scarcity of labeled multimodal data, we design an MAE-based self-supervised pretraining method, obtaining a strong multi-modal representation by modeling the natural correlation between visual and motion signals. To model the complex relation of multiple IMU devices placed across the body, we exploit the collaborative dynamics of multiple IMU devices and propose to embed the relative motion features of human joints into a graph structure. Experiments show our method can achieve state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated by experiments in more challenging scenarios, including partially missing IMU devices and video quality corruption, enabling more flexible usage in the real world.


# 168
Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion

Quoc-Huy Tran · Muhammad Ahmed · Murad Popattia · Muhammad Hassan Ahmed · Andrey Konin · Zeeshan Zia

This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method CASA, where sequences of 3D skeleton coordinates are taken directly as input, our key idea is to use sequences of 2D skeleton heatmaps as input. Unlike CASA, which performs self-attention in the temporal domain only, we feed 2D skeleton heatmaps to a video transformer which performs self-attention in both the spatial and temporal domains for extracting effective spatiotemporal and contextual features. In addition, we introduce simple heatmap augmentation techniques based on 2D skeletons for self-supervised learning. Despite the lack of 3D information, our approach achieves not only higher accuracy but also better robustness against missing and noisy keypoints than CASA. Furthermore, extensive evaluations on three public datasets, i.e., Penn Action, IKEA ASM, and H2O, demonstrate that our approach outperforms previous methods in different fine-grained human activity understanding tasks. Finally, fusing 2D skeleton heatmaps with RGB videos yields state-of-the-art results on all metrics and datasets. To the best of our knowledge, our work is the first to utilize 2D skeleton heatmap inputs and the first to explore multi-modality fusion for temporal video alignment.
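As a small illustration of the input representation being advocated (2D skeleton heatmaps rather than 3D coordinates), the sketch below renders per-joint Gaussian heatmaps from 2D keypoints; the resolution and sigma are assumptions, and a temporal stack of such heatmaps would then be fed to the video transformer.

```python
import numpy as np


def keypoints_to_heatmaps(keypoints, height=64, width=64, sigma=2.0):
    """keypoints: (J, 2) array of (x, y) pixel coordinates -> (J, H, W) Gaussian heatmaps."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for j, (x, y) in enumerate(keypoints):
        heatmaps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps


joints = np.array([[32.0, 20.0], [30.0, 40.0], [34.0, 40.0]])  # toy 3-joint skeleton
hm = keypoints_to_heatmaps(joints)
print(hm.shape, hm.max())   # (3, 64, 64) 1.0
```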


# 338
Strong Double Blind
Occluded Gait Recognition with Mixture of Experts: An Action Detection Perspective

Panjian Huang · Yunjie Peng · Saihui Hou · Chunshui Cao · Xu Liu · Zhiqiang He · Yongzhen Huang

Extensive occlusions in real-world scenarios pose challenges to gait recognition due to missing and noisy information, as well as body misalignment in position and scale. We argue that the rich dynamic contextual information within a gait sequence inherently possesses occlusion-solving traits: 1) adjacent frames with gait continuity allow holistic body regions to infer occluded body regions; 2) gait cycles allow information integration between holistic actions and occluded actions. Therefore, we introduce an action detection perspective where a gait sequence is regarded as a composition of actions. To detect accurate actions under complex occlusion scenarios, we propose an Action Detection Based Mixture of Experts (GaitMoE), consisting of Mixture of Temporal Experts (MTE) and Mixture of Action Experts (MAE). MTE adaptively constructs action anchors with temporal experts and MAE adaptively constructs action proposals from action anchors with action experts. Notably, action detection serves as a proxy task trained end-to-end jointly with gait recognition using only ID labels. In addition, due to the lack of a unified occluded benchmark, we construct a pioneering Occluded Gait database (OccGait), containing rich occlusion scenarios and annotations of occlusion types. Extensive experiments on OccGait, OccCASIA-B, Gait3D and GREW demonstrate the superior performance of GaitMoE. OccGait is available at https://github.com/BNU-IVC/OccGait.


# 336
Strong Double Blind
Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization

Mengnan Liu · Le Wang · Sanping Zhou · Kun Xia · Qi Wu · Qilin Zhang · Gang Hua

Point-supervised temporal action localization pursues high-accuracy action detection under low-cost data annotation. Despite recent advances, a significant challenge remains: sparse labeling of individual frames leads to semantic ambiguity in determining action boundaries due to the lack of continuity in the highly sparse point-supervision scheme. We propose a Stepwise Multi-grained Boundary Detector (SMBD), which is composed of a Background Anchor Generator (BAG) and a Dual Boundary Detector (DBD) to provide fine-grained supervision. Specifically, for each epoch in the training process, BAG computes the optimal background snippet between each pair of adjacent action labels, which we term the Background Anchor. Subsequently, DBD leverages the background anchor and the action labels to locate the action boundaries from the perspectives of detecting action changes and scene changes. Then, the corresponding labels can be assigned to each side of the boundaries, with the boundaries continuously updated throughout the training process. Consequently, the proposed SMBD ensures that more snippets contribute to the training process. Extensive experiments on the THUMOS’14, GTEA and BEOID datasets demonstrate that the proposed method outperforms existing state-of-the-art methods.


# 43
Strong Double Blind
Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras

Hoonhee Cho · Sung-Hoon Yoon · Hyeokjun Kweon · Kuk-Jin Yoon

Event cameras excel in capturing high-contrast scenes and dynamic objects, offering a significant advantage over traditional frame-based cameras. Despite active research into leveraging event cameras for semantic segmentation, generating pixel-wise dense semantic maps for such challenging scenarios remains labor-intensive. As a remedy, we present EV-WSSS: a novel weakly supervised approach for event-based semantic segmentation that utilizes sparse point annotations. To fully leverage the temporal characteristics of event data, the proposed framework performs asymmetric dual-student learning between 1) the original forward event data and 2) the longer reversed event data, which contain complementary information from the past and the future, respectively. Besides, to mitigate the challenges posed by sparse supervision, we propose feature-level contrastive learning based on class-wise prototypes, carefully aggregated at both spatial region and sample levels. Additionally, we further excavate the potential of our dual-student learning model by exchanging prototypes between the two learning paths, thereby harnessing their complementary strengths. With extensive experiments on various datasets, including DSEC Night-Point with sparse point annotations newly provided by this paper, the proposed method achieves substantial segmentation results even without relying on pixel-level dense ground truths. The code and dataset will be published soon.


# 165
Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation

Haozhi Cao · Yuecong Xu · Jianfei Yang · Pengyu Yin · Xingyu Ji · Shenghai Yuan · Lihua Xie

Multi-modal test-time adaptation (MM-TTA) is proposed to adapt models to an unlabeled target domain by leveraging complementary multi-modal inputs in an online manner. Previous MM-TTA methods rely on predictions of cross-modal information in each input frame, while ignoring the fact that predictions of geometric neighborhoods within consecutive frames are highly correlated, leading to unstable predictions across time. To fill this gap, we propose ReLiable Spatial-temporal Voxels (Latte), an MM-TTA method that leverages reliable cross-modal spatial-temporal correspondences for multi-modal 3D segmentation. Motivated by the fact that reliable predictions should be consistent with their spatial-temporal correspondences, Latte aggregates consecutive frames in a sliding-window manner and constructs spatial-temporal (ST) voxels to capture temporally local prediction consistency for each modality. After filtering out ST voxels with high ST entropy, Latte conducts cross-modal learning for each point and pixel by attending to those with reliable and consistent predictions among both spatial and temporal neighborhoods. Experimental results show that Latte achieves state-of-the-art performance on three different MM-TTA benchmarks compared to previous MM-TTA or TTA methods.
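A rough sketch of the reliability idea: pool predictions from a sliding window of frames into spatial-temporal voxels, compute per-voxel prediction entropy, and keep only low-entropy voxels. The voxel size and entropy threshold are assumptions, and the cross-modal attending step is not shown.

```python
import numpy as np


def st_voxel_entropy_filter(points, probs, voxel_size=0.5, max_entropy=0.5):
    """points: (N, 3) coords pooled over consecutive frames; probs: (N, C) softmax scores."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    num_voxels, C = inverse.max() + 1, probs.shape[1]

    voxel_probs = np.zeros((num_voxels, C))
    np.add.at(voxel_probs, inverse, probs)                  # sum class probs per voxel
    voxel_probs /= voxel_probs.sum(axis=1, keepdims=True)   # renormalize

    entropy = -(voxel_probs * np.log(voxel_probs + 1e-8)).sum(axis=1)
    reliable_voxels = entropy < max_entropy
    return reliable_voxels[inverse]                         # per-point reliability mask


pts = np.random.rand(1000, 3) * 10
prb = np.random.dirichlet(np.ones(5), size=1000)
mask = st_voxel_entropy_filter(pts, prb)
print(mask.shape, mask.mean())
```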


# 175
X-Pose: Detecting Any Keypoints

Jie Yang · AILING ZENG · Ruimao Zhang · Lei Zhang

This work aims to address an advanced keypoint detection problem: how to accurately detect any keypoints in complex real-world scenarios, which involve massive, messy, and open-ended objects as well as their associated keypoint definitions. Current high-performance keypoint detectors often fail to tackle this problem due to their two-stage schemes, under-explored prompt designs, and limited training data. To bridge the gap, we propose X-Pose, a novel end-to-end framework with multi-modal (i.e., visual, textual, or their combinations) prompts to detect multi-object keypoints for any articulated (e.g., human and animal), rigid, and soft objects within a given image. Moreover, we introduce a large-scale dataset called UniKPT, which unifies 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances. Trained with UniKPT, X-Pose effectively aligns text-to-keypoint and image-to-keypoint thanks to the mutual enhancement of multi-modal prompts based on cross-modality contrastive learning. Our experimental results demonstrate that X-Pose achieves notable improvements of 27.7 AP, 6.44 PCK and 7.0 AP compared to state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods in each respective fair setting. More importantly, the in-the-wild test demonstrates X-Pose's strong fine-grained keypoint localization and generalization abilities across image styles, object categories, and poses, paving a new path to multi-object keypoint detection in real applications.


# 77
Open-Set Recognition in the Age of Vision-Language Models

Dimity Miller · Niko Suenderhauf · Alex Kenna · Keita Mason

Are vision-language models (VLMs) open-set models because they are trained on internet-scale datasets? We answer this question with a clear no -- VLMs introduce closed-set assumptions via their finite query set, making them vulnerable to open-set conditions. We systematically evaluate VLMs for open-set recognition and find they frequently misclassify objects not contained in their query set, leading to alarmingly low precision when tuned for high recall and vice versa. We show that naively increasing the size of the query set to contain more and more classes does not mitigate this problem, but instead diminishes both task performance and open-set performance. We establish a revised definition of the open-set problem for the age of VLMs, define a new benchmark and evaluation protocol to facilitate standardised evaluation and research in this important area, and evaluate promising baseline approaches based on predictive uncertainty and dedicated negative embeddings across a range of VLM classifiers and object detectors.


# 72
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu · Zhaoyang Zeng · Tianhe Ren · Feng Li · Hao Zhang · Jie Yang · Qing Jiang · Chunyuan Li · Jianwei Yang · Hang Su · Jun Zhu · Lei Zhang

In this paper, we develop an open-set object detector called Grounding DINO by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key to open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse the language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well in all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP.


# 191
Strong Double Blind
A Fair Ranking and New Model for Panoptic Scene Graph Generation

Julian Lorenz · Alexander Pest · Daniel Kienzle · Katja Ludwig · Rainer Lienhart

In panoptic scene graph generation (PSGG), models retrieve interactions between objects in an image which are grounded by panoptic segmentation masks. Previous evaluations on panoptic scene graphs have been subject to an erroneous evaluation protocol where multiple masks for the same object can lead to multiple relation distributions per mask-mask pair. This can be exploited to increase the final score. We correct this flaw and provide a fair ranking over a wide range of existing PSGG models. The observed scores for existing methods increase by up to 7.4 mR@50 for all two-stage methods, while dropping by up to 19.3 mR@50 for all one-stage methods, highlighting the importance of a correct evaluation. Contrary to recent publications, we show that existing two-stage methods are competitive to one-stage methods. Building on this, we introduce the Decoupled SceneFormer (DSFormer), a novel two-stage model that outperforms all existing scene graph models by a large margin of +11 mR@50 and +10 mNgR@50 on the corrected evaluation, thus setting a new SOTA. As a core design principle, DSFormer encodes subject and object masks directly into feature space.


# 73
Strong Double Blind
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao · Na Zhao · Jingjing Chen · Yu-Gang Jiang

Open-vocabulary 3D object detection (OV-3DDET) is a challenging task aimed at locating and recognizing objects in a 3D scene, encompassing both seen and previously unseen categories. Unlike in the vision and language domain where abundant training data is available to train generalized models, 3D detection models suffer from the scarcity of training data. Despite this challenge, the flourishing field of vision-language models (VLMs) offers valuable insights that can guide the learning process for OV-3DDET. While some efforts have been made to incorporate VLMs into OV-3DDET learning, existing methods often fall short in establishing a comprehensive association between 3D detectors and VLMs. In this paper, we investigate the utilization of VLMs for the task of open-vocabulary 3D detection. We use a vision model to guide novel class discovery in 3D scenes, and hierarchically align the 3D feature and vision-language feature space. Specifically, we employ an off-the-shelf 2D detector to seed and select novel 3D objects. The discovered novel objects are then stored for retraining the 3D detector. Finally, we align the 3D feature space with the vision-language feature space using a pre-trained vision-language model at the instance, category, and scene levels. Through extensive experimentation, we demonstrate substantial improvements in accuracy and generalization, underscoring the potential of VLMs in advancing 3D object detection for real-world applications.


# 125
Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance

Liting Lin · Heng Fan · Zhipeng Zhang · Yaowei Wang · Yong Xu · Haibin Ling

Motivated by Parameter-Efficient Fine-Tuning (PEFT) in large language models, we propose LoRAT, a method that unveils the power of larger Vision Transformers (ViT) for tracking within laboratory-level resources. The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency, to the domain of visual tracking. However, unique challenges and potential domain gaps make this transfer not as easy as the first intuition suggests. Firstly, a transformer-based tracker constructs unshared position embeddings for the template and search images. This poses a challenge for transferring LoRA, which usually requires consistency with the design of the pre-trained backbone, to downstream tasks. Secondly, the inductive bias inherent in convolutional heads diminishes the effectiveness of parameter-efficient fine-tuning in tracking models. To overcome these limitations, we first decouple the position embeddings in transformer-based trackers into shared spatial ones and independent type ones. The shared embeddings, which describe the absolute coordinates of multi-resolution images (namely, the template and search images), are inherited from the pre-trained backbones. In contrast, the independent embeddings indicate the sources of each token and are learned from scratch. Furthermore, we design an anchor-free head solely based on a multilayer perceptron (MLP) to adapt PETR, enabling better performance with less computational overhead. With our design, 1) it becomes practical to train trackers with the ViT-g backbone on GPUs with only 25.8 GB of memory (batch size of 16); 2) we reduce the training time of the L-224 variant from 35.0 to 10.8 GPU hours; 3) we improve the LaSOT SUC score from 0.703 to 0.743 with the L-224 variant; 4) we increase the inference speed of the L-224 variant from 52 to 119 FPS. Code and models will be released.
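As background for the PEFT component, here is a generic LoRA adapter wrapped around a frozen linear layer; the rank and scaling are illustrative choices, and the tracker-specific position-embedding decoupling and MLP head are not shown.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + scale * (B A) x ; only A and B receive gradients
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale


layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(4, 196, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)   # torch.Size([4, 196, 768]) 12288
```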


# 78
Strong Double Blind
A Simple Background Augmentation Method for Object Detection with Diffusion Model

YUHANG LI · Xin Dong · Chen Chen · Weiming Zhuang · Lingjuan Lyu

In computer vision, it is well known that a lack of data diversity impairs model performance. In this study, we address the challenge of enhancing dataset diversity in order to benefit various downstream tasks such as object detection and instance segmentation. We propose a simple yet effective data augmentation approach by leveraging advancements in generative models, specifically text-to-image synthesis technologies like Stable Diffusion. Our method focuses on generating variations of labeled real images, utilizing generative object and background augmentation via inpainting to augment existing training data without the need for additional annotations. We find that background augmentation, in particular, significantly improves the models' robustness and generalization capabilities. We also investigate how to adjust the prompt and mask to ensure the generated content does not violate the existing annotations. The efficacy of our augmentation techniques is validated through comprehensive evaluations on the COCO dataset and several other key object detection benchmarks, demonstrating notable enhancements in model performance across diverse scenarios. This approach offers a promising solution to the challenges of dataset enhancement, contributing to the development of more accurate and robust computer vision models.
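A hedged sketch of what inpainting-based background augmentation can look like with an off-the-shelf diffusion inpainting pipeline (Stable Diffusion via the diffusers library, used here purely for illustration); the checkpoint, prompt, and mask construction are assumptions, not the paper's pipeline.

```python
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("train_image.jpg").convert("RGB").resize((512, 512))
boxes = [(100, 150, 260, 400)]                  # annotated object boxes (x0, y0, x1, y1)

mask = Image.new("L", image.size, 255)          # white = repaint (background)
draw = ImageDraw.Draw(mask)
for box in boxes:
    draw.rectangle(box, fill=0)                 # black = keep (labeled objects)

augmented = pipe(
    prompt="a city street on a rainy evening, photorealistic",
    image=image,
    mask_image=mask,
).images[0]
augmented.save("train_image_bg_aug.jpg")        # box annotations remain valid
```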


# 67
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Zhening Huang · Xiaoyang Wu · Xi Chen · Hengshuang ZHAO · Lei Zhu · Joan Lasenby

In this work, we introduce OpenIns3D, a new framework for 3D open-vocabulary scene understanding at instance level. Unlike all existing methods, the proposed pipeline requires no well-aligned images as input and works effectively on a wide range of scenarios. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme, where the "Mask" module learns class-agnostic mask proposals in 3D point clouds, the "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision language models to extract interesting objects, and the "Lookup" module searches through the outcomes of "Snap" to assign category names to the proposed masks. This approach, free from 2D input requirements yet simple and flexible, achieves state-of-the-art performance across a wide range of 3D open-vocabulary tasks, including recognition, object detection, and instance segmentation, on both indoor and outdoor datasets. Moreover, OpenIns3D facilitates effortless switching between different 2D detectors without requiring retraining. When integrated with powerful 2D open-world models, it achieves excellent results in scene understanding tasks. Furthermore, when combined with LLM-powered 2D models, OpenIns3D exhibits a remarkable capability to comprehend and process highly complex text queries that demand intricate reasoning and real-world knowledge.


# 85
Strong Double Blind
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Xiaoyu Zhu · Hao Zhou · Pengfei Xing · Long Zhao · Hao Xu · Junwei Liang · Alexander G. Hauptmann · Ting Liu · Andrew Gallagher

Traditional 3D scene understanding techniques rely on supervised learning from densely annotated 3D datasets. However, the collection and annotation of 3D data is expensive and tedious, which leads to the scarcity of labeled training data. In this paper, we investigate the use of diffusion models, which are pre-trained on large-scale image-caption pairs, for open-vocabulary 3D scene understanding. We propose a novel method, Diff2Scene, which leverages frozen representations from text-image discriminative and generative models, along with salient-aware and geometric-aware masks, for open-vocabulary scene understanding. Diff2Scene gets rid of any labeled 3D data and effectively identifies objects, appearances, materials, locations and their compositions in 3D scenes using a single model. We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods in open-vocabulary 3D semantic segmentation tasks. In particular, Diff2Scene improves the state-of-the-art method on ScanNet200 by 12%.


# 93
Agent Attention: On the Integration of Softmax and Linear Attention

Dongchen Han · Tianzhu Ye · Yizeng Han · Xia Zhuofan · Siyuan Pan · Pengfei Wan · Shiji Song · Gao Huang

The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, Agent Attention, denoted as a quadruple (Q, A, K, V), introduces an additional set of agent tokens A into the conventional attention module. The agent tokens first act as agents for the query tokens Q to aggregate information from K and V, and then broadcast the information back to Q. Given that the number of agent tokens can be designed to be much smaller than the number of query tokens, agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention has shown remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code will be released.
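The (Q, A, K, V) computation described above can be sketched compactly: agent tokens first aggregate information from K/V, then broadcast it back to Q. The pooling used to obtain the agent tokens and the tensor shapes below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def agent_attention(q, k, v, num_agents=16):
    """q, k, v: (B, N, D). Cost scales with N * num_agents instead of N^2."""
    B, N, D = q.shape
    # Agent tokens: here simply adaptive-pooled queries (one possible choice).
    a = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)  # (B, M, D)

    scale = D ** -0.5
    agent_feats = F.softmax(a @ k.transpose(1, 2) * scale, dim=-1) @ v        # agents aggregate
    out = F.softmax(q @ a.transpose(1, 2) * scale, dim=-1) @ agent_feats      # agents broadcast
    return out                                                                # (B, N, D)


x = torch.randn(2, 4096, 64)           # long sequence, e.g. high-resolution tokens
print(agent_attention(x, x, x).shape)  # torch.Size([2, 4096, 64])
```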


# 115
Strong Double Blind
WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting

Jingjing Wu · Zhengyao Fang · Pengyuan Lyu · Chengquan Zhang · Fanglin Chen · Guangming Lu · Wenjie Pei

Transcription-only Supervised Text Spotting aims to learn text spotters relying only on transcriptions but no text boundaries for supervision, thus eliminating expensive boundary annotation. The crux of this task lies in locating each transcription in scene text images without location annotations. In this work, we formulate this challenging problem as a Weakly Supervised Cross-modality Contrastive Learning problem, and design a simple yet effective model dubbed WeCromCL that is able to detect each transcription in a scene image in a weakly supervised manner. Unlike typical methods for cross-modality contrastive learning that focus on modeling the holistic semantic correlation between an entire image and a text description, our WeCromCL conducts atomistic contrastive learning to model the character-wise appearance consistency between a text transcription and its correlated region in a scene image to detect an anchor point for the transcription in a weakly supervised manner. The detected anchor points by WeCromCL are further used as pseudo location labels to guide the learning of text spotting. Extensive experiments on four challenging benchmarks demonstrate the superior performance of our model over other methods. Code will be released.


# 48
Strong Double Blind
Agglomerative Token Clustering

Joakim Bruslund Haurum · Sergio Escalera · Graham W. Taylor · Thomas B. Moeslund

We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without introducing extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with the prior state-of-the-art when applied off-the-shelf, i.e., without fine-tuning. ATC is particularly effective at low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
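A minimal sketch of bottom-up token merging in the spirit of the abstract: cluster token features with average-linkage agglomerative clustering and replace each cluster by its mean. The cosine metric and linkage choice are assumptions, not necessarily the paper's configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def merge_tokens(tokens: np.ndarray, keep_rate: float = 0.25) -> np.ndarray:
    """tokens: (N, D) features -> roughly (N * keep_rate, D) merged tokens."""
    n_keep = max(1, int(round(len(tokens) * keep_rate)))
    dists = pdist(tokens, metric="cosine")                     # pairwise token distances
    tree = linkage(dists, method="average")                    # bottom-up hierarchy
    labels = fcluster(tree, t=n_keep, criterion="maxclust")    # cut into <= n_keep clusters
    return np.stack([tokens[labels == c].mean(axis=0) for c in np.unique(labels)])


tok = np.random.randn(196, 384).astype(np.float32)   # e.g. ViT patch tokens
print(merge_tokens(tok, keep_rate=0.25).shape)        # approximately (49, 384)
```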


# 79
Strong Double Blind
Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation

Hyunwoo Yu · Yubin Cho · Beoungwoo Kang · Seunghun Moon · Kyeongbo Kong · Suk-Ju Kang

We present an Encoder-Decoder Attention Transformer, EDAFormer, which consists of the Embedding-Free Transformer (EFT) encoder and an all-attention decoder leveraging our Embedding-Free Attention (EFA) structure. The proposed EFA is a novel global context modeling mechanism that focuses on modeling global non-linearity rather than the specific roles of the query, key and value. For the decoder, we explore an optimized structure for considering globality, which can improve semantic segmentation performance. In addition, we propose a novel Inference Spatial Reduction (ISR) method for computational efficiency. Different from previous spatial reduction attention methods, our ISR method reduces the key-value resolution at inference time, which narrows the computation-performance trade-off gap for efficient semantic segmentation. Our EDAFormer shows state-of-the-art performance with efficient computation compared to existing transformer-based semantic segmentation models on three public benchmarks, including ADE20K, Cityscapes and COCO-Stuff. Furthermore, our ISR method reduces the computational cost by up to 61% with minimal mIoU degradation on the Cityscapes dataset.
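To illustrate the general idea of inference-time spatial reduction (not the paper's exact EFA/ISR formulation), the sketch below average-pools the key/value feature map before attention while keeping queries at full resolution; the reduction ratio is an assumption and the attention layer is freshly initialized just to demonstrate shapes.

```python
import torch
import torch.nn.functional as F


def attention_with_kv_reduction(x, h, w, num_heads=4, reduction=2):
    """x: (B, H*W, C) tokens arranged on an h-by-w grid."""
    B, N, C = x.shape
    q = x                                                             # queries at full resolution
    kv = x.transpose(1, 2).reshape(B, C, h, w)
    kv = F.avg_pool2d(kv, kernel_size=reduction, stride=reduction)    # shrink keys/values
    kv = kv.flatten(2).transpose(1, 2)                                # (B, N / r^2, C)

    attn = torch.nn.MultiheadAttention(C, num_heads, batch_first=True)
    out, _ = attn(q, kv, kv)                                          # cost: N x (N / r^2)
    return out


x = torch.randn(1, 64 * 64, 128)
print(attention_with_kv_reduction(x, 64, 64).shape)   # torch.Size([1, 4096, 128])
```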


# 70
Strong Double Blind
3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

Xiaoxu Xu · Yitian Yuan · Jinlong Li · Qiudan Zhang · Zequn Jie · Lin Ma · Hao Tang · Niculae Sebe · Xu Wang

3D weakly supervised semantic segmentation aims to learn semantic segmentation without dense annotations. Previous methods mostly use Class Activation Maps to solve this challenge. In such a paradigm, the model is supervised with scene-level or subcloud-level labels, leaving the potential textual semantic information in the category labels under-explored. In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance, an alternative approach in which a 3D model predicts a dense embedding for each point that is co-embedded with both the aligned image and text spaces of the 2D vision-language model. Specifically, our method exploits the superior generalization ability of 2D vision-language models and proposes an Embeddings Soft-Guidance Stage to implicitly align 3D embeddings and text embeddings. Moreover, we introduce an Embeddings Specialization Stage to purify the feature representation with the help of the given scene-level labels, yielding better features supervised by the corresponding text embedding. Thus, the 3D model gains informative supervision from both the image embedding and the text embedding, leading to competitive segmentation performance. To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation using the textual semantic information of text category labels. Moreover, with extensive quantitative and qualitative experiments, we show that 3DSS-VLG not only achieves state-of-the-art performance on both the S3DIS and ScanNet datasets, but also maintains strong generalization capability.


# 60
Strong Double Blind
SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images

josh myers-dean · Jarek T Reynolds · Brian Price · Yifei Fan · Danna Gurari

Hierarchical segmentation entails creating segmentations at varying levels of granularity. We introduce the first hierarchical semantic segmentation dataset with subpart annotations for natural images, which we call SPIN (SubPartImageNet). We also introduce two novel evaluation metrics to evaluate how well algorithms capture spatial and semantic relationships across hierarchical levels. We benchmark modern models across three different tasks and analyze their strengths and weaknesses across objects, parts, and subparts. To facilitate community-wide progress, we publicly release our dataset at https://joshmyersdean.github.io/spin/index.html.


# 59
Strong Double Blind
Open-Vocabulary RGB-Thermal Semantic Segmentation

Guoqiang Zhao · JunJie Huang · Xiaoyun Yan · Zhaojing Wang · Junwei Tang · Yangjun Ou · Xinrong Hu · Tao Peng

RGB-Thermal (RGB-T) semantic segmentation is an important research branch of multi-modal image segmentation. The current RGB-T semantic segmentation methods generally have two unsolved and typical shortcomings. First, they do not have the open-vocabulary recognition ability, which significantly limits their application scenarios. Second, when fusing RGB and thermal images, they often need to design complex fusion network structures, which usually results in low network training efficiency. We present OpenRSS, the Open-vocabulary RGB-T Semantic Segmentation method, to solve these two disadvantages. To our knowledge, OpenRSS is the first RGB-T semantic segmentation method with open-vocabulary segmentation capability. OpenRSS modifies the basic segmentation model SAM for RGB-T semantic segmentation by adding the proposed thermal information prompt module and dynamic low-rank adaptation strategy to SAM. These designs effectively fuse the RGB and thermal information, but with much fewer trainable parameters than other methods. OpenRSS achieves the open-vocabulary capability by jointly utilizing the vision-language model CLIP and the modified SAM. Through extensive experiments, OpenRSS demonstrates its effective open-vocabulary semantic segmentation ability on RGB-T images. It outperforms other state-of-the-art RGB open-vocabulary semantic segmentation methods on multiple RGB-T semantic segmentation benchmarks: +12.1% mIoU on the MFNet dataset, +18.4% mIoU on the MCubeS dataset, and +21.4% mIoU on the Freiburg Thermal dataset.


# 63
PartSTAD: 2D-to-3D Part Segmentation Task Adaptation

Hyunjin Kim · Minhyuk Sung

We introduce PartSTAD, a method designed for the task adaptation of 2D-to-3D segmentation lifting. Recent studies have highlighted the advantages of utilizing 2D segmentation models to achieve high-quality 3D segmentation through few-shot adaptation. However, previous approaches have focused on adapting 2D segmentation models for domain shift to rendered images and synthetic text descriptions, rather than optimizing the model specifically for 3D segmentation. Our proposed task adaptation method finetunes a 2D bounding box prediction model with an objective function for 3D segmentation. We introduce weights for 2D bounding boxes for adaptive merging and learn the weights using a small additional neural network. Additionally, we incorporate SAM, a foreground segmentation model on a bounding box, to improve the boundaries of 2D segments and consequently those of 3D segmentation. Our experiments on the PartNet-Mobility dataset show significant improvements with our task adaptation approach, achieving a 7.0%p increase in mIoU and a 5.2%p improvement in mAP@50 for semantic and instance segmentation compared to the SotA few-shot 3D segmentation model. The code will be released publicly later.


# 55
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

Haobo Yuan · Xiangtai Li · Chong Zhou · Yining Li · Kai Chen · Chen Change Loy

The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into the CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naïve baselines of simply combining SAM and CLIP. Furthermore, aided with image classification data training, our method can segment and recognize approximately 22,000 classes. The code and models will be publicly available for further research.


# 54
Strong Double Blind
FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions

Sohyun Lee · Namyup Kim · Sungyeon Kim · Suha Kwak

Robust semantic segmentation under adverse conditions is of great importance in real-world applications. To address this challenging task in practical scenarios where labeled normal-condition images are not accessible during training, we propose FREST, a novel feature restoration framework for source-free domain adaptation (SFDA) of semantic segmentation to adverse conditions. FREST alternates two steps: (1) learning a condition embedding space that separates the condition information from the features and (2) restoring features of adverse-condition images in the learned condition embedding space. By alternating these two steps, FREST gradually restores features in which the effect of adverse conditions is reduced. FREST achieves state-of-the-art results on two public benchmarks (i.e., ACDC and RobotCar) for SFDA under adverse conditions. Moreover, it shows superior generalization ability on unseen datasets.


# 57
Strong Double Blind
Progressive Proxy Anchor Propagation for Unsupervised Semantic Segmentation

Hyun Seok Seong · WonJun Moon · SuBeen Lee · Jae-Pil Heo

The labor-intensive labeling for semantic segmentation has spurred the emergence of Unsupervised Semantic Segmentation. Recent studies utilize patch-wise contrastive learning based on features from image-level self-supervised pretrained models. However, relying solely on similarity-based supervision from image-level pretrained models often leads to unreliable guidance due to insufficient patch-level semantic representations. To address this, we propose Progressive Proxy Anchor Propagation (PPAP) strategy. This method gradually identifies more trustworthy positives of each anchor by relocating its proxy to densely populated regions of semantically similar samples. Specifically, we initially establish a tight boundary to gather a few reliable positive samples around each anchor. Then, considering the distribution of positive samples, we relocate the proxy anchor towards areas with a higher concentration of positives and adjust the positiveness boundary based on the propagation degree of the proxy anchor. In addition, there might exist ambiguous regions where positive and negative samples coexist near the positiveness boundary. Therefore, to further ensure the reliability of the negative set, we define an instance-wise ambiguous zone and exclude samples in such regions from the negative set. Our state-of-the-art performances on various datasets validate the effectiveness of the proposed method for Unsupervised Semantic Segmentation.


# 31
Strong Double Blind
Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation

Zhengyuan Xie · Haiquan Lu · Jia-wen Xiao · Enguang Wang · Le Zhang · Xialei Liu

Class incremental semantic segmentation aims to preserve old knowledge while learning new tasks, but is impeded by catastrophic forgetting and background shift. Prior works indicate the pivotal importance of initializing new classifiers and mainly focus on transferring knowledge from the background classifier or preparing classifiers for future classes, neglecting the alignment and variance of new classifiers. In this paper, we propose a new classifier pre-tuning (NeST) method applied before the formal training process, which learns a transformation from old classifiers to generate new classifiers for initialization rather than directly tuning the parameters of new classifiers. Our method makes new classifiers align with the backbone and adapt to the new data, benefiting both the stability and plasticity of the model. Besides, we design a strategy that considers cross-task class similarity to initialize the matrices used in the transformation. Experiments on the Pascal VOC 2012 and ADE20K datasets show that the proposed strategy significantly improves the performance of previous methods.


# 56
Strong Double Blind
Evaluating the Adversarial Robustness of Semantic Segmentation: Trying Harder Pays Off

Levente Ferenc Halmosi · Bálint Mohos · Márk Jelasity

Machine learning models are vulnerable to tiny adversarial input perturbations optimized to cause a very large output error. To measure this vulnerability, we need reliable methods that can find such adversarial perturbations. For image classification models, evaluation methodologies have emerged that have stood the test of time. However, we argue that in the area of semantic segmentation, a good approximation of the sensitivity to adversarial perturbations requires significantly more effort than what is currently considered satisfactory. To support this claim, we re-evaluate a number of well-known robust segmentation models in an extensive empirical study. We propose new attacks and combine them with the strongest attacks available in the literature. We also analyze the sensitivity of the models in fine detail. The results indicate that most of the state-of-the-art models have a dramatically larger sensitivity to adversarial perturbations than previously reported. We also demonstrate a size-bias: small objects are often more easily attacked, even if the large objects are robust, a phenomenon not revealed by current evaluation metrics. Our results also demonstrate that a diverse set of strong attacks is necessary, because different models are often vulnerable to different attacks.


# 61
Strong Double Blind
Pseudo-Embedding for Generalized Few-Shot Point Cloud Segmentation

Chih-Jung Tsai · Hwann-Tzong Chen · Tyng-Luh Liu

Existing generalized few-shot 3D segmentation (GFS3DS) methods typically prioritize enhancing the training of base-class prototypes while neglecting the rich semantic information within background regions for future novel classes. We introduce a novel GFS3DS learner that strategically leverages background context to improve both base prototype training and few-shot adaptability. Our method employs foundation models to extract semantic features from background points and grounds them on text embeddings to cluster the background points into pseudo-classes. This approach facilitates clearer base/novel class differentiation and generates pseudo prototypes that effectively mimic novel support samples. Comprehensive experiments on S3DIS and ScanNet datasets demonstrate the state-of-the-art performance of our method in both 1-shot and 5-shot tasks. Our approach significantly advances GFS3DS by unlocking the potential of background context, offering a promising avenue for broader applications. The GitHub repository of our implementation will be released upon publication.


# 52
Self-supervised co-salient object detection via feature correspondences at multiple scales

Souradeep Chakraborty · Dimitris Samaras

Our paper introduces a novel two-stage self-supervised approach for detecting co-occurring salient objects (CoSOD) in image groups without requiring segmentation annotations. Unlike existing unsupervised methods that rely solely on patch-level information (e.g. clustering patch descriptors) or on computation-heavy off-the-shelf components for CoSOD, our lightweight model leverages feature correspondences at both patch and region levels, significantly improving prediction performance. In the first stage, we train a self-supervised network that detects co-salient regions by computing local patch-level feature correspondences across images. We obtain the segmentation predictions using confidence-based adaptive thresholding. In the next stage, we refine these intermediate segmentations by eliminating the detected regions (within each image) whose averaged feature representations are dissimilar to the foreground feature representation averaged across all the cross-attention maps (from the previous stage). Extensive experiments on three CoSOD benchmark datasets show that our self-supervised model outperforms the corresponding state-of-the-art models by a huge margin (e.g. on the CoCA dataset, our model has a 13.7% F-measure gain over the SOTA unsupervised CoSOD model). Notably, our self-supervised model also outperforms several recent fully supervised CoSOD models on the three test datasets (e.g., on the CoCA dataset, our model has a 4.6% F-measure gain over a recent supervised CoSOD model).
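
To make the second-stage refinement concrete, here is a minimal sketch of one plausible reading: regions whose averaged feature is dissimilar to the group-level foreground feature are discarded. The threshold, feature dimensions, and the helper name refine_regions are hypothetical, not the paper's specification.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the authors' code) of the second-stage refinement:
# discard detected regions whose mean feature is dissimilar to the foreground
# feature averaged over the whole image group. The threshold is a placeholder.
def refine_regions(region_feats, group_foreground_feat, sim_thresh=0.5):
    """region_feats: [R, D] mean feature per detected region in one image.
       group_foreground_feat: [D] foreground feature averaged across the group."""
    sims = F.cosine_similarity(region_feats, group_foreground_feat.unsqueeze(0), dim=1)
    keep = sims >= sim_thresh          # keep only regions consistent with the group
    return keep

keep_mask = refine_regions(torch.randn(6, 128), torch.randn(128))
print(keep_mask)
```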


# 45
Strong Double Blind
Unsupervised Dense Prediction using Differentiable Normalized Cuts

Yanbin Liu · Stephen Gould

With the emergent attentive property of self-supervised Vision Transformer (ViT), Normalized Cuts (NCut) has resurfaced as a powerful tool for unsupervised dense prediction. However, the pre-trained ViT backbone (e.g., DINO) is frozen in existing methods, which makes the feature extractor suboptimal for dense prediction tasks. In this paper, we propose using Differentiable Normalized Cuts for self-supervised dense feature learning that can improve the dense prediction capability of existing pre-trained models. First, we review an efficient gradient formulation for the classical NCut algorithm. This formulation only leverages matrices computed and stored in the forward pass, making the backward pass highly efficient. Second, with NCut gradients in hand, we design a self-supervised dense feature learning architecture to finetune pre-trained models. Given two random augmented crops of an image, the architecture performs RoIAlign and NCut to generate two foreground masks of their overlapping region. Last, we propose a mask-consistency loss to back-propagate through NCut and RoIAlign for model training. Experiments show that our framework generalizes to various pre-training methods (DINO, MoCo and MAE), network configurations (ResNet, ViT-S and ViT-B), and tasks (unsupervised saliency detection, object discovery and semantic correspondence). Moreover, we achieved state-of-the-art results on unsupervised dense prediction benchmarks.


# 37
Robust Zero-Shot Crowd Counting and Localization with Adaptive Resolution SAM

Jia Wan · qiangqiang wu · Wei Lin · Antoni Chan

The existing crowd counting models require extensive training data, which is time-consuming to annotate. To tackle this issue, we propose a simple yet effective crowd counting method by utilizing the Segment-Everything-Everywhere Model (SEEM), an adaptation of the Segmentation Anything Model (SAM), to generate pseudo-labels for training crowd counting models. However, our initial investigation reveals that SEEM's performance in dense crowd scenes is limited, primarily due to the omission of many persons in high-density areas. To overcome this limitation, we propose an adaptive resolution SEEM to handle the scale variations, occlusions, and overlapping of people within crowd scenes. Alongside this, we introduce a robust localization method, based on Gaussian Mixture Models, for predicting the head positions in the predicted people masks. Given the mask and point pseudo-labels, we propose a robust loss function, which is designed to exclude uncertain regions based on SEEM's predictions, thereby enhancing the training process of the counting networks. Finally, we propose an iterative method for generating pseudo-labels. This method aims at improving the quality of the segmentation masks by identifying more tiny persons in high-density regions, which are often missed in the first pseudo-labeling stage. Overall, our proposed method achieves the best unsupervised performance in crowd counting, while also achieving results comparable to some supervised methods. This makes it a highly effective and versatile tool for crowd counting, especially in situations where labeled data is not available.
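
The Gaussian-mixture localization step can be pictured with a small sketch: fit a GMM to the coordinates of pixels inside a predicted people mask and read off the component means as point estimates. The number of components is assumed known here; the paper's robust variant is more involved than this.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch (not the authors' method): locate point annotations inside a
# predicted people mask by fitting a Gaussian Mixture Model to the mask's pixel
# coordinates and using the component means as head-position estimates.
def localize_points(mask, n_people):
    ys, xs = np.nonzero(mask)                    # coordinates of mask pixels
    coords = np.stack([xs, ys], axis=1).astype(float)
    gmm = GaussianMixture(n_components=n_people, covariance_type="full", random_state=0)
    gmm.fit(coords)
    return gmm.means_                            # one (x, y) estimate per person

mask = np.zeros((64, 64), dtype=bool)
mask[10:20, 10:18] = True                        # two synthetic blobs
mask[40:52, 30:40] = True
print(localize_points(mask, n_people=2))
```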


# 34
Strong Double Blind
Bayesian Detector Combination for Object Detection with Crowdsourced Annotations

Zhi Qin Tan · Olga Isupova · Gustavo Carneiro · Xiatian Zhu · Yunpeng Li

Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise, especially in crowdsourcing scenarios. Most prior object detection methods assume accurate annotations; a few recent works have studied object detection with noisy crowdsourced annotations, with evaluation on distinct synthetic crowdsourced datasets of varying setups under artificial assumptions. To address these algorithmic limitations and evaluation inconsistency, we first propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations, with the unique ability of automatically inferring the annotators' label qualities. Unlike previous approaches, BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models. Due to the scarcity of real-world crowdsourced datasets, we introduce large synthetic datasets by simulating varying crowdsourcing scenarios. This allows consistent evaluation of different models at scale. Extensive experiments on both real and synthetic crowdsourced datasets show that BDC outperforms existing state-of-the-art methods, demonstrating its superiority in leveraging crowdsourced data for object detection.


# 29
Strong Double Blind
Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection

QIJIE MO · Yipeng Gao · Shenghao Fu · Junkai Yan · Ancong Wu · WEISHI ZHENG

In incremental object detection, knowledge distillation has been proven to be an effective way to alleviate catastrophic forgetting. However, previous works focused on preserving the knowledge of old models, ignoring that images could simultaneously contain categories from past, present, and future stages. The co-occurrence of objects makes the optimization objectives inconsistent across different stages since the definition for foreground objects differs across various stages, greatly limiting the model's performance. To overcome this problem, we propose a method called "Bridge Past and Future" (BPF), which aligns models across stages, ensuring consistent optimization directions. In addition, we propose a novel Distillation with Future (DwF) loss, fully leveraging the background probability to mitigate the forgetting of old classes while ensuring a high level of adaptability in learning new classes. Extensive experiments are conducted on both Pascal VOC and MS COCO benchmarks. Without memory, BPF outperforms current state-of-the-art methods under various settings.


# 39
Strong Double Blind
Bucketed Ranking-based Losses for Efficient Training of Object Detectors

Feyza Yavuz · Baris Can Cam · Adnan Harun Dogan · Kemal Oksuz · Emre Akbas · Sinan Kalkan

Ranking-based loss functions, such as Average Precision Loss and Rank&Sort Loss, outperform widely used score-based losses in object detection. These loss functions better align with the evaluation criteria, have fewer hyperparameters, and offer robustness against the imbalance between positive and negative classes. However, they require pairwise comparisons among $P$ positive and $N$ negative predictions, introducing a time complexity of $\mathcal{O}(PN)$, which is prohibitive since $N$ is often large (e.g., $10^8$ in ATSS). Despite their advantages, the widespread adoption of ranking-based losses has been hindered by their high time and space complexities. In this paper, we focus on improving the efficiency of ranking-based loss functions. To this end, we propose Bucketed Ranking-based Losses which group negative predictions into $B$ buckets ($B \ll N$) in order to reduce the number of pairwise comparisons so that time complexity can be reduced. Our method reduces the time complexity to $\mathcal{O}(\max (N \log(N), P^2))$. To validate our method and show its generality, we conducted experiments on 2 different tasks, 3 different datasets, and 7 different detectors. We show that Bucketed Ranking-based (BR) Losses yield the same accuracy as the unbucketed versions and provide $2\times$ faster training on average. We also train, for the first time, transformer-based object detectors using ranking-based losses, thanks to the efficiency of our BR Losses. When we train CoDETR, a state-of-the-art transformer-based object detector, using our BR Loss, we consistently outperform its original results over several different backbones. Code will be released.
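
The bucketing idea lends itself to a short sketch (not the paper's exact loss): sort the $N$ negative scores once, split them into $B$ buckets, and compare each positive against bucket summaries weighted by bucket size, replacing $P \times N$ comparisons with $P \times B$. The temperature, bucket count, and soft-violation form below are illustrative assumptions.

```python
import torch

# Minimal sketch (not the authors' loss) of the bucketing idea: rather than
# comparing every positive with every negative (O(PN)), sort the N negative
# scores, group them into B buckets, and count rank violations against bucket
# means weighted by bucket size.
def bucketed_rank_violations(pos_scores, neg_scores, num_buckets=100):
    neg_sorted, _ = torch.sort(neg_scores, descending=True)
    buckets = torch.chunk(neg_sorted, num_buckets)                 # B groups of negatives
    bucket_means = torch.stack([b.mean() for b in buckets])        # [B]
    bucket_sizes = torch.tensor([float(len(b)) for b in buckets])
    # soft count of negatives ranked above each positive, via P x B comparisons
    margins = bucket_means.unsqueeze(0) - pos_scores.unsqueeze(1)  # [P, B]
    violations = torch.sigmoid(margins / 0.1) * bucket_sizes       # weight by bucket size
    return violations.sum(dim=1)                                   # approx. rank of each positive

pos = torch.rand(8)
neg = torch.rand(100_000)
print(bucketed_rank_violations(pos, neg))
```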


# 33
Strong Double Blind
Better Regression Makes Better Test-time Adaptive 3D Object Detection

Jiakang Yuan · Bo Zhang · Kaixiong Gong · Xiangyu Yue · Botian Shi · Yu Qiao · Tao Chen

Domain Adaptation (DA) has been widely explored and made significant progress on cross-domain 3D tasks recently. Despite being effective, existing works fail to deal with rapidly changing domains due to unpredictable test-time scenarios and the requirement for fast response times. Thus, we explore a new task named test-time domain adaptive 3D object detection and propose RegTTA3D, a pseudo-label-based test-time adaptive 3D object detection method. By investigating the factor that limits the detection accuracy, we find that regression is essential in this task. To make better regression, we first design a noise-consistency pseudo-label generation process to filter pseudo-labels with instability under noise interference and obtain reliable pseudo-labels. Then, confidence-guided regression refinement is introduced, which uses the box regression results of high-confidence boxes to supervise boxes with relatively low confidence, further making the predicted box size gradually approach the distribution of the target domain. Finally, to better update the regression layer and alleviate the class-imbalance issue, a class-balance EMA updating strategy is proposed. Experimental results on multiple cross-domain scenarios including cross-beam, cross-location, and cross-weather demonstrate that RegTTA3D can achieve comparable or even better performance compared to unsupervised domain adaptation works by only updating less than 0.1% of the parameters within less than 1% of the time. Code will be made public.


# 21
Strong Double Blind
MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection

Ziyue Huang · Yongchao Feng · Qingjie Liu · Yunhong Wang

Detection pre-training methods for DETR-series detectors have been extensively studied in natural scenes, i.e., DETReg. However, detection pre-training remains unexplored in remote sensing scenes. In existing pre-training methods, alignment between object embeddings extracted from a pre-trained backbone and detector features is significant. However, due to differences in feature extraction, a pronounced feature discrepancy still exists and hinders pre-training performance. Remote sensing images, with complex environments and more densely distributed objects, exacerbate the discrepancy. In this work, we propose a novel Mutually optimizing pre-training framework for remote sensing object Detection, dubbed MutDet. In MutDet, we propose a systematic solution to this challenge. Firstly, we propose a mutual enhancement module, which fuses the object embeddings and detector features bidirectionally in the last encoder layer, enhancing their information interaction. Secondly, a contrastive alignment loss is employed to guide this alignment process softly while simultaneously enhancing the discriminability of detector features. Finally, we design an auxiliary siamese head to mitigate the task gap arising from the introduction of the enhancement module. Comprehensive experiments on various settings show new state-of-the-art transfer performance. The improvement is particularly pronounced when data quantity is limited. When using 10% of the DIOR-R data, MutDet improves DETReg by 6.1% in AP_50. The code and models will be released.


# 58
Strong Double Blind
IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection

Mingjin Zhang · Yuchun Wang · Jie Guo · Yunsong Li · Xinbo Gao · Jing Zhang

The recent Segment Anything Model (SAM) is a significant advancement in natural image segmentation, exhibiting potent zero-shot performance suitable for various downstream image segmentation tasks. However, directly utilizing the pretrained SAM for Infrared Small Target Detection (IRSTD) task falls short in achieving satisfying performance due to a notable domain gap between natural and infrared images. Unlike a visible light camera, a thermal imager reveals an object's temperature distribution by capturing infrared radiation. Small targets often show a subtle temperature transition at the object's boundaries. To address this issue, we propose the IRSAM model for IRSTD, which improves SAM's encoder-decoder architecture to learn better feature representation of infrared small objects. Specifically, we design a Perona-Malik diffusion (PMD)-based block and incorporate it into multiple levels of SAM's encoder to help it capture essential structural features while suppressing noise. Additionally, we devise a Granularity-Aware Decoder (GAD) to fuse the multi-granularity feature from the encoder to capture structural information that may be lost in long-distance modeling. Extensive experiments on the public datasets, including NUAA-SIRST, NUDT-SIRST, and IRSTD-1K, validate the design choice of IRSAM and its significant superiority over representative state-of-the-art methods. The source code will be released.
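
Since the encoder block builds on Perona-Malik diffusion, a compact reference implementation of the classical PMD update is shown below for background. The paper's block embeds this idea into SAM's encoder features rather than operating on raw images, so this is a sketch of the underlying operator, not the IRSAM module; the iteration count and conductance parameters are arbitrary.

```python
import numpy as np

# Minimal sketch (not the authors' PMD block): classical Perona-Malik anisotropic
# diffusion, which smooths homogeneous regions while preserving edges -- the
# property exploited to keep small-target structure while suppressing noise.
def perona_malik(img, n_iter=20, kappa=30.0, lam=0.2):
    img = img.astype(float)
    for _ in range(n_iter):
        # differences toward the four neighbors (circular padding for brevity)
        dn = np.roll(img, 1, axis=0) - img
        ds = np.roll(img, -1, axis=0) - img
        de = np.roll(img, -1, axis=1) - img
        dw = np.roll(img, 1, axis=1) - img
        # edge-stopping conductance: small across strong edges
        c = lambda d: np.exp(-(d / kappa) ** 2)
        img = img + lam * (c(dn) * dn + c(ds) * ds + c(de) * de + c(dw) * dw)
    return img

noisy = np.random.rand(128, 128) * 255
smoothed = perona_malik(noisy)
```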


# 50
Semi-supervised Segmentation of Histopathology Images with Noise-Aware Topological Consistency

Meilong Xu · Xiaoling Hu · Saumya Gupta · Shahira Abousamra · Chao Chen

In digital pathology, segmenting densely distributed objects like glands and nuclei is crucial for downstream analysis. Since detailed pixel-wise annotations are very time-consuming, we need semi-supervised segmentation methods that can learn from unlabeled images. Existing semi-supervised methods are often prone to topological errors, e.g., missing or incorrectly merged/separated glands or nuclei. To address this issue, we propose TopoSemiSeg, the first semi-supervised method that learns the topological representation from unlabeled histopathology images. The major challenge is that for unlabeled images we only have predictions carrying noisy topology. To this end, we introduce a noise-aware topological consistency loss to align the representations of a teacher and a student model. By decomposing the topology of the prediction into signal topology and noisy topology, we ensure that the models learn the true topological signals and become robust to noise. Extensive experiments on public histopathology image datasets show the superiority of our method, especially on topology-aware evaluation metrics.


# 51
Strong Double Blind
The Devil is in the Statistics: Mitigating and Exploiting Statistics Difference for Generalizable Semi-supervised Medical Image Segmentation

Muyang Qiu · Jian Zhang · Lei Qi · Qian Yu · Yinghuan Shi · Yang Gao

Despite the recent success of domain generalization in medical image segmentation, it is known that precisely conducting voxel-wise annotation for all available source domains remains a huge burden. To combat this challenge, as a new setting, semi-supervised domain generalization has been proposed very recently by leveraging limited labeled data along with abundant unlabeled data collected from multiple medical institutions to enhance model generalization. To achieve promising results in this setting, correctly harnessing unlabeled data while improving the generalization ability to unseen domains plays a critical role simultaneously. In this work, we observe that the domain shifts between medical institutions (i.e., source domains) cause disparate feature statistics, which significantly deteriorates the pseudo-label quality due to an unexpected normalization process. Nevertheless, this phenomenon could also be exploited to facilitate unseen domain generalization. Therefore, to fully exploit the potential of the statistics difference while mitigating its negative impacts, in this paper we propose 1) multiple statistics-individual branches to reduce the interference of domain shifts between source domains for reliable pseudo-labels and 2) one statistics-aggregated branch for domain-invariant feature learning. Furthermore, to simulate unseen domains with the statistics difference, we approach this from two aspects, i.e., a perturbation with histogram matching from the image level and a random batch normalization selection strategy from the feature level, producing diverse statistics to expand the training distribution. Evaluation results on the Prostate, Fundus, and M&Ms datasets demonstrate the effectiveness of our method compared with recent SOTA methods. The code will be available in Supplementary Materials.
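
The image-level perturbation relies on histogram matching, sketched below with plain NumPy. How the reference image is chosen and where the perturbation sits in the training pipeline are assumptions, not the paper's recipe.

```python
import numpy as np

# Minimal sketch (not the authors' pipeline) of the image-level perturbation:
# match the intensity histogram of a source-domain image to that of a reference
# image from another source domain, simulating a domain shift in appearance.
def match_histogram(source, reference):
    src_vals, src_idx, src_counts = np.unique(source.ravel(),
                                              return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # map each source quantile to the reference intensity at the same quantile
    matched_vals = np.interp(src_cdf, ref_cdf, ref_vals)
    return matched_vals[src_idx].reshape(source.shape)

src = np.random.randint(0, 256, (64, 64))
ref = np.random.randint(50, 200, (64, 64))
perturbed = match_histogram(src, ref)
```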


# 49
Strong Double Blind
A Rotation-invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images

Tianyi Liu · Shuaishuai S Zhuang · Jiacheng Nie · Geng Chen · Yusheng Guo · Guangquan Zhou · Jean-Louis Coatrieux · Yang Chen

Endoscopic Ultrasound (EUS) is advantageous in perceiving hierarchical changes in the esophageal tract wall for diagnosing submucosal tumors. However, the lesions often disrupt the structural integrity and fine-grained texture information of the esophageal layer, impeding the accurate diagnosis. Moreover, the lesions can appear in any radial position due to the characteristics of EUS imaging, further increasing the difficulty of diagnosis. In this study, we advance an automatic classification model by equipping the Vision Transformer (ViT), a recent state-of-the-art model, with a novel statistical rotation-invariant reinforcement mechanism dubbed SRRM-ViT. Mainly, we adaptively select crucial regions to avoid interference from irrelevant information in the image. Also, this model integrates histogram statistical features with rotation invariance into the self-attention mechanism, achieving bias-free capture of fine-grained information of lesions at arbitrary radial positions. Validated by in-house clinical data and public data, SRRM-ViT has demonstrated remarkable performance improvements, which demonstrates the efficacy and potential of our approach in EUS image classification. Keywords: Fine-Grained Visual Classification (FGVC), Endoscopic Ultrasound (EUS), Rotation Invariant, Token Selection.


# 47
Strong Double Blind
Multistain Pretraining for Slide Representation Learning in Pathology

Guillaume Jaume · Anurag J Vaidya · Andrew Zhang · Andrew Song · Richard J Chen · Sharifa Sahai · Dandan Mo · Emilio Madrigal · Long P Le · Faisal Mahmood

Developing self-supervised learning (SSL) models that can learn universal and transferable representations of H&E gigapixel whole-slide images (WSIs) is becoming increasingly valuable in computational pathology. These models hold the potential to advance critical tasks such as few-shot classification, slide retrieval, and patient stratification. Existing approaches for slide representation learning extend the principles of SSL from small images (e.g., 224 x 224 patches) to entire slides, usually by aligning two different augmentations (or views) of the slide. Yet the resulting representation remains constrained by the limited clinical and biological relevance of the views. Instead, we postulate that slides stained with multiple markers, such as immunohistochemistry or special stains, can be seen as different views of the same tissue and can constitute a rich task-agnostic training signal. To this end, we introduce MADELEINE, a multimodal pretraining strategy for slide representation learning. MADELEINE is trained with a dual global-local cross-stain alignment objective on large cohorts of breast cancer samples (N=4,211 WSIs across five stains) and kidney transplant samples (N=12,070 WSIs across four stains). We demonstrate the superior quality of slide representations learned by MADELEINE on various downstream evaluations, ranging from morphological and molecular classification to prognostic prediction, on a total of 21 tasks using 7,299 WSIs from multiple medical centers. Code will be released upon acceptance.


# 46
Strong Double Blind
Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data

Zhengfeng Lai · Joohi Chauhan · Brittany N. Dugger · Chen-Nee Chuah

Contrastive Language-Image Pre-training (CLIP) has shown its proficiency in acquiring distinctive visual representations and exhibiting strong generalization across diverse vision tasks. However, its effectiveness in pathology image analysis, particularly with limited labeled data, remains an ongoing area of investigation due to challenges associated with significant domain shifts and catastrophic forgetting. Thus, it is imperative to devise efficient adaptation strategies in this domain to enable scalable analysis. In this study, we introduce Path-CLIP, a framework tailored for a swift adaptation of CLIP to various pathology tasks. Firstly, we propose Residual Feature Refinement (RFR) with a dynamically adjustable ratio to effectively integrate and balance source and task-specific knowledge. Secondly, we introduce Hidden Representation Perturbation (HRP) and Dual-view Vision Contrastive (DVC) techniques to mitigate overfitting issues. Finally, we present the Doublet Multimodal Contrastive Loss (DMCL) for fine-tuning CLIP for pathology tasks. We demonstrate that Path-CLIP adeptly adapts pre-trained CLIP to downstream pathology tasks, yielding competitive results. Specifically, Path-CLIP achieves over +19% improvement in accuracy when utilizing a mere 0.1% of labeled data in PCam with only 10 minutes of fine-tuning while running on a single GPU.


# 40
Strong Double Blind
HERGen: Elevating Radiology Report Generation with Longitudinal Data

Fuying Wang · Shenghui Du · Lequan Yu

Radiology reports provide detailed descriptions of medical imaging integrated with patients' medical histories, while report writing is traditionally labor-intensive, increasing radiologists' workload and the risk of diagnostic errors. Emerging research in automating this process promises to alleviate these challenges by enhancing accuracy, reducing errors, and streamlining clinical workflows. However, existing automated approaches are based on a single timestamp and often neglect the critical temporal aspect of patients' imaging histories, which is essential for accurate longitudinal analysis. To address this gap, we propose a novel History Enhanced Radiology Report Generation (HERGen) framework that employs a group causal transformer to efficiently integrate longitudinal data across patient visits. Our approach not only allows for comprehensive analysis of varied historical data but also improves the quality of generated reports through an auxiliary contrastive objective that aligns image sequences with their corresponding reports. More importantly, we introduce a curriculum learning-based strategy to adeptly handle the inherent complexity of longitudinal radiology data and thus stabilize the optimization of our framework. The extensive evaluations across three datasets demonstrate that our framework surpasses existing methods in generating accurate radiology reports and effectively predicting disease progression from medical images.


# 197
Defect Spectrum: A Granular Look of Large-scale Defect Datasets with Rich Semantics

Shuai Yang · ZhiFei Chen · Pengguang Chen · Xi Fang · Yixun Liang · Shu Liu · Yingcong Chen

Defect inspection is paramount within the closed-loop manufacturing system. However, existing datasets for defect inspection often lack the precision and semantic granularity required for practical applications. In this paper, we introduce the Defect Spectrum, a comprehensive benchmark that offers precise, semantic-abundant, and large-scale annotations for a wide range of industrial defects. Building on four key industrial benchmarks, our dataset refines existing annotations and introduces rich semantic details, distinguishing multiple defect types within a single image. With our dataset, we were able to achieve an increase of 10.74% in the Recall rate and a decrease of 33.10% in the False Positive Rate (FPR) in the industrial simulation experiment. Furthermore, we introduce Defect-Gen, a two-stage diffusion-based generator designed to create high-quality and diverse defective images, even when working with limited defective data. The synthetic images generated by Defect-Gen significantly enhance the performance of defect segmentation models, achieving an improvement in mIoU scores of up to 9.85 on Defect-Spectrum subsets. Overall, the Defect Spectrum dataset demonstrates its potential in defect inspection research, offering a solid platform for testing and refining advanced models. Our code and datasets are released at https://envision-research.github.io/Defect_Spectrum.


# 36
Strong Double Blind
Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis

Brian Isaac Medina · Yona Falinie Abdul Gaus · Neelanjan Bhowmik · Toby P Breckon

Object detection is a pivotal task in computer vision, focused upon localising and categorising objects within the distribution for which they are trained. Nonetheless, the capability of an object detector to localise objects out of the training distribution remains largely unexplored. Whilst recent approaches in object-level out-of-distribution (OoD) detection heavily rely on class-wise labels, such approaches contradict truly open-world scenarios where the number of classes is often unknown. In this context, anomaly detection focuses on detecting unseen instances rather than identifying an object as OoD. This work aims to bridge this gap by leveraging an open-world object detector in conjunction with a self-supervised OoD detector via virtual outlier synthesis. This is achieved by using the detector backbone features to first learn object pseudo-classes in an unsupervised manner. Subsequently, these pseudo-classes serve as the basis for the class-conditional virtual outlier sampling of anomalous features that are classified by an OoD head. Our approach empowers our overall object detector architecture to learn anomaly-aware feature representations without relying on class labels, hence enabling truly open-world object anomaly detection. Empirical validation of our approach demonstrates its effectiveness across diverse datasets encompassing various imaging modalities (visible, infrared, and X-ray). Moreover, our method establishes state-of-the-art performance on object-level anomaly detection, achieving an average recall score improvement of over 5.4% for natural images and 23.5% for a security X-ray dataset compared to the current approaches. In addition, our method can detect anomalies in datasets where current approaches fail. Code is available at .
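
The class-conditional virtual outlier sampling step can be sketched in the spirit of virtual outlier synthesis (this is not the authors' exact procedure): fit a Gaussian to one pseudo-class's features, over-sample candidates, and keep the least likely ones as virtual outliers for training an OoD head. The candidate counts and regularizer are arbitrary.

```python
import numpy as np

# Minimal sketch (not the authors' method) of class-conditional virtual outlier
# sampling: fit a Gaussian to the features of one pseudo-class, draw many samples,
# and keep the lowest-likelihood ones as "virtual outliers" for the OoD head.
def sample_virtual_outliers(feats, n_candidates=10000, n_keep=50, eps=1e-3):
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + eps * np.eye(feats.shape[1])
    cand = np.random.multivariate_normal(mu, cov, size=n_candidates)
    # (unnormalized) Gaussian log-density of each candidate under the class model
    inv = np.linalg.inv(cov)
    logdet = np.linalg.slogdet(cov)[1]
    diff = cand - mu
    logp = -0.5 * (np.einsum("nd,dk,nk->n", diff, inv, diff) + logdet)
    keep = np.argsort(logp)[:n_keep]          # least likely candidates
    return cand[keep]

pseudo_class_feats = np.random.randn(500, 32)
outliers = sample_virtual_outliers(pseudo_class_feats)
```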


# 38
Strong Double Blind
AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

Yunkang Cao · Jiangning Zhang · Luca Frittoli · Yuqi Cheng · Weiming Shen · Giacomo Boracchi

Zero-shot anomaly detection (ZSAD) targets the identification of anomalies within images from arbitrary novel categories. This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. AdaCLIP incorporates learnable prompts into CLIP and optimizes them through training on auxiliary annotated anomaly detection data. Two types of learnable prompts are proposed: \textit{static} and \textit{dynamic}. Static prompts are shared across all images, serving to preliminarily adapt CLIP for ZSAD. In contrast, dynamic prompts are generated for each test image, providing CLIP with dynamic adaptation capabilities. The combination of static and dynamic prompts is referred to as hybrid prompts, and yields enhanced ZSAD performance. Extensive experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods and can generalize better to different categories and even domains. Finally, our analysis highlights the importance of diverse auxiliary data and optimized prompts for enhanced generalization capacity. Code is available at \texttt{removed for blind review}.


# 35
Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection

Xincheng Yao · Ruoqi Li · Zefeng Qian · lu wang · Chongyang Zhang

Unified anomaly detection (AD) is one of the most challenging settings for anomaly detection, where one unified model is trained with normal samples from multiple classes with the objective of detecting anomalies in these classes. For such a challenging task, popular normalizing flow (NF) based AD methods may fall into a "homogeneous mapping" issue, where the NF-based AD models are biased to generate similar latent representations for both normal and abnormal features, thereby leading to a high missing rate of anomalies. In this paper, we propose a novel Hierarchical Gaussian mixture normalizing flow modeling method for accomplishing unified Anomaly Detection, which we call HGAD. Our HGAD consists of two key components: inter-class Gaussian mixture modeling and intra-class mixed class centers learning. Compared to the previous NF-based AD methods, the hierarchical Gaussian mixture modeling approach can bring stronger representation capability to the latent space of normalizing flows, so that even complex multi-class distributions can be well represented and learned in the latent space. In this way, we can avoid mapping different class distributions into the same single Gaussian prior, thus effectively avoiding or mitigating the "homogeneous mapping" issue. We further show that the more distinguishable the class centers are, the more conducive they are to avoiding the bias issue. Thus, we further propose a mutual information maximization loss for better structuring the latent feature space. We evaluate our method on four real-world AD benchmarks, where we can significantly improve the previous NF-based AD methods and also outperform the SOTA unified AD methods.
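
The inter-class Gaussian mixture prior can be illustrated with a minimal sketch of a mixture log-density over class centers, which would replace a single-Gaussian prior term in a normalizing-flow objective. The shared isotropic variance, fixed centers, and uniform mixture weights here are simplifying assumptions, not the paper's hierarchical formulation.

```python
import torch

# Minimal sketch (not the authors' objective): a mixture-of-Gaussians prior over
# per-class centers for the latent codes of a normalizing flow, so that a
# multi-class latent distribution is not forced onto a single Gaussian mode.
def mixture_log_prior(z, centers, log_weights, sigma=1.0):
    """z: [N, D] latent codes, centers: [C, D], log_weights: [C]."""
    d2 = torch.cdist(z, centers) ** 2                           # [N, C] squared distances
    log_gauss = -0.5 * d2 / sigma**2 - 0.5 * z.shape[1] * torch.log(
        torch.tensor(2 * torch.pi * sigma**2))
    return torch.logsumexp(log_weights + log_gauss, dim=1)      # [N] log p(z)

z = torch.randn(16, 64)                          # latent codes from the flow
centers = torch.randn(10, 64)                    # learnable class centers in practice
log_w = torch.log_softmax(torch.zeros(10), dim=0)
nll = -(mixture_log_prior(z, centers, log_w)).mean()   # prior term of the NF training loss
```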


# 90
Strong Double Blind
A Unified Image Compression Method for Human Perception and Multiple Vision Tasks

Sha Guo · Sui Lin · Chen-Lin Zhang · Zhuo Chen · Wenhan Yang · Lingyu Duan

Recent advancements in end-to-end image compression demonstrate the potential to surpass traditional codecs regarding rate-distortion performance. However, current methods either prioritize human perceptual quality or solely optimize for one or a few predetermined downstream tasks, neglecting a more common scenario that involves a variety of unforeseen machine vision tasks. In this paper, we propose a Diffusion-Based Multiple-Task Unified Image Compression framework that aims to expand the boundary of traditional image compression by incorporating human perception and multiple vision tasks in open-set scenarios. Our proposed method comprises a Multi-task Collaborative Embedding module and a Diffusion-based Invariant Knowledge Learning module. The former module facilitates collaborative embedding for multiple tasks, while the latter module distills the invariant knowledge from seen tasks to generalize toward unseen machine vision tasks. Experiments and visualizations show that the proposed method effectively extracts compact and versatile embeddings for human and machine vision collaborative compression tasks, resulting in superior performance. Specifically, our method outperforms the state-of-the-art by 52.25%/51.68%/48.87%/48.07%/6.29% BD-rate reduction in terms of mAP/mAP/aAcc/PQ-all/accuracy on the MS-COCO for instance detection/instance segmentation/stuff segmentation/panoptic segmentation and video question answering tasks, respectively.


# 144
FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion

Xiaofeng Wu · Velibor Bojkovic · Bin Gu · Kun Suo · Kai Zou

Spiking Neural Networks (SNNs) offer a promising avenue for energy-efficient computing compared with Artificial Neural Networks (ANNs), closely mirroring biological neural processes. However, this potential comes with inherent challenges in directly training SNNs through spatio-temporal backpropagation --- stemming from the temporal dynamics of spiking neurons and their discrete signal processing --- which necessitates alternative ways of training, most notably through ANN-SNN conversion. In this work, we introduce a lightweight Forward Temporal Bias Correction (FTBC) technique, aimed at enhancing conversion accuracy without additional computational overhead. We ground our method in theoretical findings showing that, through proper temporal bias calibration, the expected error of ANN-SNN conversion can be reduced to zero after each time step. We further propose a heuristic algorithm for finding the temporal bias only in the forward pass, thus eliminating the computational burden of backpropagation. We evaluate our method on the CIFAR-10/100 and ImageNet datasets, achieving a notable increase in accuracy on all of them. Code is released at a GitHub repository.
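
A minimal sketch of forward-only bias calibration is shown below, assuming access to recorded ANN activations and accumulated SNN firing rates for one converted layer on a calibration set. The authors' heuristic search is more elaborate; array names and shapes here are illustrative.

```python
import numpy as np

# Minimal sketch (not the authors' algorithm) of forward-only temporal bias
# correction: at each time step, shift the converted layer's bias by the average
# gap between the ANN activation and the accumulated SNN firing rate so that
# the expected outputs match. Arrays below stand in for recorded activations.
def temporal_bias_correction(ann_act, snn_rate_per_t):
    """ann_act: [N, C] ANN activations on calibration data.
       snn_rate_per_t: [T, N, C] SNN firing rates accumulated up to each step t."""
    T = snn_rate_per_t.shape[0]
    # per-channel correction added to the layer bias at each time step
    return np.stack([(ann_act - snn_rate_per_t[t]).mean(axis=0) for t in range(T)])

ann_act = np.random.rand(256, 64)
snn_rate = np.random.rand(4, 256, 64)
bias_per_step = temporal_bias_correction(ann_act, snn_rate)   # [T, C]
```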


# 3
Strong Double Blind
Quantization-Friendly Winograd Transformations for Convolutional Neural Networks

Vladimir Protsenko · Vladimir Kryzhanovskiy · Alexander Filippov

Efficient deployment of modern deep convolutional neural networks on resource-constrained devices suffers from the demanding computational requirements of convolution operations. Quantization and use of Winograd convolutions operating on sufficiently large-tile inputs are two powerful strategies to speed up convolution operations. However, their combination results in numerical instability, which manifests itself as a severe accuracy degradation. We present an efficient learning scenario that either completely overcomes or strongly reduces the accuracy degradation of full 8-bit quantized F(4, 3) and F(6, 3) Winograd convolutions. Using global particle swarm optimization (PSO), we derived a set of quantization-friendly Winograd transformations. Following the state-of-the-art (SOTA) training pipeline, we treat Winograd transformations as learnable parameters during network training. Evolving transformations starting from our PSO-derived ones rather than the standard Winograd transformations results in significant numerical error reduction and accuracy improvement. As a consequence, our approach significantly outperforms SOTA methods on various tasks.


# 5
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Chien-Yao Wang · I-Hau Yeh · Hong-Yuan Mark Liao

Today's deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. This paper delves into the important issues of data loss when data is transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture, the Generalized Efficient Layer Aggregation Network (GELAN), based on gradient path planning, is designed. GELAN's architecture confirms that PGI has gained superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO object detection. The results show that GELAN, using only conventional convolution operators, achieves better parameter utilization than state-of-the-art methods developed based on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large. It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained using large datasets.


# 92
Strong Double Blind
Stripe Observation Guided Inference Cost-free Attention Mechanism

Zhongzhan Huang · Shanshan Zhong · Wushao Wen · Jinghui Qin · Liang Lin

Structural re-parameterization (SRP) is a family of techniques that boosts neural networks without introducing any computational cost in the inference stage. The existing SRP methods have successfully considered many architectures, such as normalizations, convolutions, etc. However, the widely used but computationally expensive attention modules cannot be directly handled by SRP, due to their inherently multiplicative manner and their input-dependent outputs during inference. In this paper, we statistically discover a counter-intuitive phenomenon, termed the Stripe Observation, in various settings, which reveals that channel attention values consistently approach some constant vectors during training. This inspires us to propose a novel attention-alike SRP, called ASR, that allows us to achieve SRP for a given network while enjoying the effectiveness of the attention mechanism. Extensive experiments conducted on several standard benchmarks show the effectiveness of ASR in generally improving performance across various scenarios without any elaborate model crafting. We also provide experimental and theoretical evidence for how the proposed ASR can enhance model performance.
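
The re-parameterization that the Stripe Observation enables can be sketched directly: once a channel-attention vector has (approximately) converged to a constant, it can be folded into the preceding convolution so that inference pays no extra cost. The helper below is illustrative, not the ASR module itself, and ignores grouped or dilated convolutions.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' ASR module): fold a constant channel-attention
# vector into the preceding convolution so the attention costs nothing at test time.
@torch.no_grad()
def fold_constant_attention(conv: nn.Conv2d, attn_const: torch.Tensor) -> nn.Conv2d:
    """attn_const: [out_channels] constant attention scale per output channel."""
    folded = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding, bias=conv.bias is not None)
    folded.weight.copy_(conv.weight * attn_const.view(-1, 1, 1, 1))
    if conv.bias is not None:
        folded.bias.copy_(conv.bias * attn_const)
    return folded

conv = nn.Conv2d(16, 32, 3, padding=1)
attn = torch.rand(32)                      # the learned, now-constant attention vector
x = torch.randn(1, 16, 8, 8)
# folding is equivalent to applying the constant attention after the convolution
assert torch.allclose(fold_constant_attention(conv, attn)(x),
                      conv(x) * attn.view(1, -1, 1, 1), atol=1e-5)
```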


# 44
NOVUM: Neural Object Volumes for Robust Object Classification

Artur Jesslen · Guofeng Zhang · Angtian Wang · Wufei Ma · Alan Yuille · Adam Kortylewski

Discriminative models for object classification typically learn image-based representations that do not capture the compositional and 3D nature of objects. In this work, we show that explicitly integrating 3D compositional object representations into deep networks for image classification leads to a largely enhanced generalization in out-of-distribution scenarios. In particular, we introduce a novel architecture, referred to as NOVUM, that consists of a feature extractor and a neural object volume for every target object class. Each neural object volume is a composition of 3D Gaussians that emit feature vectors. This compositional object representation allows for a highly robust and fast estimation of the object class by independently matching the features of the 3D Gaussians of each category to features extracted from an input image. Additionally, the object pose can be estimated via inverse rendering of the corresponding neural object volume. To enable the classification of objects, the neural features at each 3D Gaussian are trained discriminatively to be distinct from (i) the features of 3D Gaussians in other categories, (ii) features of other 3D Gaussians of the same object, and (iii) the background features. Our experiments show that NOVUM offers intriguing advantages over standard architectures due to the 3D compositional structure of the object representation, namely: (1) An exceptional robustness across a spectrum of real-world and synthetic out-of-distribution shifts and (2) an enhanced human interpretability compared to standard models, all while maintaining real-time inference and a competitive accuracy on in-distribution data.


# 24
Strong Double Blind
POA: Pre-training Once for Models of All Sizes

Yingying Zhang · Xin Guo · Jiangwei Lao · Lei Yu · Lixiang Ru · Jian Wang · Guo Ye · HUIMEI HE · Jingdong Chen · Ming Yang

Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a certain size at one time. Nevertheless, various computation or storage constraints in real-world scenarios require substantial efforts to develop a series of models with different sizes to deploy. Thus, in this study, we propose a novel tri-branch self-supervised training framework, termed POA (Pre-training Once for All), to tackle the aforementioned issue. Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes for downstream tasks. Remarkably, the elastic student facilitates the simultaneous pre-training of multiple models with different sizes, which also acts as an additional ensemble of models of various sizes to enhance representation learning. Extensive experiments, including k-nearest neighbors, linear probing evaluation and assessments on downstream tasks demonstrate the effectiveness and advantages of our POA. It achieves state-of-the-art performance using ViT, Swin Transformer and ResNet backbones, producing around a hundred models with different sizes through a single pre-training session. We will release the code and pre-trained models of POA to share with the community in the future.


# 19
Strong Double Blind
Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit Networks

Cheng Gong · Yao Chen · Qiuyang Luo · Ye Lu · Tao Li · Yuzhi Zhang · Yufei Sun · Le Zhang

The multi-exit network is a promising architecture for efficient model inference by sharing backbone networks and weights among multiple exits. However, the gradient conflict of the shared weights results in sub-optimal accuracy. This paper introduces Deep Feature Surgery (DFS), which consists of feature partitioning and feature referencing approaches to resolve gradient conflict issues during the training of multi-exit networks. The feature partitioning separates shared features along the depth axis among all exits to alleviate gradient conflict while simultaneously promoting joint optimization for each exit. Subsequently, feature referencing enhances multi-scale features for distinct exits across varying depths to improve the model accuracy. Furthermore, DFS reduces the training operations with the reduced complexity of backpropagation. Experimental results on the Cifar100 and ImageNet datasets show that DFS provides up to a 50.00% reduction in training time and attains up to a 6.94% enhancement in accuracy when contrasted with baseline methods across diverse models and tasks. Budgeted batch classification evaluation on MSDNet demonstrates that DFS uses about 2x fewer average FLOPs per image to achieve the same classification accuracy as baseline methods on Cifar100.


# 20
Strong Double Blind
Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization

Jiajun Hu · Jian Zhang · Lei Qi · Yinghuan Shi · Yang Gao

Domain generalization (DG) aims to avoid the performance degradation of the model when the distribution shift between the limited training data and unseen test data occurs. Recently, foundation models with enormous parameters have been pre-trained with huge datasets, demonstrating strong generalization ability and showing promising direction for solving the DG problem. However, fully Fine-Tuning (FT) the foundation models results in unsatisfactory out-of-distribution accuracy due to the destroyed pre-trained generalized features. Recently, Parameter-Efficient Fine-Tuning (PEFT) alleviates the above problem by fine-tuning a small portion of the model parameters while keeping the rest frozen, which achieves better generalization performance compared to FT. Nevertheless, PEFT still suffers from the issue of overfitting to the training domains. To address the above issue, we propose Parameter-Efficient Group with Orthogonal regularization (PEGO) for vision transformers, which effectively preserves the generalization ability of the pre-trained network and learns more diverse knowledge compared with conventional PEFT. Specifically, we inject a group of trainable Low-Rank Adaptation (LoRA) modules into the pre-trained model and propose an orthogonal regularization loss to enhance the generalization ability of the model. Our framework achieves SOTA performance on five DG benchmarks, while only requiring training a small number of parameters without adding additional testing cost.
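
A minimal sketch of a group of LoRA modules with a pairwise orthogonality penalty follows. The exact regularizer, ranks, number of modules, and injection points are assumptions and may differ from the paper's specification; the sketch only illustrates how such a penalty could encourage the modules to learn non-overlapping directions.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' exact loss): a group of LoRA modules attached
# to one frozen weight, with a pairwise orthogonality penalty between their
# down-projection matrices to encourage diverse, non-overlapping updates.
class LoRAGroup(nn.Module):
    def __init__(self, dim=768, rank=4, n_modules=4):
        super().__init__()
        self.down = nn.ParameterList(nn.Parameter(torch.randn(rank, dim) * 0.01)
                                     for _ in range(n_modules))
        self.up = nn.ParameterList(nn.Parameter(torch.zeros(dim, rank))
                                   for _ in range(n_modules))

    def delta(self, x):                       # summed low-rank update, x: [N, dim]
        return sum(x @ a.t() @ b.t() for a, b in zip(self.down, self.up))

    def orthogonality_loss(self):
        loss = 0.0
        for i in range(len(self.down)):
            for j in range(i + 1, len(self.down)):
                # cross-Gram between two modules' row spaces should be ~0
                loss = loss + (self.down[i] @ self.down[j].t()).pow(2).sum()
        return loss

group = LoRAGroup()
x = torch.randn(2, 768)
update = group.delta(x)                       # added to the frozen layer's output
reg = 0.1 * group.orthogonality_loss()        # added to the task loss during fine-tuning
```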


# 12
Strong Double Blind
Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning

Wenke Huang · Mang Ye · zekun shi · Bo Du · Dacheng Tao

Federated learning presents massive potential for privacy-friendly vision task collaboration. However, federated visual performance is deeply affected by backdoor attacks, where malicious clients optimize on triggered samples to mislead the global model into targeted mispredictions. Existing backdoor defensive solutions are normally based on two assumptions: data homogeneity and a minority malicious ratio for the elaborate client-wise defensive rules. To address existing limitations, we argue that heterogeneous clients and backdoor attackers both bring divergent optimization directions and thus it is hard to discriminate them precisely. In this paper, we argue that parameters carry different degrees of importance for distinct distributions, and we instead consider which parameters are meaningful or meaningless for the ideal target distribution. We propose the Self-Driven Fisher Calibration (SDFC), which utilizes Fisher Information to calculate the parameter importance degree for the local agnostic distribution and the global validation distribution, and regulates those elements with large importance differences. Furthermore, we allocate high aggregation weights to clients with relatively small overall parameter differences, which encourages clients whose local distribution is close to the global distribution to contribute more to the federation. This enables SDFC to handle backdoor attackers in heterogeneous federated learning. Performance on various vision tasks demonstrates the effectiveness of SDFC.
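
The Fisher-based importance computation at the heart of SDFC can be sketched generically as a diagonal Fisher estimate built from squared log-likelihood gradients. How the local and global scores are then compared and regulated is the paper's contribution and is not shown here; the model and data below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (not the authors' calibration rule): estimate a diagonal Fisher
# information score per parameter by averaging squared gradients of the
# log-likelihood over a batch; large scores mark parameters important to that
# distribution.
def diagonal_fisher(model, inputs, labels):
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in zip(inputs, labels):
        model.zero_grad()
        logp = F.log_softmax(model(x.unsqueeze(0)), dim=1)[0, y]
        logp.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(inputs) for n, f in fisher.items()}

model = nn.Linear(10, 3)                                  # placeholder model
scores = diagonal_fisher(model, torch.randn(8, 10), torch.randint(0, 3, (8,)))
```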


# 83
MultiDelete for Multimodal Machine Unlearning

Jiali Cheng · Hadi Amiri

Machine Unlearning removes specific knowledge about training data samples and their corresponding effects from an already trained model. It has significant practical benefits, such as purging private, inaccurate, or outdated information from trained models without the need for complete re-training. Unlearning within a multimodal setting presents unique challenges due to the intrinsic dependencies between different data modalities and the expensive cost of training on large multimodal datasets and architectures. This paper presents the first machine unlearning approach for multimodal data and models, titled MultiDelete, which is designed to decouple associations between unimodal data points during unlearning without losing the overall representation strength of the trained model. MultiDelete advocates for three key properties for effective multimodal unlearning: (a) modality decoupling, which effectively decouples the association between individual unimodal data points marked for deletion, rendering them as unrelated data points, (b) multimodal knowledge retention, which retains the multimodal representation capability of the model post-unlearning, and (c) unimodal knowledge retention, which retains the unimodal representation capability of the model post-unlearning. MultiDelete is efficient to train and is not constrained by the use of a strongly convex loss, a common restriction among many existing baselines. Experiments on two multimodal architectures and four datasets, including image-text and graph-text datasets, show that MultiDelete gains an average improvement of 17.6 points over the best-performing baseline in unlearning multimodal training samples, can maintain the multimodal and unimodal knowledge of the original model post-unlearning, can provide better protection to unlearned data, and is robust against adversarial attacks.


# 69
Strong Double Blind
Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing

Ioannis Maniadis Metaxas · Georgios Tzimiropoulos · ioannis Patras

Self-supervised learning has recently emerged as the preeminent pretraining paradigm across and between modalities, with remarkable results. In the image domain specifically, group (or cluster) discrimination has been one of the most successful methods. However, such frameworks need to guard against heavily imbalanced cluster assignments to prevent collapse to trivial solutions. Existing works typically solve this by reweighing cluster assignments to promote balance, or with offline operations (e.g. regular re-clustering) that prevent collapse. However, the former typically requires large batch sizes, which leads to increased resource requirements, and the latter introduces scalability issues with regard to large datasets. In this work, we propose ExCB, a framework that tackles this problem with a novel cluster balancing method. ExCB estimates the relative size of the clusters across batches and balances them by adjusting cluster assignments, proportionately to their relative size and in an online manner. Thereby, it overcomes previous methods' dependence on large batch sizes and is fully online, and therefore scalable to any dataset. We conduct extensive experiments to evaluate our approach and demonstrate that ExCB: a) achieves state of the art results with significantly reduced resource requirements compared to previous works, b) is fully online, and therefore scalable to large datasets, and c) is stable and effective even with very small batch sizes.
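
One way to picture the online balancing is the toy class below: an exponential moving average of cluster sizes biases the assignment scores against over-populated clusters, so the correction works one batch at a time without offline re-clustering. The log-penalty form and momentum are assumptions; the paper's adjustment rule may differ.

```python
import torch

# Minimal sketch (not the authors' rule): keep a running estimate of each cluster's
# relative size and bias the assignment scores against over-populated clusters.
class OnlineClusterBalancer:
    def __init__(self, n_clusters, momentum=0.99):
        self.size_est = torch.full((n_clusters,), 1.0 / n_clusters)  # start uniform
        self.momentum = momentum

    def assign(self, sims):                          # sims: [N, K] feature-prototype similarities
        adjusted = sims - self.size_est.log()        # penalize clusters estimated as too large
        assign = adjusted.argmax(dim=1)
        batch_frac = torch.bincount(assign, minlength=sims.shape[1]).float() / len(assign)
        self.size_est = self.momentum * self.size_est + (1 - self.momentum) * batch_frac
        return assign

balancer = OnlineClusterBalancer(n_clusters=16)
assignments = balancer.assign(torch.randn(64, 16))
```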


# 81
Strong Double Blind
Multi-Label Cluster Discrimination for Visual Representation Learning

Xiang An · Kaicheng Yang · Xiangzi Dai · Ziyong Feng · Jiankang Deng

Contrastive Language Image Pre-training (CLIP) has recently demonstrated success across various tasks due to superior feature representation empowered by image-text contrastive learning. However, the instance discrimination method used by CLIP can hardly encode the semantic structure of training data. To handle this limitation, cluster discrimination has been proposed through iterative cluster assignment and classification. Nevertheless, most cluster discrimination approaches only define a single pseudo-label for each image, neglecting multi-label signals in the image. In this paper, we propose a novel Multi-Label Cluster Discrimination method named MLCD to enhance representation learning. In the clustering step, we first cluster the large-scale LAION-400M dataset into one million centers based on off-the-shelf embedding features. Considering that natural images frequently contain multiple visual objects or attributes, we select the multiple closest centers as auxiliary class labels. In the discrimination step, we design a novel multi-label classification loss, which elegantly separates losses from positive classes and negative classes, and alleviates ambiguity on decision boundary. We validate the proposed multi-label cluster discrimination method with experiments on different scales of models and pre-training datasets. Experimental results show that our method achieves state-of-the-art performance on multiple downstream tasks including linear probe, zero-shot classification, and image-text retrieval.
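
The multi-label pseudo-labeling step can be sketched by taking the k closest cluster centers of each image as a multi-hot target. The BCE loss below is only a stand-in: the paper's loss separates positive- and negative-class terms more carefully, and the temperature, k, and center count are arbitrary.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the authors' loss): assign each image its k nearest cluster
# centers as multi-label pseudo-labels, then train with a multi-label objective.
def multilabel_cluster_targets(feats, centers, k=3):
    sims = F.normalize(feats, dim=1) @ F.normalize(centers, dim=1).t()  # [N, C] cosine sims
    topk = sims.topk(k, dim=1).indices
    targets = torch.zeros_like(sims).scatter_(1, topk, 1.0)             # multi-hot labels
    return sims, targets

feats, centers = torch.randn(32, 128), torch.randn(1000, 128)
logits, targets = multilabel_cluster_targets(feats, centers)
loss = F.binary_cross_entropy_with_logits(logits * 10, targets)         # 10 = temperature stand-in
```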


# 71
Strong Double Blind
Robustness Preserving Fine-tuning using Neuron Importance

Guangrui Li · Rahul Duggal · Aaditya Singh · Kaustav Kundu · Bing Shuai · Jonathan Wu

Robust fine-tuning aims to adapt a vision-language model to downstream tasks while preserving its zero-shot capabilities on unseen data. Recent studies have introduced fine-tuning strategies to improve in-distribution (ID) performance on the downstream tasks while minimizing deterioration in out-of-distribution (OOD) performance on unseen data. This balance is achieved either by aligning the fine-tuned representations with the pre-trained ones or by constraining significant deviations in fine-tuned weights compared to the pre-trained model. In the latter approach, the regularization term is uniformly applied to all parameters. Our work proposes to selectively apply the regularization term based on the "importance" of each neuron to the fine-tuning dataset. To this end, we develop an importance-score metric to quantify each neuron's importance to the downstream task and then leverage this to develop two fine-tuning strategies: importance-guided selective fine-tuning and importance-guided regularization. Our approach can be used concurrently with representation space-based methods, outperforming other approaches based on parameter space. We improve the state-of-the-art on standard robust fine-tuning benchmarks across datasets in both the full-shot and low-shot settings.
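
A minimal sketch of an importance-weighted weight-space regularizer follows. The direction chosen here (pull parameters that are unimportant for the downstream task toward the pre-trained weights) is one plausible reading, not necessarily the paper's weighting, and the importance scores are placeholders.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' formulation): weight a weight-space regularizer
# by a per-parameter importance estimate, so that parameters unimportant for the
# downstream task stay close to the pre-trained model while important ones adapt.
def importance_guided_penalty(model, pretrained_state, importance):
    penalty = 0.0
    for name, p in model.named_parameters():
        w = 1.0 - importance[name]                    # stronger pull on unimportant parameters
        penalty = penalty + (w * (p - pretrained_state[name]) ** 2).sum()
    return penalty

model = nn.Linear(512, 10)                            # placeholder for the fine-tuned head
pretrained = {n: p.detach().clone() for n, p in model.named_parameters()}
importance = {n: torch.rand_like(p) for n, p in model.named_parameters()}  # placeholder scores in [0, 1]
reg = importance_guided_penalty(model, pretrained, importance)             # added to the fine-tuning loss
```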


# 82
Strong Double Blind
Online Zero-Shot Classification with CLIP

Qi Qian · JUHUA HU

Vision-language pre-training such as CLIP enables zero-shot transfer that can classify images according to the candidate class names. While CLIP demonstrates an impressive zero-shot performance on diverse downstream tasks, the distribution of the target data has not been leveraged sufficiently. In this work, we study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain a prediction immediately without storing its representation. Compared with the vanilla zero-shot classification, the proposed problem preserves its flexibility for online service but considers the statistics of the arriving images as side information to better capture the distribution of the target data. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space can be further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predicted label from the online label learning and proxy learning, our online zero-shot transfer method (OnZeta) can trade between bias and variance in predictions, which helps achieve $78.94\%$ accuracy on ImageNet without accessing the entire data set. Moreover, extensive experiments on 13 other downstream tasks with different vision encoders show a more than $3\%$ improvement on average, which demonstrates the effectiveness of our proposal.
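A rough sketch of the flavor of online proxy learning, assuming each class keeps a vision-space proxy initialized from the CLIP text embedding and is nudged toward each arriving image in proportion to its predicted responsibility; the class name, learning rate, temperature, and update rule are illustrative assumptions and not OnZeta's actual algorithm.

```python
import torch
import torch.nn.functional as F

class OnlineProxies:
    """Maintain one vision-space proxy per class, initialized from CLIP text
    embeddings, and refine it online as images arrive one at a time.
    (Illustrative sketch, not OnZeta's exact update or its theoretical guarantees.)"""

    def __init__(self, text_embeds: torch.Tensor, lr: float = 0.05, temp: float = 0.01):
        self.proxies = F.normalize(text_embeds.clone(), dim=1)   # (C, D)
        self.lr, self.temp = lr, temp

    def predict_and_update(self, image_embed: torch.Tensor) -> int:
        image_embed = F.normalize(image_embed, dim=0)            # (D,)
        probs = (self.proxies @ image_embed / self.temp).softmax(dim=0)
        pred = int(probs.argmax())
        # Soft online update: move every proxy toward the image in proportion
        # to its predicted responsibility, then re-normalize.
        self.proxies = F.normalize(
            self.proxies + self.lr * probs.unsqueeze(1) * image_embed.unsqueeze(0), dim=1)
        return pred
```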


# 96
Strong Double Blind
Understanding Multi-compositional learning in Vision and Language models via Category Theory

Sotirios Panagiotis Takis Chytas · Hyunwoo J. Kim · Vikas Singh

The widespread use of pre-trained large language models (and multi-modal models) has led to powerful performance across a wide range of tasks. Despite their effectiveness, we have limited knowledge of their internal knowledge representation. Motivated by and using the classic problem of Compositional Zero-Shot Learning (CZSL) as an example, we provide a structured view of the latent space that any general model (LLM or otherwise) should nominally respect. Based on this view, we first provide a practical solution to the CZSL problem based on a cross-attention mechanism that can deal with both Open and Closed-World single-attribute compositions as well as multi-attribute compositions with relative ease. In all three tasks, our approach yields performance competitive with methods designed solely for that task (i.e., adaptations to other tasks are difficult). Then, we extend this perspective to existing LLMs and ask to what extent they satisfy our axiomatic definitions. Our analysis shows a mix of interesting and unsurprising findings, but nonetheless suggests that our criteria are meaningful and may yield a more structured approach to training such models, strategies for additional data collection, and diagnostics beyond visual inspection.


# 98
Strong Double Blind
This Probably Looks Exactly Like That: An Invertible Prototypical Network

Zachariah Carmichael · Timothy Redgrave · Daniel Gonzalez Cedre · Walter Scheirer

We combine concept-based neural networks with generative, flow-based classifiers into a novel, intrinsically explainable, exactly invertible approach to supervised learning. Prototypical neural networks, a type of concept-based neural network, represent an exciting way forward in realizing human-comprehensible machine learning without concept annotations, but a human-machine semantic gap continues to haunt current approaches. We find that reliance on indirect interpretation functions for prototypical explanations imposes a severe limit on prototypes' informative power. From this, we posit that invertibly learning prototypes as distributions over the latent space provides more robust, expressive, and interpretable modeling. We propose one such model, called ProtoFlow, by composing a normalizing flow with Gaussian mixture models. ProtoFlow (1) sets a new state-of-the-art in joint generative and predictive modeling and (2) achieves predictive performance comparable to existing prototypical neural networks while enabling richer interpretation.
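A compact sketch of the flow-plus-GMM idea described above: map inputs through an invertible network and classify by class-conditional Gaussian log-likelihood in latent space plus the log-determinant term. The tiny elementwise affine flow and unit-covariance prototypes below are placeholders for illustration, not ProtoFlow's architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal

class TinyAffineFlow(nn.Module):
    """Placeholder invertible map z = x * exp(log_s) + b (elementwise);
    a real model would use coupling layers. log|det J| = sum(log_s)."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(dim))
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        z = x * self.log_s.exp() + self.b
        log_det = self.log_s.sum().expand(x.shape[0])
        return z, log_det

class FlowGMMClassifier(nn.Module):
    """One Gaussian prototype per class in latent space; classify by the
    class-conditional log-likelihood log p(z|c) + log|det J|.
    (Sketch of the flow+GMM idea, not ProtoFlow's exact model.)"""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.flow = TinyAffineFlow(dim)
        self.means = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, x):
        z, log_det = self.flow(x)
        dists = [MultivariateNormal(m, torch.eye(z.shape[1])) for m in self.means]
        log_px_c = torch.stack([d.log_prob(z) for d in dists], dim=1) + log_det.unsqueeze(1)
        return log_px_c   # (B, C): argmax to classify, NLL of the true class to train
```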


# 41
Strong Double Blind
Rethinking Unsupervised Outlier Detection via Multiple Thresholding

Zhonghang Liu · Panzhong Lu · Guoyang Xie · Zhichao Lu · Wen-Yan Lin

In the realm of current unsupervised image outlier detection, assigning outlier scores holds greater significance than the subsequent task of thresholding the scores to predict labels. This is because determining the optimal threshold on non-separable outlier score functions is an ill-posed problem. However, the lack of predicted labels not only hinders some real applications of current outlier detectors but also prevents these methods from being enhanced by leveraging the dataset's self-supervision. To advance existing scoring methods, we propose a multiple thresholding (Multi-T) module. It generates two thresholds that isolate inliers and outliers from the unlabelled target dataset: the outliers are employed to obtain better feature representations, while the inliers provide an uncontaminated manifold. Extensive experiments verify that Multi-T can significantly improve existing outlier scoring methods. Moreover, Multi-T elevates a naive distance-based method to state-of-the-art performance.
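A minimal sketch of the two-threshold idea, using score quantiles as placeholder thresholds; Multi-T derives its thresholds differently, so the quantile values and function name here are purely illustrative.

```python
import numpy as np

def two_threshold_split(scores: np.ndarray, inlier_q: float = 0.3, outlier_q: float = 0.9):
    """Split an unlabeled set into confident inliers, confident outliers, and an
    ambiguous middle band using two thresholds on the outlier scores.
    Quantile thresholds stand in for Multi-T's actual estimates."""
    t_in, t_out = np.quantile(scores, [inlier_q, outlier_q])
    inlier_idx = np.where(scores <= t_in)[0]     # uncontaminated manifold for representation learning
    outlier_idx = np.where(scores >= t_out)[0]   # used as (pseudo-)negatives
    return inlier_idx, outlier_idx, (t_in, t_out)

# inliers, outliers, (t_lo, t_hi) = two_threshold_split(outlier_scores)
```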


# 15
Strong Double Blind
Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection

Lars Doorenbos · Raphael Sznitman · Pablo Márquez Neila

The inability of deep learning models to handle data drawn from unseen distributions has sparked much interest in unsupervised out-of-distribution (U-OOD) detection, as it is crucial for reliable deep learning models. Despite considerable attention, theoretically-motivated approaches are few and far between, with most methods building on top of some form of heuristic. Recently, U-OOD was formalized in the context of data invariants, allowing a clearer understanding of how to characterize U-OOD, and methods leveraging affine invariants have attained state-of-the-art results on large-scale benchmarks. Nevertheless, the restriction to affine invariants hinders the expressiveness of the approach. In this work, we broaden the affine invariants formulation to a more general case and propose a framework consisting of a normalizing flow-like architecture capable of learning non-linear invariants. Our novel approach achieves state-of-the-art results on an extensive U-OOD benchmark, and we demonstrate its further applicability to tabular data. Finally, we show our method has the same desirable properties as those based on affine invariants.


# 76
Strong Double Blind
Multimodal Label Relevance Ranking via Reinforcement Learning

Taian Guo · Taolin Zhang · Haoqian Wu · Hanjun Li · Ruizhi Qiao · Xing Sun

Conventional multi-label recognition methods often focus on label confidence, frequently overlooking the pivotal role of partial order relations consistent with human preference. To resolve these issues, we introduce a novel method for multimodal label relevance ranking, named Label Relevance Ranking with Proximal Policy Optimization (LR\textsuperscript{2}PPO), which effectively discerns partial order relations among labels. LR\textsuperscript{2}PPO first utilizes partial order pairs in the target domain to train a reward model, which aims to capture human preference intrinsic to the specific scenario. Furthermore, we meticulously design state representation and a policy loss tailored for ranking tasks, enabling LR\textsuperscript{2}PPO to boost the performance of the label relevance ranking model and substantially reduce the partial order annotation required for transferring to new scenes. To assist in the evaluation of our approach and similar methods, we further propose a novel benchmark dataset, LRMovieNet, featuring multimodal labels and their corresponding partial order data. Extensive experiments demonstrate that our LR\textsuperscript{2}PPO algorithm achieves state-of-the-art performance, proving its effectiveness in addressing the multimodal label relevance ranking problem. Codes and the proposed LRMovieNet dataset are publicly available at \url{https://github.com/ChazzyGordon/LR2PPO}.
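For the reward-model step on partial order pairs, a standard Bradley-Terry pairwise loss is one natural choice, sketched below; the head architecture, feature dimension, and names are hypothetical and only illustrate how partial-order supervision can train a scalar reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores the relevance of one label given multimodal context features.
    (Hypothetical architecture for illustration.)"""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats).squeeze(-1)

def pairwise_reward_loss(model: RewardModel, preferred: torch.Tensor,
                         dispreferred: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss on partial order pairs: the label ranked higher
    by annotators should receive the higher reward."""
    return -F.logsigmoid(model(preferred) - model(dispreferred)).mean()
```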


# 30
Confidence Self-Calibration for Multi-Label Class-Incremental Learning

Kaile Du · Yifan Zhou · Fan Lyu · Yuyang Li · Chen Lu · Guangcan Liu

The partial label challenge in Multi-Label Class-Incremental Learning (MLCIL) arises when only the new classes are labeled during training, while past and future labels remain unavailable. This issue leads to a proliferation of false-positive errors due to erroneously high-confidence multi-label predictions, exacerbating catastrophic forgetting within the disjoint label space. In this paper, we aim to refine multi-label confidence calibration in MLCIL and propose a Confidence Self-Calibration (CSC) approach. Firstly, for label relationship calibration, we introduce a class-incremental graph convolutional network that bridges the isolated label spaces by constructing a learnable, dynamically extended label relationship graph. Then, for confidence calibration, we present a max-entropy regularization for each multi-label increment, facilitating confidence self-calibration through the penalization of over-confident output distributions. Our approach attains new state-of-the-art results in MLCIL tasks on both MS-COCO and PASCAL VOC datasets, with the calibration of label confidences confirmed through our methodology.
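A small sketch of a max-entropy regularizer on multi-label sigmoid outputs, combined with a standard BCE term; the weighting coefficient and the exact way it is applied per increment are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def max_entropy_regularizer(logits: torch.Tensor) -> torch.Tensor:
    """Binary entropy of each sigmoid output, averaged over labels and batch.
    Maximizing it (i.e. subtracting it from the loss) discourages saturated,
    over-confident multi-label predictions."""
    p = torch.sigmoid(logits)
    entropy = -(p * torch.log(p + 1e-8) + (1 - p) * torch.log(1 - p + 1e-8))
    return entropy.mean()

def csc_style_loss(logits: torch.Tensor, targets: torch.Tensor, lam: float = 0.1):
    """Illustrative combination: multi-label BCE minus an entropy bonus,
    i.e. a penalty on over-confident output distributions."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    return bce - lam * max_entropy_regularizer(logits)
```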


# 25
Strong Double Blind
MTaDCS: Moving Trace and Feature Density-based Confidence Sample Selection under Label Noise

Qingzheng Huang · Xilin He · Xiaole Xian · Qinliang Lin · Weicheng Xie · Siyang Song · Linlin Shen · Zitong Yu

Learning from noisy data is a challenging task, as noisy labels can compromise decision boundaries and result in suboptimal generalization performance. Most previous approaches for dealing with noisy data are based on sample selection, which utilizes the small-loss criterion to reduce the adverse effects of noisy labels. Nevertheless, they encounter a critical limitation in being unable to effectively separate challenging samples from those that were merely mislabeled. To this end, we propose a novel moving trace and feature density-based confidence sample selection strategy (called MTaDCS). Different from existing small-loss-based approaches, the local feature density of samples in the latent space is explored to construct a confidence set by progressively selecting confident samples according to their moving traces. Therefore, our MTaDCS can gradually isolate noisy labels through the constructed confidence set and achieve the goal of learning discriminative features from hard samples. Extensive experiments conducted on datasets with simulated and real-world noise validate that the proposed MTaDCS outperforms the state-of-the-art methods in terms of various metrics. We will make our code publicly available.


# 27
Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation

ChenChen Zong · Ye-Wen Wang · Kun-Peng Ning · Hai-Bo Ye · Sheng-Jun Huang

Active learning (AL) in open set scenarios presents a novel challenge of identifying the most valuable examples in an unlabeled data pool that comprises data from both known and unknown classes. Traditional methods prioritize selecting informative examples with low confidence, with the risk of mistakenly selecting unknown-class examples with similarly low confidence. Recent methods favor the most probable known-class examples, with the risk of picking simple, already-mastered examples. In this paper, we attempt to query examples that are both likely from known classes and highly informative, and propose a \textit{Bidirectional Uncertainty-based Active Learning} (BUAL) framework. Specifically, we achieve this by first pushing the unknown class examples toward regions with high-confidence predictions with our proposed \textit{Random Label Negative Learning} method. Then, we propose a \textit{Bidirectional Uncertainty sampling} strategy by jointly estimating uncertainty posed by both positive and negative learning to perform consistent and stable sampling. BUAL successfully extends existing uncertainty-based AL methods to complex open-set scenarios. Extensive experiments on multiple datasets with varying openness demonstrate that BUAL achieves state-of-the-art performance.


# 32
Strong Double Blind
Online Continuous Generalized Category Discovery

Keon-Hee Park · Hakyung Lee · Kyungwoo Song · Gyeong-Moon Park

With the advancement of deep neural networks in computer vision, artificial intelligence (AI) is widely employed in real-world applications. However, AI still faces limitations in mimicking high-level human capabilities, such as novel category discovery, for practical use. While some methods utilizing offline continual learning have been proposed for novel category discovery, they neglect the continuity of data streams in real-world settings. In this work, we introduce Online Continuous Generalized Category Discovery (OCGCD), which considers the dynamic nature of data streams where data can be created and deleted in real time. Additionally, we propose a novel method, DEAN, Discovery via Energy guidance and feature AugmentatioN, which can discover novel categories in an online manner through energy-guided discovery and facilitate discriminative learning via an energy-based contrastive loss. Furthermore, DEAN effectively pseudo-labels unlabeled data through variational feature augmentation. Experimental results demonstrate that our proposed DEAN achieves outstanding performance in the proposed OCGCD scenario. The code will be publicly released after the review.


# 26
Strong Double Blind
Open-set Domain Adaptation via Joint Error based Multi-class Positive and Unlabeled Learning

Dexuan Zhang · Thomas Westfechtel · Tatsuya Harada

Open-set domain adaptation aims to improve the generalization performance of a learning algorithm on a more realistic problem of open-set domain shift where the target data contains an additional unknown class that is not present in the source data. Most existing algorithms include two phases that can be described as closed-set domain adaptation given heuristic unknown class separation. Therefore, the generalization error cannot be strictly bounded due to the gap between the true distribution and samples inferred from heuristics. In this paper, we propose an end-to-end algorithm that tightly bounds the risk of the entire target task using positive-unlabeled (PU) learning theory and the joint error from domain adaptation. Extensive experiments on various datasets demonstrate the effectiveness and efficiency of our proposed algorithm over open-set domain adaptation baselines.


# 22
Strong Double Blind
UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework

Tarun Kalluri · Sreyas Ravichandran · Manmohan Chandraker

In this work, we delve deep into the diverse factors that influence the efficacy of modern unsupervised domain adaptation (UDA) methods using a large-scale, controlled empirical study. To facilitate our analysis, we first develop UDA-Bench, a novel PyTorch framework that standardizes training and evaluation for domain adaptation enabling fair comparisons across several UDA methods. Using UDA-Bench, our comprehensive empirical study into the impact of backbone architectures, unlabeled data quantity, and pre-training datasets reveals that: (i) the benefits of adaptation methods diminish with advanced backbones, (ii) current methods underutilize unlabeled data, and (iii) pre-training data significantly affects downstream adaptation in both supervised and self-supervised settings. In the context of unsupervised adaptation, these observations uncover several novel and surprising properties, while scientifically validating several others that were often considered empirical heuristics or practitioner intuitions in the absence of a standardized training and evaluation framework. The UDA-Bench framework and trained models will be publicly released to promote future UDA research.


# 28
Strong Double Blind
Rethinking Few-shot Class-incremental Learning: Learning from Yourself

Yu-Ming Tang · Yi-Xing Peng · Jing-Ke Meng · WEISHI ZHENG

Few-shot class-incremental learning (FSCIL) aims to learn sequential classes with limited samples in a few-shot fashion. Inherited from the classical class-incremental learning setting, the popular benchmark of FSCIL uses averaged accuracy (aAcc) and last-task averaged accuracy (lAcc) as the evaluation metrics. However, we reveal that such evaluation metrics may not provide adequate emphasis on the novel class performance, and the continual learning ability of FSCIL methods could be ignored under this benchmark. In this work, as a complement to existing metrics, we offer a new metric called generalized average accuracy (gAcc), which is designed to provide an additional, more equitable evaluation by incorporating different perspectives of the performance under the guidance of a parameter α. We also present an overall metric in the form of the area under the curve (AUC) along α. Under the guidance of gAcc, we release the potential of intermediate features of the vision transformers to boost the novel-class performance. Taking information from intermediate layers, which are less class-specific and more generalizable, we manage to rectify the final features, leading to a more generalizable transformer-based FSCIL framework. Without complex network designs or cumbersome training procedures, our method outperforms existing FSCIL methods in terms of both aAcc and gAcc on three datasets.


# 23
Strong Double Blind
Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning

Minyeong Park · Jae-Ho Lee · Gyeong-Moon Park

Incremental Learning (IL) aims to accumulate knowledge from sequential input tasks while overcoming catastrophic forgetting. Existing IL methods typically assume that an incoming task has only increments of classes or domains, referred to as Class IL (CIL) or Domain IL (DIL), respectively. In this work, we consider a more challenging and realistic but under-explored IL scenario, named Versatile Incremental Learning (VIL), in which a model has no prior knowledge of whether classes or domains will increase in the next task. In the proposed VIL scenario, the model faces intra-class domain confusion and inter-domain class confusion, which prevent the model from accumulating new knowledge without interfering with previously learned knowledge. To address these issues, we propose a simple yet effective IL framework, named Incremental Classifier with Adaptation Shift cONtrol (ICON). Based on shifts of learnable modules, we design a novel regularization method called Cluster-based Adaptation Shift conTrol (CAST) to control the model to avoid confusion with the previously learned knowledge and thereby accumulate the new knowledge more effectively. Moreover, we introduce an Incremental Classifier (IC) which expands its output nodes to address the overwriting issue from different domains corresponding to a single class while maintaining the previous knowledge. We conducted extensive experiments on three benchmarks, showcasing the effectiveness of our method across all the scenarios, particularly in cases where the next task can be randomly altered.


# 88
Semantic Residual Prompts for Continual Learning

Martin Menabue · Emanuele Frascaroli · Matteo Boschini · Enver Sangineto · Lorenzo Bonicelli · Angelo Porrello · Simone Calderara

Prompt-tuning methods for Continual Learning (CL) freeze a large pre-trained model and focus training on a few parameter vectors termed prompts. Most of these methods organize these vectors in a pool of key-value pairs, and use the input image as query to retrieve the prompts (values). However, as keys are learned while tasks progress, the prompt selection strategy is itself subject to catastrophic forgetting, an issue often overlooked by existing approaches. For instance, prompts introduced to accommodate new tasks might end up interfering with previously learned prompts. To make the selection strategy more stable, we ask a foundation model (CLIP) to select our prompt within a two-level adaptation mechanism. Specifically, the first level leverages standard textual prompts for the CLIP textual encoder, leading to stable class prototypes. The second level, instead, uses these prototypes along with the query image as keys to index a second pool. The retrieved prompts serve to adapt a pre-trained ViT, granting plasticity. In doing so, we also propose a novel residual mechanism to transfer CLIP semantics to the ViT layers. Through extensive analysis on established CL benchmarks, we show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test. Notably, our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model, as showcased by experiments on satellite imagery and medical datasets. The codebase is available in the supplementary materials.


# 91
Strong Double Blind
Encapsulating Knowledge in One Prompt

Qi Li · Runpeng Yu · Xinchao Wang

In this paper, we propose a new knowledge transfer paradigm called Knowledge in One Prompt (KiOP). This paradigm encapsulates knowledge from various models into a solitary prompt without altering the original models or requiring access to the training data, which enables us to achieve efficient and convenient knowledge transfer in more realistic scenarios. From a practicality standpoint, this paradigm not only proves, for the first time, the effectiveness of visual prompts in data-inaccessible contexts, but also solves the problems of low model reusability and high storage resource consumption faced by traditional Data-Free Knowledge Transfer, which means that we can realize parallel knowledge transfer across multiple models without modifying any source model. Extensive experiments across various datasets and models demonstrate the efficacy of the proposed KiOP knowledge transfer paradigm. Without access to real training data and with rigorous storage capacity constraints, it is also capable of yielding considerable outcomes when dealing with cross-model backbone setups and handling parallel knowledge transfer requests involving multiple (more than 2) models.


# 18
Strong Double Blind
Representation Enhancement-Stabilization: Reducing Bias-Variance of Domain Generalization

Wei Huang · Yilei Shi · Zhitong Xiong · Xiao Xiang Zhu

Domain Generalization (DG) focuses on enhancing the generalization of deep learning models trained on multiple source domains to adapt to unseen target domains. This paper explores DG through the lens of bias-variance decomposition, uncovering that test errors in DG predominantly arise from cross-domain bias and variance. Inspired by this insight, we introduce a Representation Enhancement-Stabilization (RES) framework, comprising a Representation Enhancement (RE) module and a Representation Stabilization (RS) module. In RE, a novel set of feature frequency augmentation techniques is used to progressively reduce cross-domain bias during feature extraction. Furthermore, in RS, a novel Mutual Exponential Moving Average (MEMA) strategy is designed to stabilize model optimization for diminishing cross-domain variance during training. Collectively, the whole RES method can significantly enhance model generalization. We evaluate RES on five benchmark datasets and the results show that it outperforms multiple advanced DG methods. Our code will be publicly available.
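A speculative sketch of what a mutual EMA stabilization step could look like: each of two models keeps an EMA copy, and every live model is nudged slightly toward the EMA of the other. The coefficients, schedule, and the idea of cross-pulling are assumptions; RES's actual MEMA strategy may be defined differently.

```python
import copy
import torch

class MutualEMA:
    """Keep an EMA copy of each of two models and, at each step, nudge every
    live model toward the EMA of the *other* model, damping the variance
    between the two optimization paths.
    (Illustrative sketch of a mutual-EMA style update, not RES's exact rule.)"""

    def __init__(self, model_a, model_b, momentum: float = 0.999, mix: float = 0.01):
        self.ema_a = copy.deepcopy(model_a)
        self.ema_b = copy.deepcopy(model_b)
        self.momentum, self.mix = momentum, mix

    @torch.no_grad()
    def step(self, model_a, model_b):
        # Update each EMA copy from its own live model.
        for ema, live in ((self.ema_a, model_a), (self.ema_b, model_b)):
            for pe, pl in zip(ema.parameters(), live.parameters()):
                pe.data.mul_(self.momentum).add_((1 - self.momentum) * pl.data)
        # Cross-stabilization: each live model moves toward the other's EMA.
        for pl, pe in zip(model_a.parameters(), self.ema_b.parameters()):
            pl.data.mul_(1 - self.mix).add_(self.mix * pe.data)
        for pl, pe in zip(model_b.parameters(), self.ema_a.parameters()):
            pl.data.mul_(1 - self.mix).add_(self.mix * pe.data)
```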


# 94
Good Teachers Explain: Explanation-Enhanced Knowledge Distillation

Amin Parchami · Moritz Böhle · Sukrut Rao · Bernt Schiele

Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models. While it is well known that student models can achieve similar accuracies as the teachers, it has also been shown that they nonetheless often do not learn the same function. It is, however, often highly desirable that the student’s and teacher’s functions share similar properties such as basing the prediction on the same input features, as this ensures that students learn the ‘right features’ from the teachers. In this work, we explore whether this can be achieved by not only optimizing the classic KD loss but also the similarity of the explanations generated by the teacher and the student. Despite the idea being simple and intuitive, we find that our proposed ‘explanation-enhanced’ KD (e2KD) (1) consistently provides large gains in terms of accuracy and student-teacher agreement, (2) ensures that the student learns from the teacher to be right for the right reasons and to give similar explanations, and (3) is robust with respect to the model architectures, the amount of training data, and even works with ‘approximate’, pre-computed explanations.
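A sketch of an explanation-enhanced KD objective: a standard temperature-scaled KL distillation term plus a term encouraging the student's explanation to match the teacher's. Input-gradient saliency is used here purely for simplicity; e2KD's choice of explanation method (e.g., GradCAM or B-cos style explanations) and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def input_saliency(model, images, target_class):
    """Simple explanation: gradient of the target-class logit w.r.t. the input."""
    images = images.clone().requires_grad_(True)
    logits = model(images)
    score = logits.gather(1, target_class.unsqueeze(1)).sum()
    grad, = torch.autograd.grad(score, images, create_graph=True)
    return grad

def e2kd_style_loss(student, teacher, images, temperature: float = 4.0, lam: float = 1.0):
    """KD loss plus agreement between teacher and student explanations.
    (Illustrative sketch; the paper's exact explanation method may differ.)"""
    with torch.no_grad():
        t_logits = teacher(images)
    s_logits = student(images)
    kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                  F.softmax(t_logits / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2
    target = t_logits.argmax(dim=1)
    s_expl = input_saliency(student, images, target)
    t_expl = input_saliency(teacher, images, target).detach()
    expl_sim = F.cosine_similarity(s_expl.flatten(1), t_expl.flatten(1), dim=1).mean()
    return kd + lam * (1.0 - expl_sim)
```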


# 11
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

Yizhe Xiong · Hui Chen · Tianxiang Hao · Zijia Lin · Jungong Han · Yuesong Zhang · Guoxin Wang · Yongjun Bao · Guiguang Ding

Recently, the scale of transformers has grown rapidly, which introduces considerable challenges in terms of training overhead and inference efficiency in the scope of task adaptation. Existing works, namely Parameter-Efficient Fine-Tuning (PEFT) and model compression, have separately investigated the challenges. However, PEFT cannot guarantee the inference efficiency of the original backbone, especially for large-scale models. Model compression requires significant training costs for structure searching and re-training. Consequently, a simple combination of them cannot guarantee accomplishing both training efficiency and inference efficiency with minimal costs. In this paper, we propose a novel Parallel Yielding Re-Activation (PYRA) method for such a challenge of training-inference efficient task adaptation. PYRA first utilizes parallel yielding adaptive weights to comprehensively perceive the data distribution in downstream tasks. A re-activation strategy for token modulation is then applied for tokens to be merged, leading to calibrated token features. Extensive experiments demonstrate that PYRA outperforms all competing methods under both low compression rate and high compression rate, demonstrating its effectiveness and superiority in maintaining both training efficiency and inference efficiency for large-scale foundation models. Our code will be released to the public.


# 16
Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation

YUE XU · Yong-Lu Li · Kaitong Cui · Ziyu Wang · Cewu Lu · Yu-Wing Tai · Chi-Keung Tang

Data-efficient learning has garnered significant attention, especially given the current trend of large multi-modal models. Recently, dataset distillation has become an effective approach for data-efficiency by synthesizing data samples that are most essential for network training; however, it remains to be explored which samples are essential for the dataset distillation process itself. In this work, we study the data efficiency and selection for the dataset distillation task. By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset, both theoretically and empirically. Thus we propose to use the empirical loss value as a static data pruning criterion. To further compensate for the variation of the data value in training, we propose to find the most contributing samples based on their causal effects on the distillation. The proposed selection strategy can efficiently exploit the training dataset and outperform the previous SOTA distillation algorithms, and consistently enhance the distillation algorithms, even on much larger-scale and more heterogeneous datasets, e.g. full ImageNet-1K and Kinetics-400. We believe this paradigm will open up new avenues in the dynamics of distillation and pave the way for efficient dataset distillation. Our code will be made publicly available.
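The static criterion mentioned above (pruning by empirical loss value) is easy to sketch; whether low-loss or high-loss samples are kept, and the keep ratio, are left as choices here, and the causal-effect refinement from the abstract is not reproduced.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_based_pruning(model, loader, keep_ratio: float = 0.5, keep: str = "low", device: str = "cpu"):
    """Score every real sample by its empirical loss under a trained model and
    keep only a fraction of them for the distillation stage.
    Assumes a non-shuffled DataLoader so batch offsets map back to dataset indices."""
    losses, indices = [], []
    for batch_idx, (x, y) in enumerate(loader):
        logits = model(x.to(device))
        loss = F.cross_entropy(logits, y.to(device), reduction="none")
        losses.append(loss.cpu())
        start = batch_idx * loader.batch_size
        indices.append(torch.arange(start, start + len(y)))
    losses, indices = torch.cat(losses), torch.cat(indices)
    order = losses.argsort(descending=(keep == "high"))
    n_keep = int(keep_ratio * len(order))
    return indices[order[:n_keep]]   # indices of samples fed to the distillation algorithm
```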


# 9
Strong Double Blind
Dataset Distillation by Automatic Training Trajectories

Dai Liu · Jindong Gu · Hu Cao · Carsten Trinitis · Martin Schulz

Dataset Distillation is used to create a concise yet informative synthetic dataset that can replace the original dataset for training purposes. Some leading methods in this domain prioritize long-range matching, involving the unrolling of training trajectories with a fixed number of steps ($N_{S}$) on the synthetic dataset to align with various expert training trajectories. However, traditional long-range matching methods suffer from an overfitting-like problem: the fixed step size $N_{S}$ forces the synthetic dataset to conform, in a distorted way, to the expert training trajectories seen during distillation, resulting in a loss of generality, especially with respect to unencountered architectures. We refer to this as the Accumulated Mismatching Problem (AMP), and we propose a new approach, Automatic Training Trajectories (ATT), which dynamically and adaptively adjusts the trajectory length $N_{S}$ to address the AMP. Our method outperforms existing methods, particularly in tests involving cross-architecture evaluation. Moreover, owing to its adaptive nature, it exhibits enhanced stability in the face of parameter variations. Keywords: Dataset Distillation, Task-Specific Dataset Compression, Dataset Condensation.


# 10
Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction

Shuchi Wu · Chuan Ma · Kang Wei · Xiaogang Xu · Ming Ding · Yuwen Qian · Di Xiao · Tao Xiang

This paper introduces \textbf{RDA}, a pioneering approach designed to address two primary deficiencies prevalent in previous endeavors aiming at stealing pre-trained encoders: (1) suboptimal performances attributed to biased optimization objectives, and (2) elevated query costs stemming from the end-to-end paradigm that necessitates querying the target encoder every epoch. Specifically, we initially \textbf{\underline{R}}efine the representations of the target encoder for each training sample, thereby establishing a less biased optimization objective before the steal-training phase. This is accomplished via a sample-wise prototype, which consolidates the target encoder's representations for a given sample's various perspectives. Demanding exponentially fewer queries compared to the end-to-end approach, prototypes can be instantiated to guide subsequent query-free training. For more potent efficacy, we develop a multi-relational extraction loss that trains the surrogate encoder to \textbf{\underline{D}}iscriminate mismatched embedding-prototype pairs while \textbf{\underline{A}}ligning those matched ones in terms of both amplitude and angle. In this way, the trained surrogate encoder achieves state-of-the-art results across the board in various downstream datasets with limited queries. Moreover, RDA is shown to be robust to multiple widely-used defenses. Our code is available at https://anonymous.4open.science/r/RDA.


# 95
Strong Double Blind
Graph Neural Network Causal Explanation via Neural Causal Models

Arman Behnam · Binghui Wang

Graph neural network (GNN) explainers for graph classification aim to identify the important subgraph that ensures the prediction for a given graph. Until now, most existing GNN explainers are based on association, and the few that are causality-inspired remain association-based in essence. Association-based explainers are shown to be prone to spurious correlations. We propose CXGNN, a GNN causal explainer built on causal inference. Our explainer is based on the observation that a graph often contains a causal subgraph. Specifically, CXGNN includes three main steps: 1) Building the causal structure and the corresponding structural causal model (SCM) for a graph, which enables cause-effect calculation among nodes. 2) Directly calculating cause-effects in real-world graphs is computationally challenging; we therefore draw on the recently proposed neural causal model (NCM), a special type of SCM that is trainable, and design customized NCMs for GNNs. By training these GNN NCMs, the cause-effects can be easily calculated. 3) We uncover the subgraph that causally explains the GNN predictions via the well-trained GNN NCMs. Evaluation results on multiple synthetic and real-world graphs validate that CXGNN significantly outperforms the existing GNN explainers in exactly finding the ground-truth explanations.


# 17
Strong Double Blind
Optimization-based Uncertainty Attribution Via Learning Informative Perturbations

Hanjing Wang · Bashirul Azam Biswas · Qiang Ji

Uncertainty Attribution (UA) aims to understand the sources of uncertainty in deep learning models by identifying key contributors to predictive uncertainty. To improve the faithfulness of existing UA methods, we formulate UA as an optimization problem to learn a binary mask on the input, identifying regions that significantly contribute to output uncertainty. The learned mask allows uncertainty reduction through learning informative perturbations on the masked input. Our method enhances UA interpretability and maintains high efficiency by integrating three key improvements: Segment Anything Model (SAM)-guided mask parameterization for efficient and interpretable mask learning; learnable perturbations that surpass traditional techniques by adaptively targeting and refining problematic regions specific to each input without the need for manually tuning the perturbation parameters; and a novel application of Gumbel-sigmoid reparameterization for efficiently learning Bernoulli-distributed binary masks under continuous optimization. Our experiments on the detection of problematic regions and faithfulness tests demonstrate our method's superiority over state-of-the-art UA methods.
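The Gumbel-sigmoid (binary concrete) reparameterization named above is a standard trick; a minimal version for learning Bernoulli-distributed masks under continuous optimization is sketched below. The usage over SAM segments in the trailing comment is a hypothetical illustration of how per-region mask logits might be attached.

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, temperature: float = 0.5, hard: bool = False):
    """Differentiable sample from a relaxed Bernoulli (binary concrete):
    y = sigmoid((logits + g) / temperature), with g = log(u) - log(1 - u), u ~ U(0, 1).
    With hard=True, a straight-through estimator returns exact 0/1 masks in the
    forward pass while keeping gradients from the relaxed sample."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    g = torch.log(u) - torch.log1p(-u)
    y_soft = torch.sigmoid((logits + g) / temperature)
    if hard:
        y_hard = (y_soft > 0.5).float()
        return y_hard + (y_soft - y_soft.detach())   # straight-through gradient
    return y_soft

# Hypothetical usage: one learnable logit per SAM-proposed segment.
# mask_logits = torch.zeros(num_segments, requires_grad=True)
# segment_mask = gumbel_sigmoid(mask_logits)        # values in (0, 1), one per segment
```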


# 7
Strong Double Blind
Generalizable Symbolic Optimizer Learning

Xiaotian Song · Peng Zeng · Yanan Sun · Andy Song

Existing automated symbolic optimizer design methods necessitate the use of proxies, often resulting in significant performance degradation when transferring to a target domain. In this paper, we propose a learning based model called Symbolic Optimizer Learner (SOL) that can discover high-performance symbolic optimizers directly on the target. SOL is integrated with symbols and can be directly transformed into a symbolic optimizer. In addition, an unrolled optimization approach is introduced for SOL training. SOL can be embedded into the training process of neural networks, optimizing the target directly without any proxies. Our extensive experiments demonstrate the good performance and high generalizability of SOL through diverse tasks, ranging from classification to adversarial attacks, and from GNN to NLP tasks. On image classification, SOL achieved ~5x speedup and ~3\% accuracy gain. On adversarial attacks, SOL achieved the best attack success rate across seven SOTA defense models. On GNN training, optimizers discovered by SOL outperform Adam on three different datasets. On BERT fine-tuning, SOL also outperformed AdamW on five benchmarks.


# 6
Strong Double Blind
CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction

Shengke Sun · Ziqian Luan · Zhanshan Zhao · Shijie Luo · Zhanshan Zhao

Generative Adversarial Networks (GANs) have received considerable attention due to their outstanding ability to generate images. However, training a GAN is hard since the game between the Generator (G) and the Discriminator (D) is unfair. Towards making the competition fairer, we propose a new perspective on training GANs, named Consistent Latent Representation and Reconstruction (CLR-GAN). In this paradigm, we treat G and D as inverse processes: the discriminator has an additional task of restoring the pre-defined latent code while the generator also needs to reconstruct the real input, thus establishing a relationship between the latent space of G and the output features of D. Based on this prior, we can put D and G on an equal footing during training using a new criterion. Experimental results on various datasets and architectures prove that our paradigm can make GANs more stable and generate better-quality images (31.22% FID gain on CIFAR-10 and 39.5% on AFHQ-Cat, respectively). We hope that the proposed perspective can inspire researchers to explore different ways of viewing GAN training, rather than being limited to a two-player game. The code will be publicly available soon at [Removed for blind review].


# 8
Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation

Sangyeop Yeo · Yoojin Jang · Jaejun Yoo

In this paper, we address the challenge of compressing generative adversarial networks (GANs) for deployment in resource-constrained environments by proposing two novel methodologies: Distribution Matching for Efficient compression (DiME) and Network Interactive Compression via Knowledge Exchange and Learning (NICKEL). DiME employs foundation models as embedding kernels for efficient distribution matching, leveraging maximum mean discrepancy to facilitate effective knowledge distillation. Simultaneously, NICKEL employs an interactive compression method that enhances the communication between the student generator and discriminator, achieving a balanced and stable compression process. Our comprehensive evaluation on the StyleGAN2 architecture with the FFHQ dataset shows the effectiveness of our approach, with NICKEL & DiME achieving FID scores of 10.45 and 15.93 at compression rates of 95.73% and 98.92%, respectively. Remarkably, our methods sustain generative quality even at an extreme compression rate of 99.69%, surpassing the previous state-of-the-art performance by a large margin. These findings not only demonstrate our methodologies' capacity to significantly lower GANs' computational demands but also pave the way for deploying high-quality GAN models in settings with limited resources. Our code will be released soon.
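A sketch of distribution matching with maximum mean discrepancy in the embedding space of a frozen foundation model, as the DiME description suggests; the RBF kernel, single bandwidth, and biased MMD estimate are simplifications, and the actual embedding kernels and losses used in the paper may differ.

```python
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Squared maximum mean discrepancy (biased estimate) with an RBF kernel
    between two sets of embeddings x and y of shape (N, D) and (M, D)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def dime_style_loss(frozen_encoder, teacher_images, student_images):
    """Match the distribution of the compressed generator's samples to the
    teacher generator's samples in the embedding space of a frozen foundation
    model. (Illustrative sketch of MMD-based distribution matching.)"""
    with torch.no_grad():
        t_emb = frozen_encoder(teacher_images)
    s_emb = frozen_encoder(student_images)   # gradients flow back into the student generator
    return rbf_mmd2(t_emb, s_emb)
```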


# 13
Strong Double Blind
Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense

Jeremy Styborski · Mingzhi Lyu · YI HUANG · Wai-Kin Adams Kong

Availability poisons exploit supervised learning (SL) algorithms by introducing class-related shortcut features in images such that models trained on poisoned data are useless for real-world datasets. Self-supervised learning (SSL), which utilizes augmentations to learn instance discrimination, is regarded as a strong defense against poisoned data. However, by extending the study of SSL across multiple poisons on the CIFAR-10 and ImageNet-100 datasets, we demonstrate that it often performs poorly, far below that of training on clean data. Leveraging the vulnerability of SL to poison attacks, we introduce adversarial training (AT) on SL to obfuscate poison features and guide robust feature learning for SSL. Our proposed defense, designated VESPR (Vulnerability Exploitation of Supervised Poisoning for Robust SSL), surpasses the performance of six previous defenses across seven popular availability poisons. VESPR displays superior performance over all previous defenses, boosting the minimum and average ImageNet-100 test accuracies of poisoned models by 16% and 9%, respectively. Through analysis and ablation studies, we elucidate the mechanisms by which VESPR learns robust class features. Code for VESPR will be made available upon conference acceptance of this paper.


# 14
SSL-Cleanse: Trojan Detection and Mitigation in Self-Supervised Learning

Mengxin Zheng · Jiaqi Xue · Zihao Wang · Xun Chen · Qian Lou · Lei Jiang · Xiaofeng Wang

Self-supervised learning (SSL) is a prevalent approach for encoding data representations. Using a pre-trained SSL image encoder and subsequently training a downstream classifier, impressive performance can be achieved on various tasks with very little labeled data. The growing adoption of SSL has led to an increase in security research on SSL encoders and associated Trojan attacks. Trojan attacks embedded in SSL encoders can operate covertly, spreading across multiple users and devices. The presence of backdoor behavior in Trojaned encoders can inadvertently be inherited by downstream classifiers, making it even more difficult to detect and mitigate the threat. Although current Trojan detection methods in supervised learning can potentially safeguard SSL downstream classifiers, identifying and addressing triggers in the SSL encoder before its widespread dissemination is a challenging task. This challenge arises because downstream tasks might be unknown, dataset labels may be unavailable, and the original unlabeled training dataset might be inaccessible during Trojan detection in SSL encoders. We introduce \textbf{SSL-Cleanse} as a solution to identify and mitigate backdoor threats in SSL encoders. We evaluated SSL-Cleanse on various datasets using 1200 encoders, achieving an average detection success rate of $82.2\%$ on ImageNet-100. After mitigating backdoors, backdoored encoders achieve a $0.3\%$ attack success rate on average without significant accuracy loss, proving the effectiveness of SSL-Cleanse.


# 195
How Video Meetings Change Your Expression

Sumit Sarin · Utkarsh Mall · Purva Tendulkar · Carl Vondrick

Does our behavior change when we speak over video calls? Given two unpaired sets of videos of people, we seek to automatically find spatio-temporal patterns that are distinctive of each set. Existing methods use discriminative approaches and perform post-hoc explainability analysis. Such methods are insufficient as they are unable to provide insights beyond obvious dataset biases, and the explanations are useful only if humans themselves are good at the task. Instead, we tackle the problem through the lens of generative domain translation: our method generates a detailed report of learned, input-dependent spatio-temporal features and the extent to which they vary between the domains. We demonstrate that our method can discover behavioral differences between conversing face-to-face (F2F) and on video-calls (VCs). We also show the applicability of our method on discovering differences in presidential communication styles. Additionally, we are able to predict temporal change-points in videos that decouple expressions in an unsupervised way, and increase the interpretability and usefulness of our model. Finally, our method, being generative, can be used to transform a video call to appear as if it were recorded in a F2F setting. Experiments and visualizations show our approach is able to discover a range of behaviors, taking a step towards deeper understanding of human behaviors.


# 42
Strong Double Blind
MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks

Elad Hirsch · Gefen Dawidowicz · Ayellet Tal

Generating medical reports for X-ray images is a challenging task, particularly in an unpaired scenario where paired image-report data is unavailable for training. To address this challenge, we propose a novel model that leverages the available information in two distinct datasets, one comprising reports and the other consisting of images. The core idea of our model revolves around the notion that combining auto-encoding report generation with multi-modal (report-image) alignment can offer a solution. However, the challenge persists regarding how to achieve this alignment when pair correspondence is absent. Our proposed solution involves the use of auxiliary tasks, particularly contrastive learning and classification, to position related images and reports in close proximity to each other. This approach differs from previous methods that rely on pre-processing steps using external information stored in a knowledge graph. Our model, named MedRAT, surpasses previous state-of-the-art methods, demonstrating the feasibility of generating comprehensive medical reports without the need for paired data or external tools.