Skip to yearly menu bar Skip to main content


Poster Session

Poster Session 3

Exhibition Area
Wed 2 Oct 1:30 a.m. PDT — 3:30 a.m. PDT
Abstract:
Chat is not available.


# 95
Parrot Captions Teach CLIP to Spot Text

Yiqi Lin · Conghui He · Alex Jinpeng Wang · Bin Wang · Li Weijia · Mike Zheng Shou

Despite CLIP being the foundation model in numerous vision-language applications, the CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to `Parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content and around 30% of captions words are concurrently embedded in the visual content. Based on such observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is a dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models across various vision-language downstream tasks. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.


# 32
Strong Double Blind
Towards Model-Agnostic Dataset Condensation by Heterogeneous Models

Jun-Yeong Moon · Jung Uk Kim · Gyeong-Moon Park

The advancement of deep learning has coincided with the proliferation of both models and available data. The surge in dataset sizes and the subsequent surge in computational requirements have led to the development of the Dataset Condensation (DC). While prior studies have delved into generating synthetic images through methods like distribution alignment and training trajectory tracking for more efficient model training, a significant challenge arises when employing these condensed images practically. Notably, these condensed images tend to be specific to particular models, constraining their versatility and practicality. In response to this limitation, we introduce a novel method, Heterogeneous Model Dataset Condensation (HMDC), designed to produce universally applicable condensed images through cross-model interactions. To address the issues of gradient magnitude difference and semantic distance in models when utilizing heterogeneous models, we propose the Gradient Balance Module (GBM) and Mutual Distillation (MD) with the Spatial-Semantic Decomposition method. By balancing the contribution of each model and maintaining their semantic meaning closely, our approach overcomes the limitations associated with model-specific condensed images and enhances the broader utility.


# 148
Strong Double Blind
VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking

Jens Hellekes · Manuel Mühlhaus · Reza Bahmanyar · Seyed Majid Azimi · Franz Kurz

The informative power of traffic analysis can be enhanced by considering changes in both time and space. Vehicle tracking algorithms applied to drone videos provide a better overview than street-level surveillance cameras. However, existing aerial MOT datasets only cover stationary settings, leaving the performance in moving-camera scenarios covering a considerably larger area unknown. To fill this gap, we present VETRA, a dataset for vehicle tracking in aerial imagery introducing heterogeneity in terms of camera movement, frame rate, as well as type, size and number of objects. When dealing with these challenges, state-of-the-art online MOT algorithms exhibit a significant decrease in performance compared to other benchmark datasets. Despite the performance gains achieved by our baseline method through the integration of camera motion compensation, there remains potential for improvement, particularly in situations where vehicles have similar visual appearance, prolonged occlusions, and complex urban driving patterns. Making VETRA available to the community adds a missing building block for both testing and developing vehicle tracking algorithms for versatile real-world applications.


# 147
Insect Identification in the Wild: The AMI Dataset

Aditya Jain · Fagner Cunha · Michael J Bunsen · Juan Sebastián Cañas · Léonard Pasi · Nathan Pinoy · Flemming Helsing · JoAnne Russo · Marc S Botham · Michael Sabourin · Jonathan Fréchette · Alexandre Anctil · Yacksecari Lopez · Eduardo Navarro · Filonila Pérez · Ana C Zamora · Jose Alejandro Ramirez-Silva · Jonathan Gagnon · Tom A August · Kim Bjerge · Alba Gomez Segura · Marc Belisle · Yves Basset · Kent P McFarland · David B Roy · Toke T Høye · Maxim Larrivee · David Rolnick

Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as an answer for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms and introduce a combination of data augmentation techniques that enhance generalization across geographies and hardware setups. Code and datasets will be made publicly available.


# 106
Towards Open-ended Visual Quality Comparison

Haoning Wu · Hanwei Zhu · Zicheng Zhang · Erli Zhang · Chaofeng Chen · Liang Liao · Chunyi Li · Annan Wang · Wenxiu Sun · Qiong Yan · Xiaohong Liu · Guangtao Zhai · Shiqi Wang · Weisi Lin

Comparative settings (e.g. pairwise choice, listwise ranking) have been adopted by a wide range of subjective studies for image quality assessment (IQA), as it inherently standardizes the evaluation criteria across different observers and offer more clear-cut responses. In this work, we extend the edge of emerging large multi-modality models (LMMs) to further advance visual quality comparison into open-ended settings, that 1) can respond to open-range questions on quality comparison; 2) can provide detailed reasonings beyond direct answers. To this end, we propose the Co-Instruct. To train this first-of-its-kind open-source open-ended visual quality comparer, we collect the Co-Instruct-562K dataset, from two sources: (a) LLM-merged single image quality description, (b) GPT-4V “teacher” responses on unlabeled data. Furthermore, to better evaluate this setting, we propose the MICBench, the first benchmark on multi-image comparison for LMMs. We demonstrate that Co-Instruct not-only achieves in average 30% higher accuracy than state-of-the-art open-source LMMs, but also outperforms GPT-4V (its teacher), on both existing quality-related benchmarks and the proposed MICBench. We will publish our datasets, training scripts and model weights upon acceptance.


# 116
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

Cong Wei · Yang Chen · Haonan Chen · Hexiang Hu · Ge Zhang · Jie Fu · Alan Ritter · Wenhu Chen

Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To approach such different information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are keys to UniIR's generalization ability. Additionally, we construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.


# 146
Strong Double Blind
MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

Ziqiang Zheng · Yiwei Chen · Huimin Zeng · Tuan-Anh Vu · Binh-Son Hua · Sai Kit Yeung

Recent foundation models trained on a tremendous scale of data have shown great promise in a wide range of computer vision tasks and application domains. However, less attention has been paid to the marine realms, which in contrast cover the majority of our blue planet. The scarcity of labeled data is the most hindering issue, and marine photographs illustrate significantly different appearances and contents from general in-air images. Using existing foundation models for marine visual analysis does not yield satisfactory performance, due to not only the data distribution shift, but also the intrinsic limitations of the existing foundation models (e.g., lacking semantics, redundant mask generation, or restricted to image-level scene understanding). In this work, we emphasize both model and data approaches for understanding marine ecosystems. We introduce MarineInst, a foundation model for the analysis of the marine realms with instance visual description, which outputs instance masks and captions for marine object instances. To train MarineInst, we acquire MarineInst20M, the largest marine image dataset to date, which contains a wide spectrum of marine images with high-quality semantic instance masks constructed by a mixture of human-annotated instance masks and model-generated instance masks from our automatic procedure of binary instance filtering. To generate informative and detailed semantic instance captions, we use vision-language models to produce semantic richness with various granularities. Our model and dataset support a wide range of marine visual analysis tasks, from image-level scene understanding to regional mask-level instance understanding. More significantly, MarineInst exhibits strong generalization ability and flexibility to support a wide range of downstream tasks with state-of-the-art performance.


# 152
Adaptive Correspondence Scoring for Unsupervised Medical Image Registration

Xiaoran Zhang · John C. Stendahl · Lawrence H. Staib · Albert J. Sinusas · Alex Wong · James S. Duncan

We propose an adaptive training scheme for unsupervised medical image registration. Existing methods rely on image reconstruction as the primary supervision signal. However, nuisance variables (e.g. noise and covisibility) often cause the loss of correspondence between medical images, violating the Lambertian assumption in physical waves (e.g. ultrasound) and consistent image acquisition. As the unsupervised learning scheme relies on intensity constancy between images to establish correspondence for reconstruction, this introduces spurious error residuals that are not modeled by the typical training objective. To mitigate this, we propose an adaptive framework that re-weights the error residuals with a correspondence scoring map during training, preventing the parametric displacement estimator from drifting away due to noisy gradients, which leads to performance degradation. To illustrate the versatility and effectiveness of our method, we tested our framework on three representative registration architectures across three medical image datasets along with other baselines. Our adaptive framework consistently outperforms other methods both quantitatively and qualitatively. Paired t-tests show that our improvements are statistically significant. Code available at: \url{https://anonymous.4open.science/r/AdaCS-E8B4/}.


# 145
Strong Double Blind
Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View

Jianan Fan · Dongnan Liu · Canran Li · Hang Chang · Heng Huang · Filip Braet · Mei Chen · Weidong Cai

Cellular nuclei recognition serves as a fundamental and essential step in the workflow of digital pathology. However, with disparate source organs and staining procedures among histology image clusters, the scanned tiles inherently conform to a non-uniform data distribution, which induces deteriorated promises for general cross-domain usages. Despite the latest efforts leveraging domain adaptation to mitigate distributional discrepancy, those methods are subjected to modeling the morphological characteristics of each cell individually, disregarding the hierarchical latent structure and intrinsic contextual correspondences across the tumor micro-environment. In this work, we identify the importance of implicit correspondences across biological contexts for exploiting domain-invariant pathological composition and thereby propose to explore the dependence over various biological structures for domain adaptive nuclei recognition. We discover those high-level correspondences via unsupervised contextual modeling and use them as bridges to facilitate adaptation at image and instance feature levels. In addition, to further exploit the rich spatial contexts embedded amongst nuclear communities, we propose self-adaptive dynamic distillation to secure instance-aware trade-offs across different model constituents. The proposed method is extensively evaluated on a broad spectrum of cross-domain settings under miscellaneous data distribution shifts and outperforms the state-of-the-art methods by a substantial margin.


# 110
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology

Xiao Zhou · Xiaoman Zhang · Chaoyi Wu · Ya Zhang · Weidi Xie · Yanfeng Wang

In this paper, we consider the problem of visual representation learning for computational pathology, by exploiting large-scale image-text pairs gathered from public resources, along with the domain-specific knowledge in pathology. Specifically, we make the following contributions: (i) We curate a pathology knowledge tree that consists of 50,470 informative attributes for 4,718 diseases requiring pathology diagnosis from 32 human tissues. To our knowledge, this is the first comprehensive structured pathology knowledge base; (ii) We develop a knowledge-enhanced visual-language pretraining approach, where we first project pathology-specific knowledge into latent embedding space via language model, and use it to guide the learning of visual representation; (iii) We conduct thorough experiments to validate the effectiveness of our proposed components, demonstrating significant performance improvement on various downstream tasks, including cross-modal retrieval, zero-shot classification on pathology patches, and zero-shot tumor subtyping on whole slide images (WSIs). All codes, models and the pathology knowledge tree will be released to the research community.


# 156
Strong Double Blind
SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images

Jintu Zheng · Yi Ding · Qizhe Liu · Yuehui Chen · Yi Cao · Ying Hu · Zenan Wang

Traditional fluorescence staining is phototoxic to live cells, slow, and expensive; thus, the subcellular structure prediction (SSP) from transmitted light (TL) images is emerging as a label-free, faster, low-cost alternative. However, existing approaches utilize 3D etworks for one-to-one voxel level dense prediction, which necessitates a frequent and time-consuming Z-axis imaging process. Moreover, 3D convolutions inevitably lead to significant computation and GPU memory overhead. Therefore, we propose an efficient framework, SparseSSP, predicting fluorescent intensities within the target voxel grid in an efficient paradigm instead of relying entirely on 3D topologies. In particular, SparseSSP makes two pivotal improvements to prior works. First, SparseSSP introduces a one-to-many voxel mapping paradigm, which permits the sparse TL slices to reconstruct the subcellular structure. Secondly, we propose a hybrid dimensions topology, which folds the Z-axis information into channel features, enabling the 2D network layers to tackle SSP under low computational cost. We conduct extensive experiments to validate the effectiveness and advantages of SparseSSP on diverse sparse imaging ratios, and our approach achieves a leading performance compared to pure 3D topologies. SparseSSP reduces imaging frequencies compared to previous dense-view SSP (i.e., the number of imaging is reduced up to 87.5% at most), which is significant in visualizing rapid biological dynamics on low-cost devices and samples.


# 154
Strong Double Blind
CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos

JIEWEN YANG · Yiqun Lin · Bin Pu · Jiarong GUO · Xiaowei Xu · Xiaomeng Li

Echocardiography plays a crucial role in analyzing cardiac function and diagnosing cardiac diseases. Current deep neural network methods primarily aim to enhance diagnosis accuracy by incorporating prior knowledge, such as segmenting cardiac structures or lesions annotated by human experts. However, diagnosing the inconsistent behaviours of the heart, which exist across both spatial and temporal dimensions, remains extremely challenging. For instance, the analysis of cardiac motion acquires both spatial and temporal information from the heartbeat cycle. To address this issue, we propose a novel reconstruction-based approach named CardiacNet to learn a better representation of local cardiac structures and motion abnormalities through echocardiogram videos. CardiacNet is accompanied by the Consistency Deformation Codebook (CDC) and the Consistency Deformed-Discriminator (CDD) to learn the commonalities across abnormal and normal samples by incorporating cardiac prior knowledge. In addition, we propose benchmark datasets named CardiacNet-PAH and CardiacNet-ASD for evaluating the effectiveness of cardiac disease assessment. In experiments, our CardiacNet can achieve state-of-the-art results in downstream tasks, i.e., classification and regression, on public datasets CAMUS, EchoNet, and our datasets. The codes and datasets will be publicly available.


# 108
PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

YUXUAN SUN · Hao Wu · Chenglu Zhu · Sunyi Zheng · Qizi Chen · Kai Zhang · Yunlong Zhang · Dan Wan · Xiaoxiao Lan · Mengyue Zheng · Jingxiong Li · Xinheng Lyu · Tao Lin · Lin Yang

The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmark impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs). It comprises 33,428 multimodal multi-choice questions and 24,067 images from various sources, each accompanied by an explanation for the correct answer. The construction of PathMMU harnesses GPT-4V's advanced capabilities, utilizing over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As in a cascading process. Significantly, to maximize PathMMU's authority, we invite seven pathologists to scrutinize each question under strict standards in PathMMU's validation and test sets, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-sourced and 4 closed-sourced LMMs and their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark, with the top-performing LMM, GPT-4V, achieving only a 49.8% zero-shot performance, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, significantly smaller open-sourced LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that the PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.


# 105
PointLLM: Empowering Large Language Models to Understand Point Clouds

Runsen Xu · Xiaolong Wang · Tai Wang · Yilun Chen · Jiangmiao Pang · Dahua Lin

The unprecedented advancements in Large Language Models (LLMs) have shown a profound impact on natural language processing but are yet to fully embrace the realm of 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, empowering LLMs to understand point clouds and offering a new avenue beyond 2D data. PointLLM understands colored object point clouds with human instructions and generates contextually appropriate responses, illustrating its grasp of point clouds and common sense. Specifically, it leverages a point cloud encoder with a powerful LLM to effectively fuse geometric, appearance, and linguistic information. To overcome the scarcity of point-text instruction following data, we developed an automated data generation pipeline, collecting a large-scale dataset of more than 730K samples with 660K different objects, which facilitates the adoption of the two-stage training strategy prevalent in MLLM development. Additionally, we address the absence of appropriate benchmarks and the limitations of current evaluation metrics by proposing two novel benchmarks: Generative 3D Object Classification and 3D Object Captioning, which are supported by new, comprehensive evaluation metrics derived from human and GPT analyses. Through exploring various training strategies, we develop PointLLM, significantly surpassing 2D and 3D baselines, with a notable achievement in human-evaluated object captioning tasks where it surpasses human annotators in over 50% of the samples. Codes, datasets, and benchmarks are available at https://github.com/OpenRobotLab/PointLLM.


# 157
Strong Double Blind
HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation

Tianpei Zou · Sanqing Qu · Zhijun Li · Alois C. Knoll · 何 良华 · Guang Chen · Changjun Jiang

3D point cloud segmentation has received significant interest for its growing applications. However, the generalization ability of models suffers in dynamic scenarios due to the distribution shift between test and training data. To promote robustness and adaptability across diverse scenarios, test-time adaptation (TTA) has recently been introduced. Nevertheless, most existing TTA methods are developed for images, and limited approaches applicable to point clouds ignore the inherent hierarchical geometric structures in point cloud streams, i.e., local (point-level), global (object-level), and temporal (frame-level) structures. In this paper, we delve into TTA in 3D point cloud segmentation and propose a novel Hierarchical Geometry Learning (HGL) framework. HGL comprises three complementary modules from local, global to temporal learning in a bottom-up manner. Technically, we first construct a local geometry learning module for pseudo-label generation. Next, we build prototypes from the global geometry perspective for pseudo-label fine-tuning. Furthermore, we introduce a temporal consistency regularization module to mitigate negative transfer. Extensive experiments on four datasets demonstrate the effectiveness and superiority of our HGL. Remarkably, on the SynLiDAR to SemanticKITTI task, HGL achieves an overall mIoU of 46.91\%, improving GIPSO by 3.0\% and significantly reducing the required adaptation time by 80\%.


# 149
Strong Double Blind
Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather

Junsung Park · Kyungmin Kim · Hyunjung Shim

Existing LiDAR semantic segmentation methods commonly face performance declines in adverse weather conditions. Prior research has addressed this issue by simulating adverse weather or employing universal data augmentation during training.However, these methods lack a detailed analysis and understanding of how adverse weather negatively affects LiDAR semantic segmentation performance.Motivated by this issue, we characterized adverse weather in several factors and conducted a toy experiment to identify the main factors causing performance degradation: (1) Geometric perturbation due to refraction caused by fog or droplet in the air and (2) Point drop due to energy absorption and occlusions.Based on this analysis, we propose new strategic data augmentation techniques. Specifically, we first introduced a Selective Jittering (SJ) that jitters points in the random range of depth (or angle) to mimic geometric perturbation. Additionally, we developed a Learnable Point Drop (LPD) to learn vulnerable erase patterns with Deep Q-Learning Network to approximate point drop phenomenon from adverse weather conditions.Without precise weather simulation, these techniques strengthen the LiDAR semantic segmentation model by exposing it to vulnerable conditions identified by our data-centric analysis. Experimental results confirmed the suitability of the proposed data augmentation methods for enhancing robustness against adverse weather conditions. Our method attains a remarkable 39.5 mIoU on the SemanticKITTI-to-SemanticSTF benchmark, surpassing the previous state-of-the-art by over 5.4%p, tripling the improvement over the baseline compared to previous methods achieved.


# 158
Strong Double Blind
RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation

Zhiyuan Zhang · Licheng Yang · Zhiyu Xiang

Despite the progress on 3D point cloud deep learning, most prior works focus on learning features that are invariant to translation and point permutation, and very limited efforts have been devoted for rotation invariant property. Several recent studies achieve rotation invariance at the cost of lower accuracies. In this work, we close this gap by proposing a novel yet effective rotation invariant architecture for 3D point cloud classification and segmentation. Instead of traditional pointwise operations, we construct local triangle surfaces to capture more detailed surface structure, based on which we can extract highly expressive rotation invariant surface properties which are then integrated into an attention-augmented convolution operator named RISurAAConv to generate refined attention features via self-attention layers. Based on RISurAAConv we build an effective neural network for 3D point cloud analysis that is invariant to arbitrary rotations while maintaining high accuracy. We verify the performance on various benchmarks with supreme results obtained surpassing the previous state-of-the-art by a large margin. We achieve 95.3% (+4.3%) on ModelNet40, 92.6% (+12.3%) on ScanObjectNN, and 96.4% (+7.0%), 87.6% (+13.0%), 88.7%} (+7.7%) respectively on the three categories of FG3D dataset for fine-grained classification task and achieve 81.5% (+1.0%) mIoU on ShapeNet for segmentation task, respectively. The code and models will be released upon publication.


# 151
Strong Double Blind
RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation

Luis Li · Hubert P. H. Shum · Toby P Breckon

3D point clouds play a pivotal role in outdoor scene perception, especially in the context of autonomous driving. Recent advancements in 3D LiDAR segmentation often focus intensely on the spatial positioning and distribution of points for accurate segmentation. However, these methods, while robust in variable conditions, encounter challenges due to sole reliance on coordinates and point intensity, leading to poor isometric invariance and suboptimal segmentation. To tackle this challenge, our work introduces Range-Aware Pointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg architecture. Our RAPiD features exhibit rigid transformation invariance and effectively adapt to variations in point density, with a design focus on capturing the localized geometry of neighboring structures. They utilize inherent LiDAR isotropic radiation and semantic categorization for enhanced local representation and computational efficiency, while incorporating a 4D distance metric that integrates geometric and surface material reflectivity for improved semantic segmentation. To effectively embed high-dimensional RAPiD features, we propose a double-nested autoencoder structure with a novel class-aware embedding objective to encode high-dimensional features into manageable voxel-wise embeddings. Additionally, we propose RAPiD-Seg which incorporates a channel-wise attention fusion and a two-stage training strategy, further optimizing the embedding for enhanced performance and generalization. Our method outperforms contemporary LiDAR segmentation work in terms of mIoU on SemanticKITTI (76.1) and nuScenes (83.6) datasets (leaderboard rankings: 1st on both datasets).


# 159
Strong Double Blind
KeypointDETR: An End-to-End 3D Keypoint Detector

Hairong Jin · Yuefan Shen · Jianwen Lou · Kun Zhou · Youyi Zheng

3D keypoint detection plays a pivotal role in 3D shape analysis. The majority of prevalent methods depend on producing a shared heatmap. This approach necessitates subsequent post-processing techniques such as clustering or non-maximum suppression (NMS) to pinpoint keypoints within high-confidence regions, resulting in performance inefficiencies. To address this issue, we introduce KeypointDETR, an end-to-end 3D keypoint detection framework. KeypointDETR is predominantly trained with a bipartite matching loss, which compels the network to forecast sets of heatmaps and probabilities for potential keypoints. Each heatmap highlights one keypoint's location, and the associated probability indicates not only the presence of that specific keypoint but also its semantic consistency. Together with the bipartite matching loss, we utilize a transformer-based network architecture, which incorporates both point-wise and query-wise self-attention within the encoder and decoder, respectively. The point-wise encoder leverages self-attention mechanisms on a dynamic graph derived from the local feature space of each point, resulting in the generation of heatmap features. As a key part of our framework, the query-wise decoder not only facilitates inter-query information exchange but also captures the underlying connections among keypoints' heatmaps, positions, and semantic attributes via the cross-attention mechanism, enabling the prediction of heatmaps and probabilities. Extensive experiments conducted on the KeypointNet dataset reveal that KeypointDETR outperforms competitive baselines, demonstrating superior performance in keypoint saliency and correspondence estimation tasks.


# 210
Strong Double Blind
All You Need is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation

Seongho Kim · Byung Cheol Song

With the rise of generative models, multi-modal video generation has gained significant attention, particularly in the realm of audio-driven emotional talking face synthesis. This paper addresses two key challenges in this domain: input bias and intensity saturation. A neutralization scheme is proposed to counter input bias, yielding impressive results in generating neutral talking faces from emotionally expressive ones. Furthermore, 2D continuous emotion label-based regression learning effectively generates varying emotional intensities through frame-wise comparisons. Results from a user study quantify subjective interpretations of strong emotions and naturalness, revealing up to 78.09% higher emotion accuracy and up to 3.41 higher naturalness compared to the lowest-ranked method.


# 215
TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

Jiahe Li · Jiawei Zhang · Xiao Bai · Jin Zheng · Xin Ning · Jun Zhou · Lin Gu

Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, due to the difficulty in fitting steep appearance changes, the prevailing paradigm that presents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging the point-based Gaussian Splatting, facial motions can be represented in our method by applying smooth and continuous deformations to persistent Gaussian primitives, without requiring to learn the difficult appearance change like previous methods. Due to this simplification, precise facial motions can be synthesized while keeping a highly intact facial feature. Under such a deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this conflict, we decompose the model into two branches separately for the face and inside mouth areas, therefore simplifying the learning tasks to help reconstruct more accurate motion and structure of the mouth region. Extensive experiments demonstrate that our method renders high-quality lip-synchronized talking head videos, with better facial fidelity and higher efficiency than previous methods.


# 213
HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting

Helisa Dhamo · Yinyu Nie · Arthur Moreau · Song Jifei · Richard Shaw · Yiren Zhou · Eduardo Pérez Pellitero

3D head animation has seen major quality and runtime improvements over the last few years, particularly empowered by the advances in differentiable rendering and neural radiance fields. Real-time rendering is a highly desirable goal for real-world applications. We propose HeadGaS, a model that uses 3D Gaussian Splats (3DGS) for 3D head reconstruction and animation. In this paper we introduce a hybrid model that extends the explicit 3DGS representation with a base of learnable latent features, which can be linearly blended with low-dimensional parameters from parametric head models to obtain expression-dependent color and opacity values. We demonstrate that HeadGaS delivers state-of-the-art results in real-time inference frame rates, surpassing baselines by up to 2 dB, while accelerating rendering speed by over x10.


# 214
Strong Double Blind
Stable Video Portraits

Mirela Ostrek · Justus Thies

Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today, especially considering the impeccable images of synthesized lifelike faces. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D single-frame video generation method that outputs photorealistic videos of talking faces leveraging the large pre-trained text-to-image prior Stable Diffusion (2D), controlled via a temporal sequence of 3DMMs (3D). To bridge the gap between 2D and 3D, the 3DMM conditionings are projected onto a 2D image plane and person-specific finetuning is performed using a single identity video via ControlNet. For higher temporal stability, we introduce a novel inference procedure that considers the previous frame during the denoising process. Furthermore, a technique to morph the facial appearance of the person-specific avatar into a text-defined celebrity, without test-time finetuning, is demonstrated. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods. Code and data will be released.


# 223
Strong Double Blind
iHuman: Instant Animatable Digital Humans From Monocular Videos

Pramish Paudel · Anubhav Khanal · Danda Pani Paudel · Jyoti Tandukar · Ajad Chhatkuli

Personalized 3D avatars require an animatable representation of digital humans. Doing so instantly from monocular videos offers scalability to broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method utilizes the efficiency of Gaussian splatting to model both 3D geometry and appearance. However, we observed that naively optimizing Gaussian splats results in inaccurate geometry, thereby leading to poor animations. This work achieves and illustrates the need of accurate 3D mesh-type modelling of the human body for animatable digitization through Gaussian splats. This is achieved by developing a novel pipeline that benefits from three key aspects: (a) implicit modelling of surface's displacements and the color's spherical harmonics; (b) binding of 3D Gaussians to the respective triangular faces of the body template; (c) a novel technique to render normals followed by their auxiliary supervision. Our exhaustive experiments on three different benchmark datasets demonstrates the state-of-the-art results of our method, in limited time settings. In fact, our method is faster by an order of magnitude (in terms of training time) than its closest competitor. At the same time, we achieve superior rendering and 3D reconstruction performance under the change of poses. Our source code will be made publicly available.


# 219
Strong Double Blind
POCA: Post-training Quantization with Temporal Alignment for Codec Avatars

Jian Meng · Yuecheng Li · CHENGHUI Li · Syed Shakib Sarwar · Dilin Wang · Jae-sun Seo

Real-time decoding generates high-quality assets for rendering photorealistic Codec Avatars for immersive social telepresence with AR/VR. However, high-quality avatar decoding incurs expensive computation and memory consumption, which necessitates the design of a decoder compression algorithm (e.g., quantization). Although quantization has been widely studied, the quantization of avatar decoders is an urgent yet under-explored need. Furthermore, the requirement of fast ``User-Avatar'' deployment prioritizes the post-training quantization (PTQ) over the time-consuming quantization-aware training (QAT). As the first work in this area, we reveal the sensitivity of the avatar decoding quality under low precision. In particular, the state-of-the-art (SoTA) QAT and PTQ algorithms introduce massive amounts of temporal noise to the rendered avatars, even with the well-established 8-bit precision. To resolve these issues, a novel PTQ algorithm is proposed for quantizing the avatar decoder with low-precision weights and activation (8-bit and 6-bit), without introducing temporal noise to the rendered avatar. Furthermore, the proposed method only needs 10% of the activations of each layer to calibrate quantization parameters without any distribution manipulations or extensive boundary search. The proposed method is evaluated on various face avatars with different facial characteristics. The proposed method compresses the decoder model by 5.3x while recovering the quality on par with the full precision baseline. In addition to the avatar rendering tasks, POCA is also applicable to image resolution enhancement tasks, achieving new SoTA image quality. The source code of POCA will be released soon.


# 326
Towards Image Ambient Lighting Normalization

Florin-Alexandru Vasluianu · Tim Seizinger · Zongwei Wu · Rakesh Ranjan · Radu Timofte

Lighting normalization is a crucial but underexplored restoration task with broad applications. However, existing works often simplify this task within the context of shadow removal, limiting the light sources to one and oversimplifying the scene, thus excluding complex self-shadows and restricting surface classes to smooth ones. Although promising, such simplifications hinder generalizability to more realistic settings encountered in daily use. In this paper, we propose a new challenging task termed Ambient Lighting Normalization (ALN), which enables the study of interactions between shadows, unifying image restoration and shadow removal in a broader context. To address the lack of appropriate datasets for ALN, we introduce the large-scale high-resolution dataset Ambient6K, comprising samples obtained from multiple light sources and including self-shadows resulting from complex geometries, which is the first of its kind. For benchmarking, we select various mainstream methods and rigorously evaluate them on Ambient6K. Additionally, we propose IFBlend, a novel strong baseline that maximizes Image-Frequency joint entropy to selectively restore local areas under different lighting conditions, without relying on shadow localization priors. Experiments show that IFBlend achieves SOTA scores on Ambient6K and exhibits competitive performance on conventional shadow removal benchmarks compared to shadow-specific models with mask priors. The dataset, benchmark, and code are available at https://github.com/fvasluianu97/IFBlend.


# 161
Strong Double Blind
LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models

Hai Jiang · Ao Luo · Xiaohong Liu · Songchen Han · Shuaicheng Liu

In this paper, we propose a diffusion-based unsupervised framework that incorporates physically explainable Retinex theory with diffusion models for low-light image enhancement, named LightenDiffusion. Specifically, we present a content-transfer decomposition network that performs Retinex decomposition within the latent space instead of image space as in previous approaches, enabling the encoded features of unpaired low-light and normal-light images to be decomposed into content-rich reflectance maps and content-free illumination maps. Subsequently, the reflectance map of the low-light image and the illumination map of the normal-light one are taken as input to the diffusion model for unsupervised restoration with the guidance of the low-light feature, where a self-constrained consistency loss is further proposed to eliminate the interference of normal-light content on the restored results to improve overall visual quality. Extensive experiments on publicly available real-world benchmarks show that the proposed LightenDiffusion outperforms state-of-the-art unsupervised competitors and is comparable to supervised ones while being more generalizable to various scenes. Code will be released to facilitate future research.


# 331
Strong Double Blind
Efficient Snapshot Spectral Imaging: Calibration-Free Parallel Structure with Aperture Diffraction Fusion

Tao Lv · Lihao Hu · Shiqiao Li · Chenglong Huang · Xun Cao

Aiming to address the repetitive need for complex calibration in existing snapshot spectral imaging methods, while better trading off system complexity and spectral reconstruction accuracy, we demonstrate a novel Parallel Coded Calibration-free Aperture Diffraction Imaging Spectrometer (PCCADIS) with minimal parallel complexity and enhanced light throughput. The system integrates monochromatic acquisition of diffraction-induced blurred spatial-spectral projections with uncalibrated chromatic filtering for guidance, enabling portable and lightweight implementation. In the inverse process of PCCADIS, aperture diffraction produces blurred patterns through global replication and band-by-band integration, posing challenges in registration with clearly structured RGB information by feature matching. Therefore, the Self-aware Spectral Fusion Cascaded Transformer (SSFCT) is proposed to realize the fusion and reconstruction of uncalibrated input, which demonstrates the potential to substantially improve the accuracy of snapshot spectral imaging while concurrently reducing the associated costs. Our methodology is rigorously evaluated through comprehensive simulation experiments and real reconstruction experiments with a prototype, confirming the high accuracy and user-friendliness of the PCCADIS.


# 340
Strong Double Blind
Physically Plausible Color Correction for Neural Radiance Fields

Qi Zhang · Ying Feng · HONGDONG LI

Neural Radiance Fields have become the representation of choice for many 3D computer vision and graphics applications, e.g., novel view synthesis and 3D reconstruction. Multi-camera systems are commonly used as the image capture setup for NeRF-based multi-view tasks such as dynamic scene acquisition or realistic avatar animation. However, a critical issue that has often been overlooked in this setup is the evident differences in color responses among multiple cameras, which adversely affect the NeRF reconstruction performance. These color discrepancies among multiple input images stem from two aspects: 1) implicit properties of the scenes such as reflections and shadings, and 2) external differences in camera settings and lighting conditions. In this paper, we address this problem by proposing a novel color correction module that simulates the physical color processing in cameras to be embedded in NeRF, enabling the unified color NeRF reconstruction. Besides the view-independent color correction module for external differences, we predict a view-dependent function to minimize the color residual (including, e.g., specular and shading) to eliminate the impact of inherent attributes. We further describe how the method can be extended with a reference image as guidance to achieve aesthetically plausible color consistency and color translation on novel views. Experiments validate that our method is superior to baseline methods in both quantitative and qualitative evaluations of color correction and color consistency.


# 49
DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images

Zaid Tasneem · Akshat Dave · Abhishek Singh · Kushagra Tiwary · Praneeth Vepakomma · Ashok Veeraraghavan · Ramesh Raskar

Neural radiance fields (NeRFs) show potential for transforming images captured worldwide into immersive 3D visual experiences. However, most of this captured visual data remains siloed in our camera rolls as these images contain personal details. Even if made public, the problem of learning 3D representations of billions of scenes captured daily in a centralized manner is computationally intractable. Our approach, DecentNeRF, is the first attempt at decentralized, crowd-sourced NeRFs that require $\sim 10^4\times$ less server computing for a scene than a centralized approach. Instead of sending the raw data, our approach requires users to send a 3D representation, distributing the high computation cost of training centralized NeRFs between the users. It learns photorealistic scene representations by decomposing users' 3D views into personal and global NeRFs and a novel optimally weighted aggregation of only the latter. We validate the advantage of our approach to learn NeRFs with photorealism and minimal server computation cost on structured synthetic and real-world photo tourism datasets. We further analyze how secure aggregation of global NeRFs in DecentNeRF minimizes the undesired reconstruction of personal content by the server.


# 344
Volumetric Rendering with Baked Quadrature Fields

Gopal Sharma · Daniel Rebain · Kwang Moo Yi · Andrea Tagliasacchi

We propose a novel Neural Radiance Field (NeRF) representation for non-opaque scenes that allows fast inference by utilizing textured polygons. Despite the high-quality novel view rendering that NeRF provides, a critical limitation is that it relies on volume rendering that can be computationally expensive and does not utilize the advancements in modern graphics hardware. Many existing methods for this problem fall short when it comes to modelling volumetric effects as they rely purely on surface rendering. We thus propose to model the scene with polygons, which can then be used to obtain the quadrature points required to model volumetric effects, and also their opacity and colour from the texture. To obtain such polygonal mesh, we train a specialized field whose zero-crossings would correspond to the quadrature points when volume rendering, and perform marching cubes on this field. We then perform ray-tracing and utilize the fragment shaders to obtain the final colour image. Our method allows an easy integration with existing graphics frameworks allowing rendering speed of over 100 frames-per-second for a $1920\times1080$ image, while still being able to represent non-opaque objects.


# 338
Depth-guided NeRF Training via Earth Mover’s Distance

Anita Rau · Josiah Aklilu · Floyd C Holsinger · Serena Yeung-Levy

Neural Radiance Fields (NeRFs) are trained to minimize the rendering loss of predicted viewpoints. However, the photometric loss often does not provide enough information to disambiguate between different possible geometries yielding the same image. Previous work has thus incorporated depth supervision during NeRF training, leveraging dense predictions from pre-trained depth networks as pseudo-ground truth. While these depth priors are assumed to be perfect once filtered for noise, in practice, their accuracy is more challenging to capture. This work proposes a novel approach to uncertainty in depth priors for NeRF supervision. Instead of using custom-trained depth or uncertainty priors, we use off-the-shelf pretrained diffusion models to predict depth and capture uncertainty during the denoising process. Because we know that depth priors are prone to errors, we propose to supervise the ray termination distance distribution with Earth Mover's Distance instead of enforcing the rendered depth to replicate the depth prior exactly through L2-loss. Our depth-guided NeRF outperforms all baselines on standard depth metrics by a large margin while maintaining performance on photometric measures.


# 339
RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF

Sibi Catley-Chandar · Richard Shaw · Greg Slabaugh · Eduardo Pérez Pellitero

Recent advances in neural rendering have enabled highly photorealistic 3D scene reconstruction and novel view synthesis. Despite this, current state of the art methods struggle to reconstruct high frequency detail, due to factors such as a low-frequency bias of radiance fields and inaccurate camera calibration. One approach to mitigate this issue is to enhance images post-rendering. 2D enhancers can be pre-trained to recover some detail but are agnostic to scene geometry and do not easily generalize to new distributions of image degradation. Conversely, existing 3D enhancers are able to transfer detail from nearby training images in a generalizable manner, but suffer from inaccurate camera calibration and can propagate errors from the geometry into image renderings. We propose a neural rendering enhancer, RoGUENeRF, which exploits the best of both worlds. Our method is pre-trained to learn a general enhancer while also leveraging information from nearby training images via robust 3D alignment and geometry-aware fusion. Our approach restores high-frequency textures while maintaining geometric consistency and is also robust to inaccurate camera calibration. We show that RoGUENeRF significantly enhances the rendering quality of a wide range of neural rendering baselines, e.g. improving the PSNR of MipNeRF360 by 0.63dB and Nerfacto by 1.34dB on the real world 360v2 dataset.


# 342
Deblurring 3D Gaussian Splatting

Byeonghyeon Lee · Howoong Lee · Xiangyu Sun · Usman Ali · Eunbyung Park

Recent studies in Radiance Fields have paved the robust way for novel view synthesis with their photorealistic rendering quality. Nevertheless, they usually employ neural networks and volumetric rendering, which are costly to train and impede their broad use in various real-time applications due to the lengthy rendering time. Lately 3D Gaussians splatting-based approach has been proposed to model the 3D scene, and it achieves remarkable visual quality while rendering the images in real-time. However, it suffers from severe degradation in the rendering quality if the training images are blurry. Blurriness commonly occurs due to the lens defocusing, object motion, and camera shake, and it inevitably intervenes in clean image acquisition. Several previous studies have attempted to render clean and sharp images from blurry input images using neural fields. The majority of those works, however, are designed only for volumetric rendering-based neural radiance fields and are not straightforwardly applicable to rasterization-based 3D Gaussian splatting methods. Thus, we propose a novel real-time deblurring framework, Deblurring 3D Gaussian Splatting, using a small Multi-Layer Perceptron (MLP) that manipulates the covariance of each 3D Gaussian to model the scene blurriness. While Deblurring 3D Gaussian Splatting can still enjoy real-time rendering, it can reconstruct fine and sharp details from blurry images. A variety of experiments have been conducted on the benchmark, and the results have revealed the effectiveness of our approach for deblurring.


# 341
Strong Double Blind
Distractor-Free Novel View Synthesis via Exploiting Memorization Effect in Optimization

Yukun Wang · Kunhong Li · Minglin Chen · Longguang Wang · Shunbo Zhou · Kaiwen Xue · Yulan Guo

Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have greatly advanced novel view synthesis, which is capable of photo-realistic rendering. However, these methods require the foundational assumption of the static scene (e.g., consistent lighting condition and persistent object positions), which is often violated in real-world scenarios. In this study, we introduce MemE, an unsupervised plug-and-play module, to achieve high-quality novel view synthesis in noisy input scenarios. MemE leverages the inherent property in parameter optimization, known as the memorization effect to achieve distractor filtering and can be easily combined with NeRF or 3DGS. Furthermore, MemE is applicable in environments both with and without distractors, significantly enhancing the adaptability of NeRF and 3DGS across diverse input scenarios. Extensive experiments show that our methods (i.e., MemE-NeRF and MemE-3DGS) achieve state-of-the-art performance on both real and synthetic noisy scenes.


# 343
TriNeRFLet: A Wavelet Based Triplane NeRF Representation

Rajaei Khatib · RAJA GIRYES

In recent years, the neural radiance field (NeRF) model has gained popularity due to its ability to recover complex 3D scenes. Following its success, many approaches proposed different NeRF representations in order to further improve both runtime and performance. One such example is Triplane, in which NeRF is represented using three 2D feature planes. This enables easily using existing 2D neural networks in this framework, e.g., to generate the three planes. Despite its advantage, the triplane representation lagged behind in 3D recovery quality compared to NeRF solutions. In this work, we propose the TriNeRFLet framework, where we learn the wavelet representation of the triplane and regularize it. This approach has multiple advantages: (i) it allows information sharing across scales and regularization of high frequencies; (ii) it facilitates performing learning in a multi-scale fashion; and (iii) it provides a `natural' framework for performing NeRF super-resolution (SR), such that the low-resolution wavelet coefficients are computed from the provided low-resolution multi-view images and the high frequencies are acquired under the guidance of a pre-trained 2D diffusion model. We show the SR approach's advantage on both Blender and LLFF datasets.


# 337
Strong Double Blind
LaRa: Efficient Large-Baseline Radiance Fields

Anpei Chen · Haofei Xu · Stefano Esposito · Siyu Tang · Andreas Geiger

Radiance field methods have achieved photorealistic novel view synthesis and geometry reconstruction. But they are mostly applied in per-scene optimization or small-baseline settings. While several recent works investigate feed-forward reconstruction with large baselines by utilizing transformers, they all operate with a standard global attention mechanism and hence ignore the local nature of 3D reconstruction. We propose a method that unifies local and global reasoning in transformer layers, resulting in improved quality and faster convergence. Our model represents scenes as Gaussian Volumes and combines this with an image encoder and Group Attention Layers for efficient feed-forward reconstruction. Experimental results show significant improvement over previous work in reconstructing both appearance and geometry, and robustness to zero-shot and out-of-domain testing. Our code and models will be made publicly available.


# 327
RANRAC: Robust Neural Scene Representations via Random Ray Consensus

Benno Buschmann · Andreea Dogaru · Elmar Eisemann · Michael Weinmann · Bernhard Egger

Learning-based scene representations such as neural radiance fields or light field networks, that rely on fitting a scene model to image observations, commonly encounter challenges in the presence of inconsistencies within the images caused by occlusions, inaccurately estimated camera parameters or effects like lens flare. To address this challenge, we introduce RANdom RAy Consensus (RANRAC), an efficient approach to eliminate the effect of inconsistent data, thereby taking inspiration from classical RANSAC based outlier detection for model fitting. In contrast to the down-weighting of the effect of outliers based on robust loss formulations, our approach reliably detects and excludes inconsistent perspectives, resulting in clean images without floating artifacts. For this purpose, we formulate a fuzzy adaption of the RANSAC paradigm, enabling its application to large scale models. We interpret the minimal number of samples to determine the model parameters as a tunable hyperparameter, investigate the generation of hypotheses with data-driven models, and analyse the validation of hypotheses in noisy environments. We demonstrate the compatibility and potential of our solution for both photo-realistic robust multi-view reconstruction from real-world images based on neural radiance fields and for single-shot reconstruction based on light-field networks. In particular, the results indicate significant improvements compared to state-of-the-art robust methods for novel-view synthesis on both synthetic and captured scenes with various inconsistencies including occlusions, noisy camera pose estimates, and unfocused perspectives. The results further indicate significant improvements for single-shot reconstruction from occluded images.


# 336
Strong Double Blind
SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization

Mae Younes · Amine Ouasfi · Adnane Boukhayma

We present a novel approach for recovering 3D shape and view dependent appearance from a few colored images, enabling efficient 3D reconstruction and novel view synthesis. Our method learns an implicit neural representation in the form of a Signed Distance Function (SDF) and a radiance field. The model is trained progressively through ray tracing enabled volumetric rendering, and regularized with learning-free multi-view stereo (MVS) cues. Key to our contribution is a novel implicit neural shape function learning strategy that encourages our SDF field to be as linear as possible near the level-set, hence robustifying the training against noise emanating from the supervision and regularization signals. Without using any pretrained priors, our method, called SparseCraft, achieves state-of-the-art performances both in novel-view synthesis and reconstruction from sparse views in standard benchmarks, while requiring less than 10 minutes for training.


# 324
Strong Double Blind
Learning Representations from Foundation Models for Domain Generalized Stereo Matching

Yongjian Zhang · Longguang Wang · Kunhong Li · WANG Yun · Yulan Guo

State-of-the-art stereo matching networks trained on in-domain data often underperform on cross-domain scenes. Intuitively, leveraging the zero-shot capacity of a foundation model can alleviate the cross-domain generalization problem. The main challenge of incorporating a foundation model into stereo matching pipeline lies in the absence of an effective forward process from single-view coarse-grained tokens to cross-view fine-grained cost representations. In this paper, we propose FormerStereo, a general framework that integrates the Vision Transformer (ViT) based foundation model into the stereo matching pipeline. Using this framework, we transfer the all-purpose features to matching-specific ones. Specifically, we propose a reconstruction-constrained decoder to retrieve fine-grained representations from coarse-grained ViT tokens. To maintain cross-view consistent representations, we propose a cosine-constrained concatenation cost (C4) space to construct cost volumes. We integrate FormerStereo with state-of-the-art (SOTA) stereo matching networks and evaluate its effectiveness on multiple benchmark datasets. Experiments show that the FormerStereo framework effectively improves the zero-shot performance of existing stereo matching networks on unseen domains and achieves SOTA performance.


# 330
CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

Jiawei Zhang · Jiahe Li · Xiaohan Yu · Lei Huang · Lin Gu · Jin Zheng · Xiao Bai

3D Gaussian Splatting (3DGS) creates a radiance field consisting of 3D Gaussians to represent a scene. With sparse training views, 3DGS easily suffers from overfitting, negatively impacting the reconstruction quality. This paper introduces a new co-regularization perspective for improving sparse-view 3DGS. When training two 3D Gaussian radiance fields with the same sparse views of a scene, we observe that the two radiance fields exhibit point disagreement and rendering disagreement that can unsupervisedly predict reconstruction quality, stemming from the sampling implementation in densification. We further quantify the point disagreement and rendering disagreement by evaluating the registration between Gaussians' point representations and calculating differences in their rendered pixels. The empirical study demonstrates the negative correlation between the two disagreements and accurate reconstruction, which allows us to identify inaccurate reconstruction without accessing ground-truth information. Based on the study, we propose CoR-GS, which identifies and suppresses inaccurate reconstruction based on the two disagreements: (1) Co-pruning considers Gaussians that exhibit high point disagreement in inaccurate positions and prunes them. (2) Pseudo-view co-regularization considers pixels that exhibit high rendering disagreement are inaccurately rendered and suppress the disagreement. Results on LLFF, Mip-NeRF360, DTU, and Blender demonstrate that CoR-GS effectively regularizes the scene geometry, reconstructs the compact representations, and achieves state-of-the-art novel view synthesis quality under sparse training views.


# 319
CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field

Jiarui Hu · Xianhao Chen · Boyin Feng · Guanglin Li · Liangjing Yang · Hujun Bao · Guofeng Zhang · Zhaopeng Cui

Recently neural radiance fields (NeRF) have been widely exploited as 3D representations for dense simultaneous localization and mapping (SLAM). Despite their notable successes in surface modeling and novel view synthesis, existing NeRF-based methods are hindered by their computationally intensive and time-consuming volume rendering pipeline. This paper presents an efficient dense RGB-D SLAM system, i.e., CG-SLAM, based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. Through an in-depth analysis of Gaussian Splatting, we propose several techniques to construct a consistent and stable 3D Gaussian field suitable for tracking and mapping. Additionally, a novel depth uncertainty model is proposed to ensure the selection of valuable Gaussian primitives during optimization, thereby improving tracking efficiency and accuracy. Experiments on various datasets demonstrate that CG-SLAM achieves superior tracking and mapping performance with a notable tracking speed of up to 15 Hz. We will make our source code publicly available.


# 328
Strong Double Blind
SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction

Marko Mihajlovic · Sergey Prokudin · Siyu Tang · Robert Maier · Federica Bogo · Tony Tung · Edmond Boyer

Digitizing static scenes and dynamic events from multi-view images has long been a challenge in the fields of computer vision and graphics. Recently, 3D Gaussian Splatting has emerged as a practical and scalable method for reconstruction, gaining popularity due to its impressive quality of reconstruction, real-time rendering speeds, and compatibility with widely used visualization tools. However, the method requires a substantial number of input views to achieve high-quality scene reconstruction, introducing a significant practical bottleneck. This challenge is especially pronounced in capturing dynamic scenes, where deploying an extensive camera array can be prohibitively costly. In this work, we identify the lack of spatial autocorrelation as one of the factors contributing to the suboptimal performance of the 3DGS technique in sparse reconstruction settings. To address the issue, we propose an optimization strategy that effectively regularizes splat features by modeling them as the outputs of a corresponding implicit neural field. This results in a consistent enhancement of reconstruction quality across various scenarios. Our approach adeptly manages both static and dynamic cases, as demonstrated by extensive testing across different setups and scene complexities.


# 329
On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy

Letian Huang · Jiayang Bai · Jie Guo · Yuanqi Li · Yanwen Guo

3D Gaussian Splatting has garnered extensive attention and application in real-time neural rendering. Concurrently, concerns have been raised about the limitations of this technology in aspects such as point cloud storage, performance, and robustness in sparse viewpoints, leading to various improvements. However, there has been a notable lack of attention to the fundamental problem of projection errors introduced by the local affine approximation inherent in the splatting itself, and the consequential impact of these errors on the quality of photo-realistic rendering. This paper addresses the projection error function of 3D Gaussian Splatting, commencing with the residual error from the first-order Taylor expansion of the projection function. The analysis establishes a correlation between the error and the Gaussian mean position. Subsequently, leveraging function optimization theory, this paper analyzes the function's minima to provide an optimal projection strategy for Gaussian Splatting referred to Optimal Gaussian Splatting, which can accommodate a variety of camera models. Experimental validation further confirms that this projection methodology reduces artifacts, resulting in a more convincingly realistic rendering.


# 334
Revising Densification in Gaussian Splatting

Samuel Rota Bulò · Lorenzo Porzi · Peter Kontschieder

In this paper, we address the limitations of Adaptive Density Control (ADC) in 3D Gaussian Splatting (3DGS), a scene representation method achieving high-quality, photorealistic results for novel view synthesis. ADC has been introduced for automatic 3D point primitive management, controlling densification and pruning, however, with certain limitations in the densification logic. Our main contribution is a more principled, pixel-error driven formulation for density control in 3DGS, leveraging an auxiliary, per-pixel error function as the criterion for densification. We further introduce a mechanism to control the total number of splats generated per scene and correct a bias in the current opacity handling in ADC during primitive cloning. Our approach leads to consistent quality improvements across a variety of benchmark scenes, without sacrificing the method’s efficiency.


# 325
Strong Double Blind
MesonGS: Post-training Compression of 3D Gaussians via Efficient Attribute Transformation

Shuzhao Xie · Weixiang Zhang · Chen Tang · Yunpeng Bai · Rongwei Lu · Shjia Ge · Zhi Wang

3D Gaussian Splatting demonstrates excellent quality and speed in novel view synthesis. Nevertheless, the significant size of the 3D Gaussians presents challenges for transmission and storage. Current approaches employ compact models to compress the substantial volume and attributes of 3D Gaussians, along with intensive training to uphold quality. These endeavors demand considerable finetuning time, presenting formidable hurdles for practical deployment. To this end, we propose \emph{MesonGS}, a codec for post-training compression of 3D Gaussians. Initially, we introduce a measurement criterion that considers both view-dependent and view-independent factors to assess the impact of each Gaussian point on the rendering output, enabling the removal of insignificant points. Subsequently, we decrease the entropy of attributes through two transformations that complement subsequent entropy coding techniques to enhance the file compression rate. More specifically, we first replace the rotation quaternion with Euler angles; then, we apply region adaptive hierarchical transform (RAHT) to key attributes to reduce entropy. Lastly, we suggest block quantization to control quantization granularity, thereby avoiding excessive information loss caused by quantization. Moreover, a finetune scheme is introduced to restore quality. Extensive experiments demonstrate that MesonGS significantly reduces the size of 3D Gaussians while preserving competitive quality. Notably, our method can achieve better compression quality than fine-tuned concurrent methods without additional retraining on the Mip-NeRF 360 dataset.


# 224
Strong Double Blind
Topology-Preserving Downsampling of Binary Images

Chia-Chia Chen · Chi-Han Peng

We present a novel discrete optimization-based approach to generate downsampled versions of binary images that are guaranteed to have the same topology as the original, measured by the zeroth and first Betti numbers of the black regions, while having good similarity to the original image as measured by IoU and Dice scores. To our best knowledge, all existing binary image downsampling methods don't have such topology-preserving guarantees. We also implemented a baseline morphological operation (dilation)-based approach that always generates topologically correct results. However, we found the similarity scores to be much worse. We demonstrate several applications of our approach. First, generating smaller versions of medical image segmentation masks for easier human inspection. Second, improving the efficiency of binary image operations, including persistent homology computation and shortest path computation, by substituting the original images with smaller ones. In particular, the latter is a novel application that is made feasible only by the full topology-preservation guarantee of our method.


# 313
Zero-Shot Multi-Object Scene Completion

Shun Iwase · Katherine Liu · Vitor Guizilini · Adrien Gaidon · Kris Kitani · Rares Ambrus · Sergey Zakharov

We present a 3D scene completion method that recovers the complete geometry of multiple unseen objects in complex scenes from a single RGB-D image. Despite notable advancements in single-object 3D shape completion, high-quality reconstructions in highly cluttered real-world multi-object scenes remains a challenge. To address this issue, we propose OctMAE, an architecture that leverages an Octree U-Net and a latent 3D MAE to achieve high-quality and near real-time multi-object scene completion through both local and global geometric reasoning. Because a naive 3D MAE can be computationally intractable and memory intensive even in the latent space, we introduce a novel occlusion masking strategy and adopt 3D rotary embeddings, which significantly improves the runtime and scene completion quality. To generalize to a wide range of objects in diverse scenes, we create a large-scale photorealistic dataset, featuring a diverse set of 12K 3D object models from the Objaverse dataset which are rendered in multi-object scenes with physics-based positioning. Our method outperforms the current state-of-the-art on both synthetic and real-world datasets and demonstrates a strong zero-shot capability.


# 225
Strong Double Blind
PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance

Aoming Liu · Zhong Li · Zhang Chen · Nannan Li · Yi Xu · Bryan Plummer

Immersive scene generation, notably panorama creation, benefits significantly from the adaptation of large pre-trained text-to-image (T2I) models for multi-view image generation. Due to the high cost of acquiring multi-view images, tuning-free generation is preferred. However, existing methods are either limited by simple correspondences or require extensive fine-tuning for complex ones. We present PanoFree, a novel method for tuning-free multi-view image generation that supports an extensive array of correspondences. PanoFree sequentially generates multi-view images using iterative warping and inpainting, addressing the key issues of inconsistency and artifacts due to error accumulation without the need for fine-tuning. It improves error accumulation by enhancing cross-view awareness and refining the warping and inpainting processes through cross-view guidance, risky area estimation and erasing, and symmetric bidirectional guided generation for loop closure, alongside guidance-based semantic and density control for scene structure preservation. Evaluated on various panorama types—Planar, 360°, and Full Spherical Panoramas—PanoFree demonstrates significant error reduction, improved global consistency, and image quality across different scenarios without extra fine-tuning. Compared to existing methods, PanoFree is up to 5x more efficient in time and 3x more efficient in GPU memory usage, and maintains superior diversity of results (66.7% vs. 33.3% in our user study). PanoFree offers a viable alternative to costly fine-tuning or the use of additional pre-trained models.


# 209
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

Junlin Han · Filippos Kokkinos · Philip Torr

This paper presents a novel paradigm for building scalable 3D generative models utilizing pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, texts, or videos, 3D data are not readily accessible and are difficult to acquire. This results in a significant disparity in scale compared to the vast quantities of other types of data. To address this issue, we propose using a video diffusion model, trained with extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds and achieves superior performance when compared to current SOTA feed-forward 3D generative models, with users preferring our results over 70% of the time.


# 335
Strong Double Blind
Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction

Dian Jia · Xiaoqian Ruan · Kun Xia · Zhiming Zou · Le Wang · Wei Tang

Deep learning approaches have made significant success in single-view 3D reconstruction, but they often rely on expensive 3D annotations for training. Recent efforts tackle this challenge by adopting an analysis-by-synthesis paradigm to learn 3D reconstruction with only 2D annotations. However, existing methods face limitations in both shape reconstruction and texture generation. This paper introduces an innovative Analysis-by-Synthesis Transformer that addresses these limitations in a unified framework by effectively modeling pixel-to-shape and pixel-to-texture relationships. It consists of a Shape Transformer and a Texture Transformer. The Shape Transformer employs learnable shape queries to fetch pixel-level features from the image, thereby achieving high-quality mesh reconstruction and recovering occluded vertices. The Texture Transformer employs texture queries for non-local gathering of texture information and thus eliminates incorrect inductive bias. Experimental results on CUB-200-2011 and ShapeNet datasets demonstrate superior performance in shape reconstruction and texture generation compared to previous methods. We will make code publicly available.


# 314
Strong Double Blind
Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping

Minseong Park · Suhan Woo · Euntai Kim

Learning efficient representations of local feature is a key challenge in feature volume-based 3D neural mapping, especially in large-scale environments. In this paper, we introduce Decomposition-based Neural Mapping (DNMap), a storage-efficient large-scale 3D mapping method employing discrete representation based on decomposition strategy. This decomposition strategy aims to efficiently capture repetitive and representative patterns of shapes by decomposing each discrete embedding into component vectors that are shared across the embedding space. Our DNMap optimizes a set of component vectors, rather than entire discrete embeddings, and learns composition rather than indexing the discrete embeddings. Furthermore, to complement the mapping quality, we additionally learn low-resolution continuous embeddings that require tiny storage space. By combining these representations with a shallow neural network and an efficient octree-based feature volume, our DNMap successfully approximates signed distance function and an efficient octree-based feature volume, our DNMap successfully compress the feature volume while preserving mapping quality.


# 308
Strong Double Blind
COSMU: Complete 3D human shape from monocular unconstrained images

Marco Pesavento · Marco Volino · Adrian Hilton

We present a novel framework to reconstruct complete 3D human shapes from a given target image by leveraging monocular unconstrained images. The objective of this work is to reproduce high-quality details in regions of the reconstructed human body that are not visible in the input target. The proposed methodology addresses the limitations of existing approaches for reconstructing 3D human shapes from a single image, which cannot reproduce shape details in occluded body regions. The missing information of the monocular input can be recovered by using multiple views captured from multiple cameras. However, multi-view reconstruction methods necessitate accurately calibrated and registered images, which can be challenging to obtain in real-world scenarios. Given a target RGB image and a collection of multiple uncalibrated and unregistered images of the same individual, acquired using a single camera, we propose a novel framework to generate complete 3D human shapes. We introduce a novel module to generate 2D multi-view normal maps of the person registered with the target input image. The module consists of body part-based reference selection and body part-based registration. The generated 2D normal maps are then processed by a multi-view attention-based neural implicit model that estimates an implicit representation of the 3D shape, ensuring the reproduction of details in both observed and occluded regions. Extensive experiments demonstrate that the proposed approach estimates higher quality details in the occluded regions of the 3D clothed human shapes from a set of unconstrained images compared to related methods, without the use of parametric models.


# 318
Strong Double Blind
MeshFeat: Multi-Resolution Features for Neural Fields on Meshes

Mihir Mahajan · Florian Hofherr · Daniel Cremers

Parametric feature grid encodings have gained significant attention as an encoding approach for intrinsic neural fields since they allow for much smaller MLPs which decreases the inference time of the models significantly. In this work, we propose MeshFeat, a parametric feature encoding tailored to meshes, for which we adapt the idea of multi-resolution feature grids from Euclidean space. We start from the structure provided by the given vertex topology and use a mesh simplification algorithm to construct a multi-resolution feature representation directly on the mesh. The approach allows the usage of small MLPs for neural fields on meshes, and we show a significant speed-up compared to previous representations while maintaining comparable reconstruction quality for texture reconstruction and BRDF representation. Given its intrinsic coupling to the vertices, the method is particularly well-suited for representations on deforming meshes, making it a good fit for object animation.


# 222
Real-time 3D-aware Portrait Editing from a Single Image

Qingyan Bai · Zifan Shi · Yinghao Xu · Hao Ouyang · Qiuyu Wang · Ceyuan Yang · Xuan Wang · Gordon Wetzstein · Yujun Shen · Qifeng Chen

This work presents 3DPE, a practical method that can efficiently edit a face image following given prompts, like reference images or text descriptions, in a 3D-aware manner. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and superior editing capability, respectively. Such a design brings two compelling advantages over existing approaches. First, our system achieves real-time editing with a feedforward network (i.e., ∼0.04s per image), over 100× faster than the second competitor. Second, thanks to the powerful priors, our module could focus on the learning of editing-related variations, such that it manages to handle various types of editing simultaneously in the training phase and further supports fast adaptation to user-specified customized types of editing during inference (e.g., with ∼5min fine-tuning per style). The code, the model, and the interface will be made publicly available to facilitate future research.


# 226
An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes

Zhengyi Zhao · Chen Song · Xiaodong Gu · Yuan Dong · Qi Zuo · Weihao Yuan · Zilong Dong · Liefeng Bo · Qixing Huang

A fundamental problem in the texturing of 3D meshes using pre-trained text-to-image models is to ensure multi-view consistency. State-of-the-art approaches typically use diffusion models to aggregate multi-view inputs, where common issues are the blurriness caused by the averaging operation in the aggregation step or inconsistencies in local features. This paper introduces an optimization framework that proceeds in four stages to achieve multi-view consistency. Specifically, the first stage generates an over-complete set of 2D textures from a predefined set of viewpoints using an MV-consistent diffusion process. The second stage selects a subset of views that are mutually consistent while covering the underlying 3D model. We show how to achieve this goal by solving semi-definite programs. The third stage performs non-rigid alignment to align the selected views across overlapping regions. The fourth stage solves an MRF problem to associate each mesh face with a selected view. In particular, the third and fourth stages are iterated, with the cuts obtained in the fourth stage encouraging non-rigid alignment in the third stage to focus on regions close to the cuts. Experimental results show that our approach significantly outperforms baseline approaches both qualitatively and quantitatively.


# 218
RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Qi Wang · Ruijie Lu · Xudong XU · Jingbo Wang · Michael Yu Wang · Bo Dai · Gang Zeng · Dan Xu

The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes.In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which is regarded as the coarse reference to ensure the global texture consistency.In the fine stage, based on the panoramic image and perspective depth maps, RoomTex will refine and texture every single object in the room iteratively along a series of selected camera views, until this object is completely painted.Moreover, we propose to maintain superior alignment between RGB and depth spaces via subtle edge detection methods. Extensive experiments show our method is capable of generating high-quality and diverse room textures, and more importantly, supporting interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input.


# 212
Scene-Conditional 3D Object Stylization and Composition

Jinghao Zhou · Tomas Jakab · Philip Torr · Christian Rupprecht

Recently, 3D generative models have made impressive progress, enabling the generation of almost arbitrary 3D assets from text or image inputs. However, these approaches generate objects in isolation without any consideration for the scene where they will eventually be placed. In this paper, we propose a framework that allows for the stylization of an existing 3D asset to fit into a given 2D scene, and additionally produce a photorealistic composition as if the asset was placed within the environment. This not only opens up a new level of control for object stylization, for example, the same assets can be stylized to reflect changes in the environment, such as summer to winter or fantasy versus futuristic settings—but also makes the object-scene composition more controllable. We achieve this by combining modeling and optimizing the object's texture and environmental lighting through differentiable ray tracing with image priors from pre-trained text-to-image diffusion models. We demonstrate that our method is applicable to a wide variety of indoor and outdoor scenes and arbitrary objects.


# 216
DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling

Haoran Li · Haolin Shi · Wenli Zhang · Wenjun Wu · Yong Liao · LIN WANG · Lik-Hang Lee · Peng Yuan Zhou

Text-to-3D scene generation holds immense potential for the gaming, film, and architecture sectors, increasingly capturing the attention of both academic and industry circles. Despite significant progress, current methods still struggle with maintaining high quality, consistency, and editing flexibility. In this paper, we propose DreamScene, a 3D Gaussian-based novel text-to-3D scene generation framework that leverages Formation Pattern Sampling (FPS) for core structuring, augmented with a strategic camera sampling and supported by holistic object-environment integration to overcome these hurdles. FPS, guided by the formation patterns of 3D objects, employs multi-timesteps sampling to quickly form semantically rich, high-quality representations, uses 3D Gaussian filtering for optimization stability, and leverages reconstruction techniques to generate plausible textures. The camera sampling strategy incorporates a progressive three-stage approach, specifically designed for both indoor and outdoor settings, to effectively ensure scene-wide 3D consistency. DreamScene enhances scene editing flexibility by combining objects and environments, enabling targeted adjustments. Extensive experiments showcase DreamScene's superiority over current state-of-the-art techniques, heralding its wide-ranging potential for diverse applications. Our project code will be made publicly available.


# 189
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Gwanghyun Kim · Hayeon Kim · Hoigi Seo · Dong Un Kang · Se Young Chun

Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempted to address training size limit only, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond token limit of diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining.


# 207
Strong Double Blind
Chains of Diffusion Models

Yanheng Wei · Lianghua Huang · Zhi-Fan Wu · Wei Wang · Yu Liu · Mingda Jia · Shuailei Ma

Recent generative models excel in creating high-quality single-human images but fail in complex multi-human scenarios, failing to capture accurate structural details like quantities, identity accuracy, layouts and postures. We introduce a novel approach, Chains, which enhances initial text prompts into detailed human conditions using a step-by-step process. Chains utilize a series of condition nodes—text, quantity, layout, skeleton, and 3D mesh—each undergoing an independent diffusion process. This enables high-quality human generation and advanced scene layout management in diffusion models. We evaluate Chains against a new benchmark for complex multi-human scene synthesis, showing superior performance in human quality and scene accuracy over existing methods. Remarkably, Chains achieves this with under 0.45 seconds for a 20-step inference, demonstrating both effectiveness and efficiency.


# 317
NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation

Ruikai Cui · Weizhe Liu · Weixuan Sun · Senbo Wang · Taizhang Shang · Yang Li · Xibin Song · Han Yan · ZHENNAN WU · Shenzhou Chen · HONGDONG LI · Pan Ji

3D shape generation aims to produce innovative 3D content adhering to specific conditions and constraints. Existing methods often decompose 3D shapes into a sequence of localized components, treating each element in isolation without considering spatial consistency. As a result, these approaches exhibit limited versatility in 3D data representation and shape generation, hindering their ability to generate highly diverse 3D shapes that comply with the specified constraints. In this paper, we introduce a novel spatial-aware 3D shape generation framework that leverages 2D plane representations for enhanced 3D shape modeling. To ensure spatial coherence and reduce memory usage, we incorporate a hybrid shape representation technique that directly learns a continuous signed distance field representation of the 3D shape using orthogonal 2D planes. Additionally, we meticulously enforce spatial correspondences across distinct planes using a transformer-based autoencoder structure, promoting the preservation of spatial relationships in the generated 3D shapes. This yields an algorithm that consistently outperforms state-of-the-art 3D shape generation methods on various tasks, including unconditional shape generation, multi-modal shape completion, single-view reconstruction, and text-to-shape synthesis.


# 316
Strong Double Blind
Learning Neural Deformation Representation for 4D Dynamic Shape Generation

Gyojin Han · Jiwan Hur · Jaehyun Choi · Junmo Kim

Recent developments in 3D shape representation opened new possibilities for generating detailed 3D shapes. Despite these advances, there are few studies dealing with the generation of 4D dynamic shapes that have the form of 3D objects deforming over time. To bridge this gap, we focus on generating 4D dynamic shapes with an emphasis on both generation quality and efficiency in this paper. HyperDiffusion, a previous work on 4D generation, proposed a method of directly generating the weight parameters of 4D occupancy fields but suffered from low temporal consistency and slow rendering speed due to motion representation that is not separated from the shape representation of 4D occupancy fields. Therefore, we propose a new neural deformation representation and combine it with conditional neural signed distance fields to design a 4D representation architecture in which the motion latent space is disentangled from the shape latent space. The proposed deformation representation, which works by predicting skinning weights and transformation matrices for multiple parts, also has advantages over the deformation modules of existing 4D representations in understanding the structure of shapes. In addition, we design a training process of a diffusion model that utilizes the shape and motion features that are extracted by our 4D representation as data points. The results of unconditional generation, conditional generation, and motion retargeting experiments demonstrate that our method not only shows better performance than previous works in 4D dynamic shape generation but also has various potential applications. The project page for this paper is available at https://anonymous-pp.github.io/.


# 200
Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Choi Yisol · Sangkyung Kwak · Kyungmin Lee · Hyungwon Choi · Jinwoo Shin

This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available in our project page.


# 235
Strong Double Blind
Towards High-Quality 3D Motion Transfer with Realistic Apparel Animation

Rong Wang · Wei Mao · Changsheng Lu · HONGDONG LI

Animating stylized characters to match a reference motion sequence is a highly demanded task in film and gaming industries. Existing methods mostly focus on rigid deformations of characters' body, neglecting local deformations on the apparel driven by physical dynamics. They deform apparel the same way as the body, leading to results with limited details and unrealistic artifacts, e.g. body-apparel penetration. In contrast, we present a novel method aiming for high-quality motion transfer with realistic apparel animation. As existing datasets lack annotations necessary for generating realistic apparel animations, we build a new dataset named MMDMC, which combines stylized characters from the MikuMikuDance community with real-world Motion Capture data. We then propose a data-driven pipeline that learns to disentangle body and apparel deformations via two neural deformation modules. For body parts, we propose a geodesic attention block to effectively incorporate semantic priors into skeletal body deformation to tackle complex body shapes for stylized characters. Since apparel motion can significantly deviate from respective body joints, we propose to model apparel deformation in a non-linear vertex displacement field conditioned on its historic states. Extensive experiments show that our method not only produces superior results but also generalizes to various types of apparel.


# 211
GIVT: Generative Infinite-Vocabulary Transformers

Michael Tschannen · Cian Eastwood · Fabian Mentzer

We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized latent sequences of a \beta-VAE. GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT in class-conditional image generation, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.


# 320
Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians

Licheng Zhong · Hong-Xing Yu · Jiajun Wu · Yunzhu Li

Reconstructing and simulating elastic objects from visual observations is crucial for applications in computer vision and robotics. Existing methods, such as 3D Gaussians, provide modeling for 3D appearance and geometry but lack the ability to simulate physical properties or optimize parameters for heterogeneous objects. We propose Spring-Gaus, a novel framework that integrates 3D Gaussians with physics-based simulation for reconstructing and simulating elastic objects from multi-view videos. Our method utilizes a 3D Spring-Mass model, enabling the optimization of physical parameters at the individual point level while decoupling the learning of physics and appearance. This approach achieves great sample efficiency, enhances generalization, and reduces sensitivity to the distribution of simulation particles. We evaluate Spring-Gaus on both synthetic and real-world datasets, demonstrating accurate reconstruction and simulation of elastic objects. This includes future prediction and simulation under varying initial states and environmental parameters.


# 221
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Ning Yu · Chia-Chih Chen · Zeyuan Chen · Rui Meng · Gang Wu · Paul W Josel · Juan Carlos Niebles · Caiming Xiong · Ran Xu

Graphic layout designs play an essential role in visual communication. Yet handcrafting layout designs is skill-demanding, time-consuming, and non-scalable to batch production. Generative models emerge to make design automation scalable but it remains non-trivial to produce designs that comply with designers' multimodal desires, i.e., constrained by background images and driven by foreground content. We propose LayoutDETR that inherits the high quality and realism from generative modeling, while reformulating content-aware requirements as a detection problem: we learn to detect in a background image the reasonable locations, scales, and spatial relations for multimodal foreground elements in a layout. Our solution sets a new state-of-the-art performance for layout generation on public benchmarks and on our newly-curated ad banner dataset. We integrate our solution into a graphical system that facilitates user studies, and show that users prefer our designs over baselines by significant margins. Code and demo video of the graphical system are in the supplementary material. We will release our models and dataset.


# 242
ZigMa: A DiT-style Zigzag Mamba Diffusion Model

Tao Hu · Stefan Andreas Baumann · Ming Gui · Olga Grebenkova · Pingchuan Ma · Johannes Schusterbauer-Fischer · Bjorn Ommer

The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be available at https://anonymous.4open.science/r/sim_anonymous-4C27/README.md


# 289
Strong Double Blind
Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems

Hyungjin Chung · Jong Chul Ye

Recent inverse problem solvers that leverage generative diffusion priors have garnered significant attention due to their exceptional quality. However, adaptation of the prior is necessary when there exists a discrepancy between the training and testing distributions. In this work, we propose deep diffusion image prior (DDIP), which generalizes the recent adaptation method of SCD by introducing a formal connection to the deep image prior. Under this framework, we propose an efficient adaptation method dubbed D3IP, specified for 3D measurements, which accelerates DDIP by orders of magnitude while achieving superior performance. D3IP enables seamless integration of 3D inverse solvers and thus leads to coherent 3D reconstruction. Moreover, we show that meta-learning techniques can also be applied to yield even better performance. We show that our method is capable of solving diverse 3D reconstructive tasks from the generative prior trained only with phantom images that are vastly different from the training set, opening up new opportunities of applying diffusion inverse solvers even when training with gold standard data is impossible.


# 312
Strong Double Blind
Neural Surface Detection for Unsigned Distance Fields

Federico Stella · Nicolas Talabot · Hieu Le · Pascal Fua

Extracting surfaces from Signed Distance Fields (SDFs) can be accomplished using traditional algorithms, such as Marching Cubes. However, since they rely on sign flips across the surface, these algorithms cannot be used directly on Unsigned Distance Fields (UDFs). In this work, we introduce a deep-learning approach to taking a UDF and turning it locally into an SDF, so that it can be effectively triangulated using existing algorithms. We show that it achieves better accuracy in surface localization than existing methods. Furthermore it generalizes well to unseen shapes and datasets, while being parallelizable. We also demonstrate the flexibily of the method by using it in conjunction with DualMeshUDF, a state of the art dual meshing method that can operate on UDFs, improving its results and removing the need to tune its parameters.


# 294
VF-NeRF: Viewshed Fields for Rigid NeRF Registration

Leo Segre · Shai Avidan

3D scene registration is a fundamental problem in computer vision that seeks the best 6-DoF alignment between two scenes. This problem was extensively investigated in the case of point clouds and meshes, but there has been relatively limited work regarding Neural Radiance Fields (NeRF). In this paper, we consider the problem of rigid registration between two NeRFs when the position of the original cameras is not given. Our key novelty is the introduction of Viewshed Fields (VF), an implicit function that determines, for each 3D point, how likely it is to be viewed by the original cameras. We demonstrate how VF can help in the various stages of NeRF registration, with an extensive evaluation showing that VF-NeRF achieves SOTA results on various datasets with different capturing approaches such as LLFF and Objaverese. Our code will be made publicly available.


# 155
Strong Double Blind
Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration

Xueyang Kang · Zhaoliang Luan · Kourosh Khoshelham · Bing Wang

Point cloud registration is a foundational task crucial for 3D alignment and reconstruction applications. While both traditional and learning-based registration approaches have achieved significant success, the leverage of intrinsic symmetry within input point cloud data often receives insufficient attention, which prohibits the model from sample leveraging and learning efficiency, leading to an increase in data size and model complexity. To address these challenges, we propose a dedicated graph neural network model embedded with a built-in Spherical Euclidean 3D equivariance property achieved through SE(3) node features and message passing equivariance. Pairwise input feature embeddings are derived from sparsely downsampled input point clouds, each with several orders of magnitude less point than two raw input point. These embeddings form a rigidity graph capturing spatial relationships, which is subsequently pooled into global features followed by cross-attention mechanisms, and finally decoded into the regression pose between point clouds. Experiments conducted on the 3DMatch and KITTI datasets exhibits the compelling and distinctive performance of our model compared to state-of-the-art approaches. By harnessing the equivariance properties inherent in the data, our model exhibits a notable reduction in the required input points during training when compared to existing approaches relying on dense input points. Moreover, our model eliminates the necessity for the removal or modeling of outliers present in dense input points, which simplifies the pre-processing pipeline of point cloud.


# 285
Strong Double Blind
Transferable 3D Adversarial Shape Completion using Diffusion Models

Xuelong Dai · Bin Xiao

Recent studies that incorporate geometric features and transformers into 3D point cloud feature learning have significantly improved the performance of 3D deep-learning models. However, their robustness against adversarial attacks has not been thoroughly explored. Existing attack methods primarily focus on white-box scenarios and struggle to transfer to recently proposed 3D deep-learning models. Even worse, these attacks introduce perturbations to 3D coordinates, generating unrealistic adversarial examples and resulting in poor performance against 3D adversarial defenses. In this paper, we generate high-quality adversarial point clouds using diffusion models. By using partial points as prior knowledge, we generate realistic adversarial examples through shape completion with adversarial guidance. The proposed adversarial shape completion allows for a more reliable generation of adversarial point clouds. To enhance attack transferability, we delve into the characteristics of 3D point clouds and employ model uncertainty for better inference of model classification through random down-sampling of point clouds. We adopt ensemble adversarial guidance for improved transferability across different network architectures. To maintain the generation quality, we limit our adversarial guidance solely to the critical points of the point clouds by calculating saliency scores. Extensive experiments demonstrate that our proposed attacks outperform state-of-the-art adversarial attack methods against both black-box models and defenses. Our black-box attack establishes a new baseline for evaluating the robustness of various 3D point cloud classification models.


# 286
Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

Shentong Mo · Enze Xie · Yue Wu · Junsong Chen · Matthias Niessner · ZHENGUO LI

Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs. Specifically, we draw inspiration from masked autoencoders to dynamically operate the denoising process on masked voxelized point clouds. We also propose a novel voxel-aware masking strategy to adaptively aggregate background/foreground information from voxelized point clouds. Our method achieves state-of-the-art performance with an extreme masking ratio of nearly 99%. Moreover, to improve multi-category 3D generation, we introduce Mixture-of-Expert (MoE) in 3D diffusion model. Each category can learn a distinct diffusion path with different experts, relieving gradient conflict. Experimental results on the ShapeNet dataset demonstrate that our method achieves state-of-the-art high-fidelity and diverse 3D point cloud generation performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage metrics when generating 128-resolution voxel point clouds, using only 6.5% of the original training cost.


# 291
Strong Double Blind
PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training

SUYI CHEN · Hao Xu · Haipeng Li · Kunming Luo · Guanghui Liu · Chi-Wing Fu · Ping Tan · Shuaicheng Liu

Data plays a crucial role in training learning-based methods for 3D point cloud registration. However, the real-world dataset is expensive to build, while rendering-based synthetic data suffers from domain gaps. In this work, we present PointGPT, boosting 3D Point cloud registration using Generative Point-cloud pairs for Training. Given a single depth map, we first apply a random camera motion to re-project it into a target depth map. Converting them to point clouds gives a training pair. To enhance the data realism, we formulate a generative model as a depth inpainting diffusion to process the target depth map with the re-projected source depth map as the condition. Also, we design a depth correction module to alleviate artifacts caused by point penetration during the re-projection. To our knowledge, this is the first generative approach that explores realistic data generation for indoor point cloud registration. When equipped with our approach, several recent algorithms can improve their performance significantly and achieve SOTA consistently on two common benchmarks. The code and dataset will be released upon publication.


# 281
Progressive Classifier and Feature Extractor Adaptation for Unsupervised Domain Adaptation on Point Clouds

Zicheng Wang · Zhen Zhao · Yiming Wu · Luping Zhou · Dong Xu

Unsupervised domain adaptation (UDA) is a critical challenge in the field of point cloud analysis, as models trained on one set of data often struggle to perform well in new scenarios due to large domain shifts. Previous works tackle the problem either by feature extractor adaptation to enable a shared classifier to distinguish domain-invariant features, or by classifier adaptation to evolve the classifier to recognize target-styled source features to increase its adaptation ability. However, by learning domain-invariant features, feature extractor adaptation methods fail to encode semantically meaningful target-specific information, while classifier adaptation methods rely heavily on the accurate estimation of the target distribution. In this work, we propose a novel framework that deeply couples the classifier and feature extractor adaption for 3D UDA, dubbed Progressive Classifier and Feature Extractor Adaptation (PCFEA). Our PCFEA conducts 3D UDA from two distinct perspectives: macro and micro levels. On the macro level, we propose a progressive target-styled feature augmentation (PTFA) that establishes a series of intermediate domains to enable the model to progressively adapt to the target domain. Throughout this process, the source classifier is evolved to recognize target-styled source features (\ie, classifier adaptation). On the micro level, we develop an intermediate domain feature extractor adaptation (IDFA) that performs a compact feature alignment to encourage the target-styled feature extraction gradually. In this way, PTFA and IDFA can mutually benefit each other: IDFA contributes to the distribution estimation of PTFA while PTFA constructs smoother intermediate domains to encourage an accurate feature alignment of IDFA. We validate our method on popular benchmark datasets, where our method achieves new state-of-the-art performance.


# 280
Domain Generalization of 3D Object Detection by Density-Resampling

Shuangzhi Li · Lei Ma · Xingyu Li

Point-cloud-based 3D object detection suffers from performance degradation when encountering data with novel domain gaps. To tackle it, single-domain generalization (SDG) aims to generalize the detection model trained in a limited single source domain to perform robustly on unexplored domains. Through analysis of errors and missed detections in 3D point clouds, it has become evident that challenges predominantly arise from variations in point cloud density, especially the sparsity of point cloud data. Thus in this paper, we propose an SDG method centered around the theme of point cloud density resampling, which involves using data augmentation to simulate point clouds of different densities and developing a novel point cloud densification algorithm to enhance the detection accuracy of low-density point clouds. Specifically, our physical-aware density-resampling data augmentation (PDDA) is the first to consider the physical constraints on point density distribution in data augmentation, leading to a more realistic simulation of variation in cloud density. In systematic design, an auxiliary self-supervised point cloud densification task is incorporated into the detection framework, forming a basis for test-time model update. By manipulating point cloud density, our method not only increases the model's adaptability to point clouds of different densities but also allows the self-supervised densification algorithm to serve as a metric for assessing the model's understanding of the environment and semantic information. This, in turn, enables a test-time adjustment of the model to better adapt to varying domains. Extensive cross-dataset experiments covering "Car", "Pedestrian", and "Cyclist" detections demonstrate our method outperforms state-of-the-art SDG methods and even overpass unsupervised domain adaptation methods under some circumstances. The code will be made publicly available.


# 290
Strong Double Blind
Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds

Yanni Ma · Hao Liu · Yun Pei · Yulan Guo

3D Scene Graph Prediction (SGP) aims to recognize the objects and predict their semantic and spatial relationships in a 3D scene. Existing methods either exploit context information or emphasize knowledge prior to model the scene graph in a fully-connected homogeneous graph framework. However, these methods may lead to indiscriminate message passing among graph nodes (i.e., objects), and thus obtain sub-optimal performance. In this paper, we propose a 3D heterogeneous scene graph prediction (3D-HetSGP) framework, which performs graph reasoning on the 3D scene graph in a heterogeneous fashion. Specifically, our method consists of two stages: a heterogeneous graph structure learning (HGSL) stage and a heterogeneous graph reasoning (HRG) stage. In the HGSL stage, we learn the graph structure by predicting the types of different directed edges. In the HRG stage, message passing among nodes is performed on the learned graph structure for scene graph prediction. Extensive experiments show that our method achieves comparable or superior performance to existing methods on 3DSSG. The code will be released after the acceptance of the paper.


# 322
Strong Double Blind
Physics-informed Knowledge Transfer for Underwater Monocular Depth Estimation

Jinghe Yang · Mingming Gong · Ye Pu

Compared to the in-air case, underwater depth estimation has its own challenges. For instance, acquiring high-quality training datasets with groundtruth poses difficulties due to sensor limitations in aquatic environments. Additionally, the physics characteristics of underwater imaging diverge significantly from the in-air case, the methods developed for in-air depth estimation underperform when applied underwater, due to the domain gap. To address these challenges, our paper introduces a novel transfer-learning-based method - Physics-informed Underwater Depth Estimation (\textbf{PUDE}). The key idea is to transfer the knowledge of a pre-trained in-air depth estimation model to underwater settings utilizing a small underwater image set without groundtruth measurement, guided by a physical model for underwater image formation. We propose novel bound losses based on the physical model to rectify the depth estimations to align with actual underwater physical properties. Finally, in the evaluations across multiple datasets, we compare our Physics-informed Underwater Depth Estimation (\textbf{PUDE}) model with other existing in-air and underwater methods. The results reveal that the \textbf{PUDE} model excels in both quantitative and qualitative comparisons.


# 295
Strong Double Blind
Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Yuanwen Yue · Anurag Das · Francis Engelmann · Siyu Tang · Jan Eric Lenssen

Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D knowledge into a 2D foundation model. We demonstrate that models fine-tuned in that way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, though fined-tuned on a single indoor dataset, the improvement is transferable to a variety of indoor datasets and out-of-domain datasets. We hope our study encourages the community to consider injecting 3D knowledge when training 2D foundation models. Code will be released upon acceptance of the paper.


# 287
Strong Double Blind
SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images

Nir Barel · Ron Aharon Shapira Weber · Nir Mualem · Shahaf Finder · Oren Freifeld

The unsupervised task of Joint Alignment (JA) of images is complicated by several hurdles such as high complexity, geometric distortions, and convergence to poor local optima or (due to undesired trivial solutions) even poor global optima. While it was recently shown that Vision Transformers (ViT) features are useful for, among other things, JA, it is important to understand that, by themselves, these features do not completely eliminate the aforementioned issues. Thus, even with ViT features, researchers opt to tackle JA using both 1) expensive models and 2) numerous regularization terms. Unfortunately, that approach leads to long training times and hard-to-tune (and dataset-specific) regularization hyperparameters. Our approach is different. We introduce the Spatial Joint Alignment Model (SpaceJAM), a lightweight method that simplifies JA considerably. Specifically, the method consists of a novel loss function, a Lie-algebraic parametrization, an efficient handling of flips, a small autoencoder, and a (recurrent) small Spatial Transformer Net. The entire model has only $\sim$16K trainable parameters. Of note, the regularization-free SpaceJAM obviates the need for explicitly maintaining an atlas during training. Optionally, and after solving the JA, SpaceJAM can generate an atlas in a single forward pass. Evaluated on the SPair-71K and CUB datasets, and compared to existing methods, SpaceJAM achieves better alignment capabilities and does so with orders of magnitude fewer trainable parameters and at least a 10x speedup in training time. Our code will be released upon acceptance.


# 298
3D Congealing: 3D-Aware Image Alignment in the Wild

Yunzhi Zhang · Zizhang Li · Amit Raj · Andreas Engelhardt · Yuanzhen Li · Tingbo Hou · Jiajun Wu · Varun Jampani

We propose 3D Congealing, a novel problem of 3D-aware alignment for 2D images capturing semantically similar objects. Given a collection of unlabeled Internet images, our goal is to associate the shared semantic parts from the inputs and aggregate the knowledge from 2D images to a shared 3D canonical space. We introduce a general framework that tackles the task without assuming shape templates, poses, or any camera parameters. At its core is a canonical 3D representation that encapsulates geometric and semantic information. The framework optimizes for the canonical representation together with the pose for each input image, and a per-image coordinate map that warps 2D pixel coordinates to the 3D canonical frame to account for the shape matching. The optimization procedure fuses prior knowledge from a pre-trained image generative model and semantic information from input images. The former provides strong knowledge guidance for this under-constraint task, while the latter provides the necessary information to mitigate the training data bias from the pre-trained model. Our framework can be used for various tasks such as correspondence matching, pose estimation, and image editing, achieving strong results on real-world image datasets under challenging illumination conditions and on in-the-wild online image collections.


# 292
Strong Double Blind
Reprojection Errors as Prompts for Efficient Scene Coordinate Regression

Ting-Ru Liu · Hsuan-Kung Yang · Jou-Min Liu · Chun-Wei Huang · Tsung-Chih Chiang · Quan Kong · Norimasa Kobori · Chun-Yi Lee

Scene coordinate regression (SCR) methods have emerged as a promising area of research due to their potential for accurate visual localization. However, many existing SCR approaches train on samples from all image regions, including dynamic objects and texture-less areas. Utilizing these areas for optimization during training can potentially hamper the overall performance and efficiency of the model. In this study, we first perform an in-depth analysis to validate the adverse impacts of these areas. Drawing inspiration from our analysis, we then introduce an error-guided feature selection (EGFS) mechanism, in tandem with the use of the Segment Anything Model (SAM). This mechanism seeds low reprojection areas as prompts and expands them into error-guided masks, and then utilizes these masks to sample points and filter out problematic areas in an iterative manner. The experiments demonstrate that our method outperforms existing SCR approaches that do not rely on 3D information on the Cambridge Landmarks and Indoor6 datasets.


# 311
Strong Double Blind
Revisiting Calibration of Wide-Angle Radially Symmetric Cameras

Andrea Porfiri Dal Cin · Francesco Azzoni · Giacomo Boracchi · Luca Magri

Recent learning-based calibration methods yield promising results in estimating parameters for wide field-of-view cameras from single images. Yet, these end-to-end approaches are typically tethered to one fixed camera model, leading to issues: i) lack of flexibility, necessitating network architectural changes and retraining when changing camera models; ii) reduced accuracy, as a single model, limits the diversity of cameras represented in the training data; iii) restrictions in camera model selection, as learning-based methods need differentiable loss functions and, thus, undistortion equations with closed-form solutions. In response, we present a novel two-step calibration framework for radially symmetric cameras. Key to our approach is a specialized CNN that, given an input image, outputs an implicit camera representation (VaCR), mapping each image point to the direction of the 3D ray of light projecting onto it. The VaCR is used in a subsequent robust non-linear optimization process to determine the camera parameters for any radially symmetric model provided as input. By disentangling the estimation of camera model parameters from the VaCR, which is based only on the assumption of radial symmetry in the model, we overcome the main limitations of end-to-end approaches. Experimental results demonstrate the advantages of the proposed framework compared to state-of-the-art methods. Code is available at removed for blind review.


# 310
RGBD GS-ICP SLAM

Seongbo Ha · Jiung Yeon · Hyeonwoo Yu

Simultaneous Localization and Mapping (SLAM) with dense representation plays a key role in robotics, Virtual Reality (VR), and Augmented Reality (AR) applications. Recent advancements in dense representation SLAM have highlighted the potential of leveraging neural scene representation and 3D Gaussian representation for high-fidelity spatial representation. In this paper, we propose a novel dense representation SLAM approach with a fusion of Generalized Iterative Closest Point (G-ICP) and 3D Gaussian Splatting (3DGS). In contrast to existing methods, we utilize a single Gaussian map for both tracking and mapping, resulting in mutual benefits. Through the exchange of covariances between tracking and mapping processes with scale alignment techniques, we minimize redundant computations and achieve an efficient system. Additionally, we enhance tracking accuracy and mapping quality through our keyframe selection methods. Experimental results demonstrate the effectiveness of our approach, showing an incredibly fast speed up to 107 FPS (for the entire system) and superior quality of the reconstructed map.


# 309
FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos

Florian Langer · Jihong Ju · Georgi Dikov · Gerhard Reitmayr · Mohsen Ghafoorian

Digitising the 3D world into a clean, CAD model-based representation has important applications for augmented reality and robotics. Current state-of-the-art methods are computationally intensive as they individually encode each detected object and optimise CAD alignments in a second stage. In this work, we propose FastCAD, a real-time method that simultaneously retrieves and aligns CAD models for all objects in a given scene. In contrast to previous works, we directly predict alignment parameters and shape embeddings. We achieve high-quality shape retrievals by learning CAD embeddings in a contrastive learning framework and distilling those into FastCAD. Our single-stage method accelerates the inference time by a factor of 50 compared to other methods operating on RGB-D scans while outperforming them on the challenging Scan2CAD alignment benchmark. Further, our approach collaborates seamlessly with online 3D reconstruction techniques. This enables the real-time generation of precise CAD model-based reconstructions from videos at 10 FPS. Doing so, we significantly improve the Scan2CAD alignment accuracy in the video setting from 43.0% to 48.2% and the reconstruction accuracy from 22.9% to 29.6%.


# 300
GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence

Pengyuan Wang · Takuya Ikeda · Robert Lee · Koichi Nishiwaki

Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics. Recently, deep-learning-based approaches have made great progress, but are typically hindered by the need for large datasets of either pose-labelled real images or carefully tuned photorealistic simulators. This can be avoided by using only geometry inputs such as depth images to reduce the domain-gap but these approaches suffer from a lack of semantic information, which can be vital in the pose estimation problem. To resolve this conflict, we propose to utilize both geometric and semantic features obtained from a pre-trained foundation model. Our approach projects 2D semantic features into object models as 3D semantic point clouds. Based on the novel 3D representation, we further propose a self-supervision pipeline, and match the fused semantic point clouds against their synthetic rendered partial observations from synthetic object models. The learned knowledge from synthetic data generalizes to observations of unseen objects in the real scenes, without any fine-tuning. We demonstrate this with a rich evaluation on the NOCS, Wild6D and SUN RGB-D benchmarks, showing superior performance over geometric-only and semantic-only baselines with significantly fewer training objects.


# 296
Strong Double Blind
Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation

Mengchen Zhang · Tong Wu · Tai Wang · Tengfei Wang · Ziwei Liu · Dahua Lin

6D object pose estimation aims at determining an object's translation, rotation, and scale, typically from a single RGBD image. Recent advancements have expanded this estimation from instance-level to category-level, allowing models to generalize across unseen instances within the same category. However, this generalization is limited by the narrow range of categories covered by existing datasets, such as NOCS, which also tend to overlook common real-world challenges like occlusion. To tackle these challenges, we introduce Omni6D, a comprehensive RGBD dataset featuring a wide range of categories and varied backgrounds, elevating the task to a more realistic context. (1) The dataset comprises an extensive spectrum of 166 categories, 4688 instances adjusted to the canonical pose, and over 0.8 million captures, significantly broadening the scope for evaluation. (2) We introduce a symmetry-aware metric and conduct systematic benchmarks of existing algorithms on Omni6D, offering a thorough exploration of new challenges and insights. (3) Additionally, we propose an effective fine-tuning approach that adapts models from previous datasets to our extensive vocabulary setting. We believe this initiative will pave the way for new insights and substantial progress in both the industrial and academic fields, pushing forward the boundaries of general 6D pose estimation.


# 304
Strong Double Blind
Rotated Orthographic Projection for Self-Supervised 3D Human Pose Estimation

YAO YAO · Yixuan Pan · Wenjun Shi · Dongchen Zhu · Lei Wang · Jiamao Li

Reprojection consistency is widely used for self-supervised 3D human pose estimation. However, few efforts have been made to address the inherent limitations of reprojection consistency. Lacking camera parameters and absolute position, self-supervised methods map 3D poses to 2D using orthographic projection, whereas 2D inputs are derived from perspective projection. This discrepancy among the projection models creates an offset between the supervision and prediction spaces, limiting the performance potential. To address this problem, we propose rotated orthographic projection, which achieves a geometric approximation of the perspective projection by adding a rotation operation before the orthographic projection. Further, we optimize the reference point selection according to the human body structure and propose group rotated orthogonal projection, which significantly narrows the gap between the two projection models. Meanwhile, the reprojection consistency loss fails to constrain the Z-axis reverse wrong pose in 3D space. Therefore, We introduce the joint reverse constraint to limit the range of angles between the local reference plane and the end joints, penalizing unrealistic 3D poses and clarifying the Z-axis orientation of the model. The proposed method achieves state-of-the-art (SOTA) performance on both Human3.6M and MPII-INF-3DHP datasets. Particularly, it reduces the mean error from 65.9mm to 42.9mm (34.9% improvement) over the SOTA self-supervised method on Human3.6M.


# 276
Diffusion Model is a Good Pose Estimator from 3D RF-Vision

Junqiao Fan · Jianfei Yang · Yuecong Xu · Lihua Xie

Human pose estimation (HPE) from Radio Frequency vision (RF-vision) performs human sensing using RF signals that penetrate obstacles without revealing privacy (e.g., facial information). Recently, mmWave radar has emerged as a promising RF-vision sensor, providing radar point clouds by processing RF signals. However, the mmWave radar has a limited resolution with severe noise, leading to inaccurate and inconsistent human pose estimation. This work proposes mmDiff, a novel diffusion-based pose estimator tailored for noisy radar data. Our approach aims to provide reliable guidance as conditions to diffusion models for HPE. Two key challenges are addressed by mmDiff: (1) miss-detection of parts of human bodies may happen, which is addressed by a module that isolates feature extraction from different body parts, and (2) signal inconsistency due to environmental interference, which is tackled by incorporating prior knowledge of body structure and motion. Several modules are designed to achieve these goals, whose features work as the conditions for the subsequent diffusion model, eliminating the miss-detection and instability of HPE based on RF-vision. Extensive experiments demonstrate that mmDiff outperforms existing methods significantly, achieving state-of-the-art performances on public datasets.


# 307
Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding

niloofar azizi · Mohsen Fayyaz · Horst Bischof

Understanding human behavior fundamentally relies on accurate 3D human pose estimation. Graph Convolutional Networks (GCNs) have recently shown promising advancements, delivering state-of-the-art performance with rather lightweight architectures. In the context of graph-structured data, leveraging the eigenvectors of the graph Laplacian matrix for positional encoding is effective. Yet, the approach does not specify how to handle scenarios where edges in the input graph are missing. To this end, we propose a novel positional encoding technique, perturbPE, that extracts consistent and regular components from the eigenbasis. Our method involves applying multiple perturbations and taking their average to extract the consistent and regular component from the eigenbasis. perturbPE leverages the Rayleigh-Schrodinger Perturbation Theorem (RSPT) for calculating the perturbed eigenvectors. Employing this labeling technique enhances the robustness and generalizability of the model. Our results support our theoretical findings, e.g. our experimental analysis observed a performance enhancement of up to 12% on the Human3.6M dataset in instances where occlusion resulted in the absence of one edge. Furthermore, our novel approach significantly enhances performance in scenarios where two edges are missing, setting a new benchmark for state-of-the-art.


# 303
Strong Double Blind
Coarse-to-Fine Implicit Representation Learning for 3D Hand-Object Reconstruction from a Single RGB-D Image

Xingyu Liu · Pengfei Ren · Jingyu Wang · Qi Qi · Haifeng Sun · Zirui Zhuang · Jianxin Liao

Recent research has explored implicit representations, such as signed distance function (SDF), for interacting hand-object reconstruction. SDF enables modeling hand-held objects with arbitrary topology and overcomes the resolution limitations of parametric models, allowing for finer-grained reconstruction. However, modeling elaborate SDF directly based on visual features faces challenges due to depth ambiguity and appearance similarity, especially in cluttered real-world scenes. In this paper, we propose a coarse-to-fine SDF framework for 3D hand-object reconstruction, which leverages the perceptual advantages of RGB-D modality in both visual and geometric aspects, to progressively model the implicit field. Initially, we model coarse-level SDF using global image features to achieve a holistic perception of 3D scenes. Subsequently, we propose a 3D Point-Aligned Implicit Function (3D PIFu) for fine-level SDF learning, which leverages the local geometric clues of the point cloud to capture intricate details. To facilitate the transition from coarse to fine, we extract hand-object semantics from the implicit field as prior knowledge. Additionally, we propose a surface-aware efficient reconstruction strategy that sparsely samples query points based on the hand-object semantic prior. Experiments on two challenging hand-object datasets show that our method outperforms existing methods by a large margin.


# 301
Strong Double Blind
3D Reconstruction of Objects in Hands without Real World 3D Supervision

Aditya Prakash · Matthew Chang · Matthew Jin · Ruisen Tu · Saurabh Gupta

Prior works for reconstructing hand-held objects from a single image train models on images paired with 3D shapes. Such data is challenging to gather in the real world at scale. Consequently, these approaches do not generalize well when presented with novel objects in in-the-wild settings. While 3D supervision is a major bottleneck, there is an abundance of a) in-the-wild raw video data showing hand-object interactions and b) synthetic 3D shape collections. In this paper, we propose modules to leverage 3D supervision from these sources to scale up the learning of models for reconstructing hand-held objects. Specifically, we extract multiview 2D mask supervision from videos and 3D shape priors from shape collections. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image. Our experiments in the challenging object generalization setting on in-the-wild MOW dataset show 11.6% relative improvement over models trained with 3D supervision on existing datasets. Code, data, and models will be made available upon acceptance.


# 305
Strong Double Blind
Weakly-Supervised 3D Hand Reconstruction with Knowledge Prior and Uncertainty Guidance

Yufei Zhang · Jeffrey Kephart · Qiang Ji

Acquiring 3D data for fully-supervised monocular 3D hand reconstruction is often difficult, requiring specialized equipment for use in a controlled environment. Yet, fundamental principles governing hand movements are well-established in the understanding of the human hand's unique structure and functionality. In this paper, we leverage these valuable foundational insights for the training of 3D hand reconstruction models. Specifically, we systematically study hand knowledge from different sources, including hand biomechanics, functional anatomy, and physics. We effectively incorporate them into the reconstruction models by introducing a set of differentiable training losses accordingly. Instead of relying on 3D supervision, we consider a challenging weakly-supervised setting, where models are trained solely using 2D hand landmark annotation, which is considerably more accessible in practice. Moreover, different from existing methods that neglect the inherent uncertainty in image observations, we explicit model the uncertainty. We enhance the training process by exploiting a simple yet effective Negative Log-Likelihood (NLL) loss, which incorporates the well-captured uncertainty into the loss function. Through extensive experiments, we demonstrate that our method significantly outperforms state-of-the-art weakly-supervised methods. For example, our method achieves nearly a 21% performance improvement on the widely adopted FreiHAND dataset.


# 306
Strong Double Blind
MANIKIN: Biomechanically Accurate Neural Inverse Kinematics for Human Motion Estimation

Jiaxi Jiang · Paul Streli · Xuejing Luo · Christoph Gebhardt · Christian Holz

Current Mixed Reality systems aim to estimate a user's full-body joint configurations from just the pose of the end effectors, primarily head and hand poses. Existing methods often involve solving inverse kinematics (IK) to obtain the full skeleton from just these sparse observations, usually directly optimizing the joint angle parameters of a human skeleton. Since this accumulates error through the kinematic tree, predicted end effector poses fail to align with the provided input pose, leading to discrepancies in predicted and actual hand positions or feet that penetrate the ground. In this paper, we first refine the commonly used SMPL parametric model by embedding anatomical constraints that reduce the degrees of freedom for specific parameters to more closely mirror human biomechanics. This ensures that our model leads to physically plausible pose predictions. We then propose NeCoIK, a fully differentiable neural constrained IK solver for full-body motion tracking using just a person's head and hand. NeCoIK is based on swivel angle prediction and perfectly matches input poses while avoiding ground penetration. We evaluate NeCoIK in extensive experiments on motion capture datasets and demonstrate that our method surpasses the state of the art in quantitative and qualitative results at fast inference speed.


# 302
Strong Double Blind
Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection

Kangqi Ma · Hao Dong · Yadong Mu

This paper addresses the challenge of robotic grasping of general objects. Similar to prior research, the task reads a single-view 3D observation (i.e., point clouds) captured by a depth camera as input. Crucially, the success of object grasping highly demands a comprehensive understanding of the shape of objects within the scene. However, single-view observations often suffer from occlusions (including both self and inter-object occlusions), which lead to gaps in the point clouds, especially in complex cluttered scenes. This renders incomplete perception of the object shape and frequently causes failures or inaccurate pose estimation during object grasping. In this paper, we tackle this issue with an effective albeit simple solution, namely completing grasping-related scene regions through local occupancy prediction. Following prior practice, the proposed model first runs by proposing a number of most likely grasp points in the scene. Around each grasp point, a module is designed to infer any voxel in its neighborhood to be either void or occupied by some object. Importantly, the occupancy map is inferred by fusing both local and global cues. We implement a multi-group tri-plane scheme for efficiently aggregating long-distance contextual information. The model further estimates 6-DoF grasp poses utilizing the local occupancy-enhanced object shape information and returns the top-ranked grasp proposal. Comprehensive experiments on both the large-scale GraspNet-1Billion benchmark and real robotic arm demonstrate that the proposed method can effectively complete the unobserved parts in cluttered and occluded scenes. Benefiting from the occupancy-enhanced feature, our model clearly outstrips other competing methods under various performance metrics such as grasping average precision.


# 299
GraspXL: Generating Grasping Motions for Diverse Objects at Scale

Hui Zhang · Sammy Christen · Zicong Fan · Otmar Hilliges · Jie Song

Human hands possess the dexterity to interact with diverse objects such as grasping specific parts of the objects and/or approaching them from desired directions. More importantly, humans can grasp objects of any shape without object-specific skills. Recent works synthesize grasping motions following single objectives such as a desired approach heading direction or a grasping area. Moreover, they usually rely on expensive 3D hand-object data during training and inference, which limits their capability to synthesize grasping motions for unseen objects at scale. In this paper, we unify the generation of hand-object grasping motions across multiple motion objectives, diverse object shapes and dexterous hand morphologies in a policy learning framework GraspXL. The objectives are composed of the graspable area, heading direction during approach, wrist rotation, and hand position. Without requiring any 3D hand-object interaction data, our policy trained with 58 objects can robustly synthesize diverse grasping motions for more than 500k unseen objects with a success rate of 82.2%. At the same time, the policy adheres to objectives, which enables the generation of diverse grasps per object. Moreover, we show that our framework can be deployed to different dexterous hands and work with reconstructed or generated objects. We quantitatively and qualitatively evaluate our method to show the efficacy of our approach. Our model and code will be available.


# 315
Strong Double Blind
HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos

Lixin Xue · Chen Guo · Chengwei Zheng · Fangjinhua Wang · Tianjian Jiang · Hsuan-I Ho · Manuel Kaufmann · Jie Song · Otmar Hilliges

An overarching goal for computer-aided perception systems is the holistic understanding of the human-centric 3D world, including faithful reconstructions of humans, scenes, and their global spatial relationships. While recent progress in monocular 3D reconstruction has been made for footage of either humans or scenes alone, jointly reconstructing both humans and scenes along with their global spatial information remains an unsolved challenge. To address this challenge, we introduce a novel and unified framework that simultaneously achieves temporally and spatially coherent 3D reconstruction of a static scene with dynamic humans from a monocular RGB video. Specifically, we parameterize temporally consistent canonical human models and static scene representations using two neural fields in a shared 3D space. We further develop a global optimization framework that considers physical constraints imposed by potential human-scene interpenetration and occlusion. Compared to independent reconstructions, our framework enables detailed and holistic geometry reconstructions of both humans and scenes. Moreover, we introduce a new synthetic dataset for quantitative evaluations. Extensive experiments and ablation studies on both real-world and synthetic videos demonstrate the efficacy of our framework in monocular human-scene reconstruction. Code and data are publicly available on our \href{https://lxxue.github.io/human-scene-recon/}{project page}.


# 321
Strong Double Blind
Object-Aware NIR-to-Visible Translation

Yunyi Gao · Lin Gu · Qiankun Liu · Ying Fu

While near-infrared (NIR) imaging technology is essential for assisted driving and safety monitoring systems, its monochromatic nature and detail limitations hinder its broader application, which prompts the development of NIR-to-visible translation tasks. However, the performance of existing translation methods is limited by the neglected disparities between NIR and visible imaging and the lack of paired training data. To address these challenges, we propose a novel object-aware framework for NIR-to-visible translation. Our approach decomposes the visible image recovery into object-independent luminance sources and object-specific reflective components, processing them separately to bridge the gap between NIR and visible imaging under various lighting conditions. Leveraging prior segmentation knowledge enhances our model's ability to identify and understand the separated object reflection. We also collect the Fully Aligned NIR-Visible Image Dataset, a large-scale dataset comprising fully matched pairs of NIR and visible images captured with a multi-sensor coaxial camera. Empirical evaluations demonstrate the superiority of our approach over existing methods, producing visually compelling results on mainstream datasets.


# 323
Strong Double Blind
SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models

Dongseok Shim · Hyoun Jin Kim

In monocular depth estimation, it is challenging to acquire a large amount of depth-annotated training data, which leads to a reliance on synthetic datasets. However, the inherent discrepancies between the synthetic environment and the real-world result in a domain shift and sub-optimal performance. In this paper, we introduces SEDiff which leverages a diffusion-based generative model to extract essential structural information for accurate depth estimation. SEDiff wipes out the domain-specific components in the synthetic data and enables structural-consistent image transfer to mitigate the performance degradation due to the domain gap. Extensive experiments demonstrate the superiority of SEDiff over state-of-the-art methods in various scenarios for domain-adaptive depth estimation.


# 293
Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion

Huadong Li · Minhao Jing · Jin Wang · Shichao Dong · Jiajun Liang · Haoqiang Fan · Renhe Ji

It is widely believed that sparse supervision is worse than dense supervision in the field of depth completion, but the underlying reasons for this are rarely discussed. To this end, we revisit the task of radar-camera depth completion and present a new method with sparse LiDAR supervision to outperform previous dense LiDAR supervision methods in both accuracy and speed. Specifically, when trained by sparse LiDAR supervision, depth completion models usually output depth maps containing significant stripe-like artifacts. We find that such a phenomenon is caused by the implicitly learned positional distribution pattern from sparse LiDAR supervision, termed as LiDAR Distribution Leakage (LDL) in this paper. Based on such understanding, we present a novel Disruption-Compensation radar-camera depth completion framework to address this issue. The Disruption part aims to deliberately disrupt the learning of LiDAR distribution from sparse supervision, while the Compensation part aims to leverage 3D spatial and 2D semantic information to compensate for the information loss of previous disruptions. Extensive experimental results demonstrate that by reducing the impact of LDL, our framework with sparse supervision outperforms the state-of-the-art dense supervision methods with 11.6% improvement in Mean Absolute Error (MAE) and 1.6x speedup in Frame Per Second (FPS). The code will be released when the paper is accepted.


# 274
Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation

Genki Kinoshita · Ko Nishino

In this paper, we introduce a novel training method for making any monocular depth network learn absolute scale and estimate metric road-scene depth just from regular training data, i.e., driving videos. We refer to this training framework as StableCamH. The key idea is to leverage cars found on the road as sources of scale supervision but to incorporate them in the training robustly. StableCamH detects and estimates the sizes of cars in the frame and aggregates scale information extracted from them into a camera height estimate whose consistency across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any, otherwise scale-oblivious, monocular depth network to become not only scale-aware but also metric-accurate without the need for auxiliary sensors and extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of StableCamH and its state-of-the-art accuracy compared with related methods. We also show that StableCamH enables training on mixed datasets of different camera heights, which leads to larger-scale training and thus higher generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and StableCamH democratizes its deployment by establishing the means to train any model as a metric depth estimator.


# 270
Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth

Zimin Xia · Yujiao Shi · HONGDONG LI · Julian Kooij

Given a ground-level query image and a geo-referenced aerial image that covers the query's local surroundings, fine-grained cross-view localization aims to estimate the location of the ground camera inside the aerial image. Recent works have focused on developing advanced networks trained with accurate ground truth (GT) locations of ground images. However, the trained models always suffer a performance drop when applied to images in a new target area that differs from training. In most deployment scenarios, acquiring fine GT, i.e. accurate GT locations, for target-area images to re-train the network can be expensive and sometimes infeasible. In contrast, collecting images with coarse GT with errors of tens of meters is often easy. Motivated by this, our paper focuses on improving the performance of a trained model in a new target area by leveraging only the target-area images without fine GT. We propose a weakly supervised learning approach based on knowledge self-distillation. This approach uses predictions from a pre-trained model as pseudo GT to supervise a copy of itself. Our approach includes a mode-based pseudo GT generation for reducing uncertainty in pseudo GT and an outlier filtering method to remove unreliable pseudo GT. We validate our approach using two recent state-of-the-art models on two benchmarks. The results demonstrate that our approach consistently and considerably boosts the localization accuracy in the target area.


# 153
DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

Jiuming Liu · Dong Zhuo · Zhiheng Feng · Siting Zhu · Chensheng Peng · Zhe Liu · Hesheng Wang

Information inside visual and LiDAR data is well complementary derived from the fine-grained texture of images and massive geometric information in point clouds. However, it remains challenging to explore effective visual-LiDAR fusion, mainly due to the intrinsic data structure inconsistency between two modalities: Images are regular and dense, but LiDAR points are unordered and sparse. To address the problem, we propose a local-to-global fusion network with bi-directional structure alignment. To obtain locally fused features, we project points onto image plane as cluster centers and cluster image pixels around each center. Image pixels are pre-organized as pseudo points for image-to-point structure alignment. Then, we convert points to pseudo images by cylindrical projection (point-to-image structure alignment) and perform adaptive global feature fusion between point features with local fused features. Our method achieves state-of-the-art performance on KITTI odometry and FlyingThings3D scene flow datasets compared to both single-modal and multi-modal methods. Codes will be released upon publication.


# 258
Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection

Feng Liu · Tengteng Huang · Qianjing Zhang · Haotian Yao · Chi Zhang · Fang Wan · Qixiang Ye · Yanzhao Zhou

Multi-view 3D object detection systems often struggle with generating precise predictions due to the challenges in estimating depth from images, increasing redundant and incorrect detections. Our paper presents Ray Denoising, an innovative method that enhances detection accuracy by strategically sampling along camera rays to construct hard negative examples. These examples, visually challenging to differentiate from true positives, compel the model to learn depth-aware features, thereby improving its capacity to distinguish between true and false positives. Ray Denoising is designed as a plug-and-play module, compatible with any DETR-style multi-view 3D detectors, and it only minimally increases training computational costs without affecting inference speed. Our comprehensive experiments, including detailed ablation studies, consistently demonstrate that Ray Denoising outperforms strong baselines across multiple datasets. It achieves a 1.9\% improvement in mean Average Precision (mAP) over the state-of-the-art StreamPETR method on the NuScenes dataset. It shows significant performance gains on the Argoverse 2 dataset, highlighting its generalization capability. The code is included in supplementary materials and will be publicly available.


# 271
DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception

Kai Jiang · Jiaxing Huang · Weiying Xie · Jie Lei · Yunsong Li · Ling Shao · Shijian Lu

Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space. However, most existing studies were conducted under a supervised setup which cannot scale well while handling various new data. Unsupervised domain adaptive BEV, which effective learning from various unlabelled target data, is far under-explored. In this work, we design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features. DA-BEV introduces the idea of query into the domain adaptation framework to derive useful information from image-view and BEV features. It consists of two query-based designs, namely, query-based adversarial learning (QAL) and query-based self-training (QST), which exploits image-view features or BEV features to regularize the adaptation of the other. Extensive experiments show that DA-BEV achieves superior domain adaptive BEV perception performance consistently across multiple datasets and tasks such as 3D object detection and 3D scene segmentation.


# 275
Strong Double Blind
LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Sanmin Kim · Youngseok Kim · Sihwan Hwang · Hyeonjun Jeong · Dongsuk Kum

Recent advancements in camera-based 3D object detection have introduced cross-modal knowledge distillation to bridge the performance gap with LiDAR 3D detectors, leveraging the precise geometric information in LiDAR point clouds. However, existing cross-modal knowledge distillation methods tend to overlook the inherent imperfections of LiDAR, such as the ambiguity of measurements on distant or occluded objects, which should not be transferred to the image detector. To mitigate these imperfections in LiDAR teacher, we propose a novel method that leverage aleatoric uncertainty-free features from ground truth labels. In contrast to conventional label guidance approaches, we approximate the inverse function of the teacher's head to effectively embed label inputs into feature space. This approach provides additional accurate guidance alongside LiDAR teacher, thereby boosting the performance of the image detector. Additionally, we introduce feature partitioning, which effectively transfers knowledge from the teacher modality while preserving the distinctive features of the student, thereby maximizing the potential of both modalities. Experimental results demonstrate that our approach improves mAP and NDS by 5.1 points and 4.9 points compared to the baseline model, proving the effectiveness of our approach. The code will be made publicly available soon.


# 273
Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection

Junjie Huang · Yun Ye · Zhujin Liang · Yi Shan · Dalong Du

3D object Detection with LiDAR-camera encounters overfitting in algorithm development which is derived from the violation of some fundamental rules. We refer to the data annotation in dataset construction for theoretical optimization and argue that the regression task prediction should not involve the feature from the camera branch. By following the cutting-edge perspective of 'Detecting As Labeling', we propose a novel paradigm dubbed DAL. With the most classical elementary algorithms, a simple predicting pipeline is constructed by imitating the data annotation process. Then we train it in the simplest way to minimize its dependency and strengthen its portability. Though simple in construction and training, the proposed DAL paradigm not only substantially pushes the performance boundary but also provides a superior trade-off between speed and accuracy among all existing methods. With comprehensive superiority, DAL is an ideal baseline for both future work development and practical deployment. The code has been released to facilitate future work.


# 254
Strong Double Blind
RecurrentBEV: A Long-term Temporal Fusion Framework for Multi-view 3D Detection

Ming Chang · Xishan Zhang · Rui Zhang · Zhipeng Zhao · Guanhua He · Shaoli Liu

Long-term temporal fusion is frequently employed in camera-based Bird’s-Eye-View (BEV) 3D object detection to improve detection of occluded objects. Existing methods can be divided into two categories, parallel fusion and recurrent fusion. Recurrent fusion reduces inference latency and memory consumption but fails to exploit the long-term information as well as parallel fusion. In this paper, we first find two shortcomings of recurrent fusion paradigm: (1) Gradients of previous BEV features cannot directly contribute to the fusion module. (2) Semantic ambiguity are caused by coarse granularity of the BEV grids during aligning BEV features. Then based on the above analysis, we propose RecurrentBEV, a novel recurrent temporal fusion method for BEV based 3D object detector. By adopting RNN-style back-propagation and new-designed inner grid transformation, RecurrentBEV improves the long-term fusion ability while still enjoying efficient inference latency and memory consumption during inference. Extensive experiments on the nuScenes benchmark demonstrate its effectiveness, achieving a new state-of-the-art performance of 57.4 mAP and 65.1 NDS on the test set. The real-time version (25.6 FPS) achieves 44.5 mAP and 54.9 NDS without external dataset, outperforming the previous best method StreamPETR by 1.3 mAP and 0.9 NDS.


# 255
Strong Double Blind
JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention

Brian Cheong · Jiachen Zhou · Steven Waslander

Tracking-by-detection (TBD) methods achieve state-of-the-art performance on 3D tracking benchmarks for autonomous driving. On the other hand, tracking-by-attention (TBA) methods have the potential to outperform TBD methods, particularly for long occlusions and challenging detection settings. This work investigates why TBA methods continue to lag in performance behind TBD methods using a LiDAR-based joint detector and tracker called JDT3D. Based on this analysis, we propose two generalizable methods to bridge the gap between TBD and TBA methods: track sampling augmentation and confidence-based query propagation. JDT3D is trained and evaluated on the nuScenes dataset, achieving 0.574 on the AMOTA metric on the nuScenes test set, outperforming all existing LiDAR-based TBA approaches by over 6%. Based on our results, we further discuss some potential challenges with the existing TBA model formulation to explain the continued gap in performance with TBD methods. The implementation of JDT3D can be found at the following link: https://anonymous.4open.science/r/JDT3D-FC12


# 277
MMVR: Millimeter-wave Multi-View Radar Dataset and Benchmark for Indoor Perception

Mohammad Mahbubur Rahman · Ryoma Yataka · Sorachi Kato · Pu Wang · Peizhao Li · Adriano Cardace · Petros Boufounos

Compared with an extensive list of automotive radar datasets that support autonomous driving, indoor radar datasets are scarce at a smaller scale in the format of low-resolution radar point clouds and usually under an open-space single-room setting. In this paper, we scale up indoor radar data collection using multi-view high-resolution radar heatmap in a multi-day, multi-room, and multi-subject setting, with an emphasis on the diversity of environment and subjects. Referred to as the millimeter-wave multi-view radar (\textbf{MMVR}) dataset, it consists of $345$K multi-view radar frames collected from $25$ human subjects over $6$ different rooms, $446$K annotated bounding boxes/segmentation instances, and $7.59$ million annotated keypoints to support three major perception tasks of object detection, pose estimation, and instance segmentation, respectively. For each task, we report performance benchmarks under two protocols: a single subject in an open space and multiple subjects in several cluttered rooms with two data splits: random split and cross-environment split over $395$ 1-min data segments. We anticipate that MMVR facilitates indoor radar perception development for indoor vehicle (robot/humanoid) navigation, building energy management, and elderly care for better efficiency, user experience, and safety. The MMVR dataset is available at https://doi.org/10.5281/zenodo.12611978.


# 268
Strong Double Blind
UAV First-Person Viewers Are Radiance Field Learners

Liqi Yan · Qifan Wang · Junhan Zhao · Qiang Guan · Zheng Tang · Jianhui Zhang · Dongfang Liu

First-Person-View (FPV) holds immense potential for revolutionizing the trajectory of Unmanned Aerial Vehicles (UAVs), offering an exhilarating avenue for navigating complex building structures. Yet, traditional Neural Radiance Field (NeRF) methods face challenges such as sampling single points per iteration and requiring an extensive array of views for supervision. UAV videos exacerbate these issues with limited viewpoints and significant spatial scale variations, resulting in inadequate detail rendering across diverse scales. In response, we introduce FPV-NeRF, addressing these challenges through three key facets: (1) Temporal consistency. Leveraging spatio-temporal continuity ensures seamless coherence between frames; (2) Global structure. Incorporating various global features during point sampling preserves space integrity; (3) Local granularity. Employing a comprehensive framework and multi-resolution supervision for multi-scale scene feature representation tackles the intricacies of UAV video spatial scales. Additionally, due to the scarcity of publicly available FPV videos, we introduce an innovative view synthesis method using NeRF to generate FPV perspectives from UAV footage, enhancing spatial perception for drones. Our novel dataset spans diverse trajectories, from outdoor to indoor environments, in the UAV domain, differing significantly from traditional NeRF scenarios. Through extensive experiments encompassing both interior and exterior building structures, FPV-NeRF demonstrates a superior understanding of the UAV flying space, outperforming state-of-the-art methods in our curated UAV dataset. Explore our project page for further insights: https://fpv-nerf.github.io/.


# 267
Caltech Aerial RGB-Thermal Dataset in the Wild

Connor Lee · Matthew Anderson · Nikhil Ranganathan · Xingxing Zuo · Kevin T · Georgia Gkioxari · Soon-Jo Chung

We present the first publicly available RGB-thermal dataset designed for aerial robotics operating in natural environments. Our dataset captures a variety of terrains across the continental United States, including rivers, lakes, coastlines, deserts, and forests, and consists of synchronized RGB, long-wave thermal, global positioning, and inertial data. Furthermore, we provide semantic segmentation annotations for 10 classes commonly encountered in natural settings in order to facilitate the development of perception algorithms robust to adverse weather and nighttime conditions. Using this dataset, we propose new and challenging benchmarks for thermal and RGB-thermal semantic segmentation, RGB-to-thermal image translation, and visual-inertial odometry. We present extensive results using state-of-the-art methods and highlight the challenges posed by temporal and geographical domain shifts in our data.


# 265
V2X-Real: a Largs-Scale Dataset for Vehicle-to-Everything Cooperative Perception

Hao Xiang · Xin Xia · Zhaoliang Zheng · Runsheng Xu · Letian Gao · Zewei Zhou · xu han · Xinkai Ji · Mingxi Li · Zonglin Meng · Li Jin · Mingyue Lei · Zhaoyang Ma · Zihang He · Haoxuan Ma · Yunshuang Yuan · Yingqian Zhao · Jiaqi Ma

Recent advancements in Vehicle-to-Everything (V2X) technologies have enabled autonomous vehicles to share sensing information to see through occlusions, greatly boosting the perception capability. However, there are no real-world datasets to facilitate the real V2X cooperative perception research -- existing datasets either only support Vehicle-to-Infrastructure cooperation or Vehicle-to-Vehicle cooperation. In this paper, we present V2X-Real, a large-scale dataset that includes a mixture of multiple vehicles and smart infrastructure to facilitate the V2X cooperative perception development with multi-modality sensing data. Our V2X-Real is collected using two connected automated vehicles and two smart infrastructure, which are all equipped with multi-modal sensors including LiDAR sensors and multi-view cameras. The whole dataset contains 33K LiDAR frames and 171K camera data with over 1.2M annotated bounding boxes of 10 categories in very challenging urban scenarios. According to the collaboration mode and ego perspective, we derive four types of datasets for Vehicle-Centric, Infrastructure-Centric, Vehicle-to-Vehicle, and Infrastructure-to-Infrastructure cooperative perception. Comprehensive multi-class multi-agent benchmarks of SOTA cooperative perception methods are provided. The V2X-Real dataset and codebase are available at https://mobility-lab.seas.ucla.edu/v2x-real.


# 272
Strong Double Blind
CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Zhangchen Ye · Tao Jiang · Chenfeng Xu · Yiming Li · Hang Zhao

Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to enhance the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost.


# 250
Revisit Human-Scene Interaction via Space Occupancy

Xinpeng Liu · Haowen Hou · Yanchao Yang · Yong-Lu Li · Cewu Lu

Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks. However, one of the major obstacles is its limited data scale. High-quality data with simultaneously captured human and 3D environments is hard to acquire, resulting in limited data diversity and complexity. In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective, leading us to a unified novel view of Human-Occupancy Interaction. By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database: Motion Occupancy Base (MOB). Thus, the need for costly paired motion-scene datasets with high-quality scene scans can be substantially alleviated. With this new unified view of Human-Occupancy interaction, a single motion controller is proposed to reach the target state given the surrounding occupancy. Once trained on MOB with complex occupancy layout, which is stringent to human movements, the controller could handle cramped scenes and generalize well to general scenes with limited complexity like regular living rooms. With no GT 3D scenes for training, our method can generate realistic and stable HSI motions in diverse scenarios, including both static and dynamic scenes. Our code and data would be made publicly available.


# 266
Strong Double Blind
Enhancing Vectorized Map Perception with Historical Rasterized Maps

Xiaoyu Zhang · Guangwei Liu · Zihao Liu · Ningyi Xu · Yunhui Liu · Ji Zhao

In autonomous driving, there is growing interest in end-to-end online vectorized map perception in bird's-eye-view (BEV) space, with an expectation that it could replace traditional high-cost offline high-definition (HD) maps. However, the accuracy and robustness of these methods can be easily compromised in challenging conditions, such as occlusion or adverse weather, when relying only on onboard sensors. In this paper, we propose HRMapNet, leveraging a low-cost Historical Rasterized Map to enhance online vectorized map perception. The historical rasterized map can be easily constructed from past predicted vectorized results and provides valuable complementary information. To fully exploit a historical map, we propose two novel modules to enhance BEV features and map element queries. For BEV features, we employ a feature aggregation module to encode features from both onboard images and the historical map. For map element queries, we design a query initialization module to endow queries with priors from the historical map. The two modules contribute to leveraging map information in online perception. Our HRMapNet can be integrated with most online vectorized map perception methods. We integrate it in two state-of-the-art methods, significantly improving their performance on both the nuScenes and Argoverse 2 datasets. The source code will be released online.


# 269
Strong Double Blind
RoadPainter: Points Are Ideal Navigators for Topology transformER

Zhongxing Ma · Liang Shuang · Yongkun Wen · Weixin Lu · Guowei Wan

Topology reasoning aims to provide a comprehensive and precise understanding of road scenes, enabling autonomous driving systems to identify safe and efficient navigation routes. In this paper, we present RoadPainter, an innovative approach for detecting and reasoning about the topology of lane centerlines using multi-view images. The core concept behind RoadPainter is to extract a set of points from each centerline mask to improve the accuracy of centerline prediction. To achieve this, we start by implementing a transformer decoder that integrates a hybrid attention mechanism and a real-virtual separation strategy to predict coarse lane centerlines and establish topological associations. Subsequently, we generate centerline instance masks guided by the centerline points from the transformer decoder. Moreover, we derive an additional set of points from each mask and combine them with previously detected centerline points for further refinement. Additionally, we introduce an optional module that incorporates a Standard Definition (SD) map to further optimize centerline detection and enhance topological reasoning performance. Experimental evaluations on the OpenLane-V2 dataset demonstrate the state-of-the-art performance of RoadPainter.


# 262
Strong Double Blind
VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Seokha Moon · Hyun Woo · Hongbeen Park · Haeji Jung · Reza Mahjourian · Hyung-gun Chi · Hyerin Lim · Sangpil Kim · Jinkyu Kim

Predicting future trajectories for other road agents is an essential task for autonomous vehicles. Established trajectory prediction methods primarily use agent tracks generated by a detection and tracking system and HD map as inputs to a model which predicts agent trajectories. In this work, we propose a novel method that also incorporates visual input from surround-view cameras, allowing the model to utilize visual cues such as human gazes and gestures, road conditions, vehicle turn signals, etc, which are typically hidden from the model in prior trajectory prediction methods. Furthermore, we use textual descriptions generated by a Vision-Language Model (VLM) and refined by a Large Language Model (LLM) as supervision to guide the model on what to learn from the input data. Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance, and our qualitative analysis highlights how the model is able to exploit these additional inputs. Despite using these extra inputs, our method achieves a latency of 53 ms, significantly lower than that of previous single-agent prediction methods with similar performance. Lastly, in this work we create and release the nuScenes-Text dataset, which augments the established nuScenes dataset with rich textual annotations for every scene, demonstrating the positive impact of utilizing VLM on trajectory prediction.


# 263
DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

Xiaofeng Wang · Zheng Zhu · Guan Huang · Chen Xinze · Jiagang Zhu · Jiwen Lu

World models, especially in autonomous driving, are trending and drawing extensive attention due to their capacity for comprehending driving environments. The established world model holds immense potential for the generation of high-quality driving videos, and driving policies for safe maneuvering. However, a critical limitation in relevant research lies in its predominant focus on gaming environments or simulated settings, thereby lacking the representation of real-world driving scenarios. Therefore, we introduce \textit{DriveDreamer}, a pioneering world model entirely derived from real-world driving scenarios. Regarding that modeling the world in intricate driving scenes entails an overwhelming search space, we propose harnessing the powerful diffusion model to construct a comprehensive representation of the complex environment. Furthermore, we introduce a two-stage training pipeline. In the initial phase, \textit{DriveDreamer} acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states. Extensive experiments are conducted to verify that \textit{DriveDreamer} empowers both driving video generation and action prediction, faithfully capturing real-world traffic constraints. Furthermore, videos generated by \textit{DriveDreamer} significantly enhance the training of driving perception methods.


# 264
SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic

Kashyap Chitta · Daniel Dauner · Andreas Geiger

SLEDGE is the first generative simulator for vehicle motion planning trained on real-world driving logs. Its core component is a learned model that is able to generate agent bounding boxes and lane graphs. The model's outputs serve as an initial state for traffic simulation. The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder (RVAE). It encodes agents and the lane graph into distinct channels in a rasterized latent map. This facilitates both lane-conditioned agent generation and combined generation of lanes and agents with a Diffusion Transformer. Using generated entities in SLEDGE enables greater control over the simulation, e.g. upsampling turns or increasing traffic density. Further, SLEDGE can support 500m long routes, a capability not found in existing data-driven simulators like nuPlan. It presents new challenges for planning algorithms, evidenced by failure rates of over 40\% for PDM, the winner of the 2023 nuPlan challenge, when tested on hard routes and dense traffic generated by our model. Compared to nuPlan, SLEDGE requires 500$\times$ less storage to set up (<4GB), making it a more accessible option and helping with democratizing future research in this field.


# 178
Self-Supervised Video Desmoking for Laparoscopic Surgery

Renlong Wu · Zhilu Zhang · Shuohao Zhang · Longfei Gou · Haobin Chen · Yabin Zhang · Hao Chen · Wangmeng Zuo

Due to the difficulty of collecting real paired data, most existing desmoking methods train the models by synthesizing smoke, generalizing poorly to real surgical scenarios. Although a few works have explored single-image real-world desmoking in unpaired learning manners, they still encounter challenges in handling dense smoke. In this work, we address these issues together by introducing the self-supervised surgery video desmoking (SelfSVD). On the one hand, we observe that the frame captured before the activation of high-energy devices is generally clear (named pre-smoke frame, PS frame), thus it can serve as supervision for other smoky frames, making real-world self-supervised video desmoking practically feasible. On the other hand, in order to enhance the desmoking performance, we further feed the valuable information from PS frame into models, where a masking strategy and a regularization term are presented to avoid trivial solutions. In addition, we construct a real surgery video dataset for desmoking, which covers a variety of smoky scenes. Extensive experiments on the dataset show that our SelfSVD can remove smoke more effectively and efficiently while recovering more photo-realistic details than the state-of-the-art methods. The dataset, codes, and pre-trained models will be publicly available.


# 251
Strong Double Blind
BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events

Yijin Li · Yichen Shen · Zhaoyang Huang · Shuo Chen · Weikang Bian · Xiaoyu Shi · Fu-Yun Wang · Keqiang Sun · Hujun Bao · Zhaopeng Cui · Guofeng Zhang · Hongsheng LI

Recent advances in event-based vision suggest that they complement traditional cameras by providing continuous observation without frame rate limitations and high dynamic range which are well-suited for correspondence tasks such as optical flow and point tracking. However, so far there is still a lack of comprehensive benchmarks for correspondence tasks with both event data and images. To fill this gap, we propose BlinkVision, a large-scale and diverse benchmark with rich modality and dense annotation of correspondence. BlinkVision has several appealing properties: 1) Rich modalities: It encompasses both event data and RGB images. 2) Rich annotations: It provides dense per-pixel annotations covering optical flow, scene flow, and point tracking. 3) Large vocabulary: It incorporates 410 daily categories, sharing common classes with widely-used 2D and 3D datasets such as LVIS and ShapeNet. 4) Naturalistic: It delivers photorealism data and covers a variety of naturalistic factors such as camera shake and deformation. BlinkVision enables extensive benchmarks on three types of correspondence tasks (i.e., optical flow, point tracking and scene flow estimation) for both image-based methods and event-based methods, leading to new observations, practices, and insights for future research. The benchmark website is https://www.blinkvision.net/.


# 252
Strong Double Blind
LiDAR-Event Stereo Fusion with Hallucinations

Luca Bartolomei · Matteo Poggi · Andrea Conti · Stefano Mattoccia

Event stereo matching is an emerging technique to estimate depth with event cameras. However, events themselves are unlikely to trigger in the absence of motion or the presence of large, untextured regions, making the correspondence problem much more challenging. Purposely, we propose integrating a stereo event camera with a fixed-frequency active sensor -- e.g., a LiDAR -- collecting sparse depth measurements, overcoming previous limitations while still enjoying the microsecond resolution of event cameras. Such depth hints are used to hallucinate events in the stacks or along the raw streams, compensating for the lack of information in the absence of brightness changes. Our techniques are general, can be adapted to any structured representation to stack events proposed in the literature, and outperform state-of-the-art fusion methods applied to event-based stereo.


# 243
Temporal-Mapping Photography for Event Cameras

Yuhan Bao · Lei Sun · Yuqin Ma · Kaiwei Wang

Event cameras, or Dynamic Vision Sensors (DVS) are novel neuromorphic sensors that capture brightness changes as a continuous stream of “events” rather than traditional intensity frames. Converting sparse events to dense intensity frames faithfully has long been an ill-posed problem. Previous methods have primarily focused on converting events to video in dynamic scenes or with a moving camera. In this paper, for the first time, we realize events to dense intensity image conversion using a stationary event camera in static scenes with a transmittance adjustment device for brightness modulation. Different from traditional methods that mainly rely on event integration, the proposed Event-Based Temporal Mapping Photography (EvTemMap) measures the time of event emitting for each pixel. Then, the resulting Temporal Matrix is converted to an intensity frame with a temporal mapping neural network. At the hardware level, the proposed EvTemMapis implemented by combining a transmittance adjustment device with a DVS, named Adjustable Transmittance Dynamic Vision Sensor. Additionally, we collected TemMat dataset under various conditions including low-light and high dynamic range scenes. The experimental results showcase the high dynamic range, fine-grained details, and high-grayscale-resolution of the proposed EvTemMap. The code and dataset are available in https://github.com/YuHanBaozju/EvTemMap.


# 244
Strong Double Blind
Motion Aware Event Representation-driven Image Deblurring

Zhijing Sun · Xueyang Fu · Longzhuo Huang · Aiping Liu · Zheng-Jun Zha

Traditional image deblurring struggles with high-quality reconstruction due to limited motion data from single blurred images. Excitingly, the high-temporal resolution of event cameras records motion more precisely in a different modality, transforming image deblurring. However, many event camera-based methods, which only care about the final value of the polarity accumulation, ignore the influence of the absolute intensity change where events generate so fall short in perceiving motion patterns and effectively aiding image reconstruction. To overcome this, in this work, we propose a new event preprocessing technique that accumulates the deviation from the initial moment each time the event is updated. This process can distinguish the order of events to improve the perception of object motion patterns. To complement our proposed event representation, we create a recurrent module designed to meticulously extract motion features across local and global time scales. To further facilitate the event feature and image feature integration, which assists in image reconstruction, we develop a bi-directional feature alignment and fusion module. This module works to lessen inter-modal inconsistencies. Our approach has been thoroughly tested through rigorous experiments carried out on several datasets with different distributions. These trials have delivered promising results, with our method achieving top-tier performance in both quantitative and qualitative assessments.


# 237
Event-Based Motion Magnification

Yutian Chen · Shi Guo · Yu Fangzheng · Feng Zhang · Jinwei Gu · Tianfan Xue

Detecting and magnifying imperceptible high-frequency motions in real-world scenarios has substantial implications for industrial and medical applications. These motions are characterized by small amplitudes and high frequencies. Traditional motion magnification methods rely on costly high-speed cameras or active light sources, which limit the scope of their applications. In this work, we propose a dual-camera system consisting of an event camera and a conventional RGB camera for video motion magnification, providing temporally-dense information from the event stream and spatially-dense data from the RGB images. This innovative combination enables a broad and cost-effective amplification of high-frequency motions. By revisiting the physical camera model, we observe that estimating motion direction and magnitude necessitates the integration of event streams with additional image features. On this basis, we propose a novel deep network tailored for event-based motion magnification. Our approach utilizes the Second-order Recurrent Propagation module to proficiently interpolate multiple frames while addressing artifacts and distortions induced by magnified motions. Additionally, we employ a temporal filter to distinguish between noise and useful signals, thus minimizing the impact of noise. We also introduced the first event-based motion magnification dataset, which includes a synthetic subset and a real-captured subset for training and benchmarking. Through extensive experiments in magnifying small-amplitude, high-frequency motions, we demonstrate the effectiveness and accuracy of our dual-camera system and network, offering a cost-effective and flexible solution for motion detection and magnification. Both the code and dataset will be released upon publication.


# 249
Bidirectional Progressive Transformer for Interaction Intention Anticipation

Zichen Zhang · Hongchen Luo · Wei Zhai · Yu Kang · Yang Cao

Interaction intention anticipation aims to predict future hand trajectories and interaction hotspots jointly. Existing research often treated trajectory forecasting and interaction hotspots prediction as separate tasks or solely considered the impact of trajectories on interaction hotspots, which led to the accumulation of prediction errors over time. However, a deeper inherent connection exists between hand trajectories and interaction hotspots, which allows for continuous mutual correction between them. Building upon this relationship, a novel Bidirectional pOgressive Transformer (BOT), which introduces a bidirectional progressive mechanism into the anticipation of interaction intention is established. Initially,(BOT) maximizes the utilization of spatial information from the last observation frame through the Spatial-Temporal Reconstruction Module, mitigating conflicts arising from changes of view in the first-person videos. Subsequently, based on two independent prediction branches, a Bidirectional Progressive Enhancement Module is introduced to mutually improve the prediction of hand trajectories and interaction hotspots over time to minimize error accumulation. Finally, acknowledging the intrinsic randomness in human natural behavior, we employ a Trajectory Stochastic Unit and a C-VAE to introduce appropriate uncertainty to trajectories and interaction hotspots, respectively. Our method achieves state-of-the-art results on three benchmark datasets Epic-Kitchens-100, EGO4D, and EGTEA Gaze+, demonstrating superior in complex scenarios.


# 29
Reinforcement Learning via Auxillary Task Distillation

Abhinav Narayan Harish · Larry Heck · Josiah P Hanna · Zsolt Kira · Andrew Szot

We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill); a new method for leveraging reinforcement learning (RL) in long-horizon robotic control problems by distilling behaviors from auxiliary RL tasks. AuxDistill trains pixels-to-actions policies end-to-end with RL, without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves this by concurrently doing multi-task RL in auxiliary tasks which are easier than and relevant to the main task. Behaviors learned in the auxiliary tasks are transferred to solving the main task through a weighted distillation loss. In an embodied object-rearrangement task, we show AuxDistill achieves 27% higher success rate than baselines.


# 233
Strong Double Blind
COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation

Jiefeng Li · Ye Yuan · Davis Rempe · Haotian Zhang · Pavlo Molchanov · Cewu Lu · Jan Kautz · Umar Iqbal

Estimating global human motion from moving cameras is challenging due to the entanglement of human and camera motions. To mitigate the ambiguity, existing methods leverage learned human motion priors, which however often result in oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained control to disentangle human and camera motions. Although pre-trained motion diffusion models encode rich motion priors, we find it non-trivial to leverage such knowledge to guide global motion estimation from RGB videos. COIN introduces a novel control-inpainting score distillation sampling method to ensure well-aligned, consistent, and high-quality motion from the diffusion prior within a joint optimization framework. Furthermore, we introduce a new human-scene relation loss to alleviate the scale ambiguity by enforcing consistency among the humans, camera, and scene. Experiments on three challenging benchmarks demonstrate the effectiveness of COIN, which outperforms the state-of-the-art methods in terms of global human motion estimation and camera motion estimation. As an illustrative example, COIN outperforms the state-of-the-art method by 33% in world joint position error (W-MPJPE) on the RICH dataset.


# 227
EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Human Motion Generation

Wenyang Zhou · Zhiyang Dou · Zeyu Cao · Zhouyingcheng Liao · Jingbo Wang · Wenjia Wang · Yuan Liu · Taku Komura · Wenping Wang · Lingjie Liu

We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Current state-of-the-art generative diffusion models have produced impressive results but struggle to achieve fast generation without sacrificing quality. On the one hand, previous works, like motion latent diffusion, conduct diffusion within a latent space for efficiency, but learning such a latent space can be a non-trivial effort. On the other hand, accelerating generation by naively increasing the sampling step size, e.g., DDIM, often leads to quality degradation as it fails to approximate the complex denoising distribution. To address these issues, we propose EMDM, which captures the complex distribution during multiple sampling steps in the diffusion model, allowing for much fewer sampling steps and significant acceleration in generation. This is achieved by a conditional denoising diffusion GAN to capture multimodal data distributions among arbitrary (and potentially larger) step sizes conditioned on control signals, enabling fewer-step motion sampling with high fidelity and diversity. To minimize undesired motion artifacts, geometric losses are imposed during network learning. As a result, EMDM achieves real-time motion generation and significantly improves the efficiency of motion diffusion models compared to existing methods while achieving high-quality motion generation. Our code will be publicly available upon publication.


# 125
MotionChain: Conversational Motion Controllers via Multimodal Prompts

Biao Jiang · Xin Chen · Chi Zhang · Fukun Yin · Zhuoyuan Li · Gang Yu · Jiayuan Fan

Recent advancements in language models have demonstrated their adeptness in conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly in human motion models. By integrating multi-turn conversations in controlling continuous virtual human movements, generative human motion models can achieve an intuitive and step-by-step process of human task execution for humanoid robotics, game agents, or other embodied systems. In this work, we present MotionChain, a conversational human motion controller to generate continuous and long-term human motion through multimodal prompts. Specifically, MotionChain consists of multi-modal tokenizers that transform various data types such as text, image, and motion, into discrete tokens, coupled with a Vision-Motion-aware Language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain thus comprehends each instruction in multi-turn conversation and generates human motions followed by these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive manners of controlling and interacting with virtual humans.


# 208
Strong Double Blind
M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Seunggeun Chi · Hyung-gun Chi · Hengbo Ma · Nakul Agarwal · Faizan Siddiqui · Karthik Ramani · Kwonjoon Lee

We introduce the \textbf{M}ulti-\textbf{M}otion \textbf{D}iscrete \textbf{D}iffusion \textbf{M}odel (M2D2M), a novel approach in human motion generation from action descriptions, utilizing the strengths of discrete diffusion models. This approach adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions. The strength of M2D2M lies in its dynamic transition probability within the discrete diffusion model, which adapts transition probabilities based on the proximity between motion tokens, facilitating nuanced and context-sensitive human motion generation. Complemented by a two-phase sampling strategy that includes independent and joint denoising steps, M2D2M effectively generates long-term, smooth, and contextually coherent human motion sequences, utilizing a model initially trained for single-motion generation. Extensive experiments demonstrate that M2D2M surpasses current state-of-the-art benchmarks in motion generation from the text tasks, showcasing its efficacy in interpreting language semantics and generating dynamic, realistic motions.


# 204
Strong Double Blind
SMooDi: Stylized Motion Diffusion Model

Lei Zhong · Yiming Xie · Varun Jampani · Deqing Sun · Huaizu Jiang

We introduce a novel Stylized Motion Diffusion model, dubbed SMooDi, to generate stylized motion driven by content texts and style motion sequences. Unlike existing methods that either generate motion of various content or transfer style from one sequence to another, SMooDi can rapidly generate motion across a broad range of content and diverse styles. To this end, we tailor a pre-trained text-to-motion model for stylization. Specifically, we propose style guidance to ensure that the generated motion closely matches the reference style, alongside a lightweight style adaptor that directs the motion towards the desired style while ensuring realism. Experiments across various applications demonstrate that our proposed framework outperforms existing methods in stylized motion generation.


# 231
Strong Double Blind
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

Yuanhao Zhai · Kevin Lin · Linjie Li · Chung-Ching Lin · Jianfeng Wang · Zhengyuan Yang · David Doermann · Junsong Yuan · Zicheng Liu · Lijuan Wang

Significant advances have been made in human-centric video generation, yet the joint video-depth generation problem remains underexplored. Most existing monocular depth estimation methods may not generalize well to synthesized images or videos, and multi-view-based methods have difficulty controlling human appearance and motion. In this work, we present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation. Our IDOL consists of two novel designs. First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target, and cross-modal attention enables the mutual information flow. Second, to ensure a precise video-depth spatial alignment, we propose a motion consistency loss that enforces consistency between the video and depth feature motion fields, leading to harmonized outputs. Additionally, a cross-attention map consistency loss is applied to align the cross-attention map of the video denoising with that of the depth denoising, further facilitating alignment. Extensive experiments on the TikTok and NTU120 datasets show our superior performance, significantly surpassing existing methods in terms of video FVD and depth accuracy.


# 217
Strong Double Blind
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

Shaowei Liu · Zhongzheng Ren · Saurabh Gupta · Shenlong Wang

In this paper, we present PhysGen, a novel image-to-video generation method that converts a single image and an input condition-- such as force and torque applied to an object in the image-- to produce a realistic, physically plausable, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that effectively captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that utilizes rigid-body physics and inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are even precisely controllable, showcasing superior results over existing data-driven image-to-video generation works through a comprehensive user study. PhysGen's resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics. Code and data will be made publicly available upon acceptance.


# 202
SAVE: Protagonist Diversification with Structure Agnostic Video Editing

Yeji Song · Wonsik Shin · Junsoo Lee · Jeesoo Kim · Nojun Kwak

Driven by the upsurge progress in text-to-image (T2I) generation models, text-to-video (T2V) generation has experienced a significant advance as well. Accordingly, tasks such as modifying the object or changing the style in a video have been possible. However, previous works usually work well on trivial and consistent shapes, and easily collapse on a difficult target that has a largely different body shape from the original one. In this paper, we spot the bias problem in the existing video editing method that restricts the range of choices for the new protagonist and attempt to address this issue using the conventional image-level personalization method. We adopt motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly. To deal with the natural discrepancy between image and video, we propose a motion word with an inflated textual embedding to properly represent the motion in a source video. We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow, efficiently computed from the pre-calculated attention maps. Finally, we decouple the motion from the appearance of the source video with an additional pseudo word. Extensive experiments demonstrate the editing capability of our method, taking a step toward more diverse and extensive video editing.


# 201
Strong Double Blind
Kinetic Typography Diffusion Model

Seonmi Park · Inhwan Bae · Seunghyun Shin · HAE-GON JEON

This paper introduces a method for realistic kinetic typography that generates user-preferred animatable "text content". We draw on recent advances in guided video diffusion models to achieve visually-pleasing text appearances. To do this, we first construct a kinetic typography dataset, comprising about 600K videos. Our dataset is made from a variety of combinations in 584 templates designed by professional motion graphics designers and involves changing each letter's position, glyph, and size (i.e., flying, glitches, chromatic aberration, reflecting effects, etc.). Next, we propose a video diffusion model for kinetic typography. For this, there are three requirements: aesthetic appearances, motion effects, and readable letters. This paper identifies the requirements. For this, we present static and dynamic captions used as spatial and temporal guidance of a video diffusion model, respectively. The static caption describes the overall appearance of the video, such as colors, texture and glyph which represent a shape of each letter. The dynamic caption accounts for the movements of letters and backgrounds. We add one more guidance with zero convolution to determine which text content should be visible in the video. We apply the zero convolution to the text content, and impose it on the diffusion model. Lastly, our glyph loss, only minimizing a difference between the predicted word and its ground-truth, is proposed to make the prediction letters readable. Experiments show that our model generates kinetic typography videos with legible and artistic letter motions based on text prompts. For better understanding, we provide our source code and data samples as supplementary materials.


# 229
Strong Double Blind
DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency

Xiaojing Zhong · Xinyi Huang · Xiaofeng Yang · Guosheng Lin · Qingyao Wu

Diffusion models usher a new era of video editing, flexibly manipulating the video contents with text prompts. Despite the widespread application demand in editing human-centered videos, these models face significant challenges in handling complex objects like humans. In this paper, we introduce DeCo, a novel video editing framework specifically designed to treat humans and the background as separate editable targets, ensuring global spatial-temporal consistency by maintaining the coherence of each individual component. Specifically, we propose a decoupled dynamic human representation that utilizes a parametric human body prior to generate tailored humans while preserving the consistent motions as the original video. In addition, we consider the background as a layered atlas to apply text-guided image editing approaches on it. To further enhance the geometry and texture of humans during the optimization, we extend the calculation of score distillation sampling into normal space and image space. Moreover, we tackle inconsistent lighting between the edited targets by leveraging a lighting-aware video harmonizer, a problem previously overlooked in decompose-edit-combine approaches. Extensive qualitative and numerical experiments demonstrate that DeCo outperforms prior video editing methods in human-centered videos, especially in longer videos.


# 228
StableDrag: Stable Dragging for Point-based Image Editing

Yutao Cui · Xiaotong Zhao · Guozhen Zhang · Shengming Cao · Kai Ma · Limin Wang

Point-based image editing has attracted remarkable attention since the emergence of DragGAN. Recently, DragDiffusion further pushes forward the generative quality via adapting this dragging technique to diffusion models. Despite these great success, this dragging scheme exhibits two major drawbacks, namely inaccurate point tracking and incomplete motion supervision, which may result in unsatisfactory dragging outcomes. To tackle these issues, we build a stable and precise drag-based editing framework, coined as StableDrag, by designing a discirminative point tracking method and a confidence-based latent enhancement strategy for motion supervision. The former allows us to precisely locate the updated handle points, thereby boosting the stability of long-range manipulation, while the latter is responsible for guaranteeing the optimized latent as high-quality as possible across all the manipulation steps. Thanks to these unique designs, we instantiate two types of image editing models including StableDrag-GAN and StableDrag-Diff, which attains more stable dragging performance, through extensive qualitative experiments and quantitative assessment on DragBench.


# 205
Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing

Wonjun Kang · Kevin Galim · Hyung Il Koo

Diffusion models have achieved remarkable success in the domain of text-guided image generation and, more recently, in text-guided image editing. A commonly adopted strategy for editing real images involves inverting the diffusion process to obtain a noisy representation of the original image, which is then denoised to achieve the desired edits. However, current methods for diffusion inversion often struggle to produce edits that are both faithful to the specified text prompt and closely resemble the source image. To overcome these limitations, we introduce a novel and adaptable diffusion inversion technique for real image editing, which is grounded in a theoretical analysis of the role of $\eta$ in the DDIM sampling equation for enhanced editability. By designing a universal diffusion inversion method with a time- and region-dependent $\eta$ function, we enable flexible control over the editing extent. Through a comprehensive series of quantitative and qualitative assessments, involving a comparison with a broad array of recent methods, we demonstrate the superiority of our approach. Our method not only sets a new benchmark in the field but also significantly outperforms existing strategies.


# 196
Curved Diffusion: A Generative Model With Optical Geometry Control

Andrey Voynov · Amir Hertz · Moab Arar · Shlomi Fruchter · Danny Cohen-Or

State-of-the-art diffusion models can generate highly realistic images based on various conditioning like text, segmentation, and depth. However, an essential aspect often overlooked is the specific camera geometry used during image capture. The influence of different optical systems on the final scene appearance is frequently overlooked. This study introduces a framework that intimately integrates a text-to-image diffusion model with the particular lens geometry used in image rendering. Our method is based on a per-pixel coordinate conditioning method, enabling the control over the rendering geometry. Notably, we demonstrate the manipulation of curvature properties, achieving diverse visual effects, such as fish-eye, panoramic views, and spherical texturing using a single diffusion model.


# 195
Tuning-Free Image Customization with Image and Text Guidance

Pengzhi Li · Qiang Nie · Ying Chen · Xi Jiang · Kai Wu · Yuhuan Lin · Yong Liu · Jinlong Peng · Chengjie Wang · Feng Zheng

Despite significant advancements in image customization due to diffusion models, current methods still have several limitations: 1) unintended changes in non-target areas when regenerating the entire image; 2) guidance solely by a reference image or text descriptions; and 3) time-consuming fine-tuning, which limits their practical application. In response, we introduce a tuning-free framework for simultaneous text-image-guided image customization, enabling precise editing of specific image regions within seconds. Our approach preserves the semantic features of the reference image subject while allowing modification of detailed attributes based on text descriptions. To achieve this, we propose an innovative attention blending strategy that blends self-attention features in the UNet decoder during the denoising process. To our knowledge, this is the first tuning-free method that concurrently utilizes text and image guidance for specific region image customization. Our approach outperforms previous methods in both subjective and quantitative evaluations, providing an efficient solution for various practical applications, such as image synthesis, design, and creative photography.


# 191
Strong Double Blind
StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

Wen Li · Muyuan Fang · Cheng Zou · Biao Gong · Ruobing Zheng · Meng Wang · Jingdong Chen · Ming Yang

Despite the burst of innovative methods for controlling the diffusion process, effectively controlling image styles in text-to-image generation remains a challenging task. Many adapter-based methods impose image representation conditions on the denoising process to accomplish image control. However these conditions are not aligned with the word embedding space, leading to interference between image and text control conditions and the potential loss of semantic information from the text prompt. Addressing this issue involves two key challenges. Firstly, how to inject the style representation without compromising the effectiveness of text representation in control. Secondly, how to obtain the accurate style representation from a single reference image. To tackle these challenges, we introduce StyleTokenizer, a zero-shot style control image generation method that aligns style representation with text representation using a style tokenizer. This alignment effectively minimizes the impact on the effectiveness of text prompts. Furthermore, we collect a well-labeled style dataset named Style30k to train a style feature extractor capable of accurately representing style while excluding other content information. Experimental results demonstrate that our method fully grasps the style characteristics of the reference image, generating appealing images that are consistent with both the target image style and text prompt.


# 197
Strong Double Blind
AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling

Sherry Chen · Yaron Vaxman · Elad Ben Baruch · David Asulin · Aviad Moreshet · Misha Sra · Pradeep Sen

We propose Image Content Appeal Assessment (ICAA), a novel metric focused on quantifying the level of positive interest an image's content generates for viewers, such as the appeal of food in a photograph. This new metric is fundamentally different from traditional Image-Aesthetics Assessment (IAA), which judges an image's artistic quality. While previous studies have often confused the concepts of aesthetics'' andappeal,'' our work addresses this oversight by being the first to study ICAA explicitly. To do this, we propose a novel system that automates dataset creation, avoids extensive manual labeling work, and implements algorithms to estimate and boost content appeal. We use our pipeline to generate two large-scale datasets (70K+ images each) in diverse domains (food and room interior design) to train our models, which revealed little correlation between content appeal and aesthetics. Our user study, with more than 76% of participants preferring the appeal-enhanced images, confirms that our appeal ratings accurately reflect user preferences, establishing ICAA as a unique evaluative criterion.


# 194
Strong Double Blind
DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment

Yunpeng Bai · Xintao Wang · Yanpei Cao · Yixiao Ge · Chun Yuan · Ying Shan

This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text. DreamDiffusion leverages pre-trained text-to-image models and employs temporal masked signal modeling to pre-train the EEG encoder for effective and robust EEG representations. Additionally, the method further leverages the CLIP image encoder to provide extra supervision to better align EEG, text, and image embeddings with limited EEG-image pairs. Overall, the proposed method overcomes the challenges of using EEG signals for image generation, such as noise, limited information, and individual differences, and achieves promising results. Quantitative and qualitative results demonstrate the effectiveness of the proposed method as a significant step towards portable and low-cost ''thoughts-to-images", with potential applications in neuroscience and computer vision.


# 190
TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling

Jun Li · Zedong Zhang · Jian Yang

Generating creative combinatorial objects from two seemingly unrelated object texts is a challenging task in text-to-image synthesis, often hindered by a focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method, called balance swap-sampling. First, we propose a swapping mechanism that generates a novel combinatorial object image set by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing CLIP distances between the new images and their original generations, increasing the likelihood of accepting the high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Extensive experiments demonstrate that our approach outperforms recent SOTA T2I methods. Surprisingly, our results even rival those of human artists, such as frog-broccoli.


# 192
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

Zeyu Liu · Weicong Liang · Zhanhao Liang · Chong Luo · Ji Li · Gao Huang · Yuhui Yuan

Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoder, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder using a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than $20\%$ to nearly $90\%$ on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, through fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we showcase a substantial improvement in scene text rendering capabilities in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks.


# 187
Strong Double Blind
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

Zhihang Lin · Mingbao Lin · Meng Zhao · Rongrong Ji

This paper attempts to address the object repetition issue in patch-wise higher-resolution image generation. We propose AccDiffusion, an accurate method for patch-wise higher-resolution image generation without training. An in-depth analysis in this paper reveals an identical text prompt for different patches causes repeated object generation, while no prompt compromises the image details. Therefore, our AccDiffusion, for the first time, proposes to decouple the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of an image patch. Besides, AccDiffusion also introduces dilated sampling with window interaction for better global consistency in higher-resolution image generation. Experimental comparison with existing methods demonstrates that our AccDiffusion effectively addresses the issue of repeated object generation and leads to better performance in higher-resolution image generation.


# 193
Strong Double Blind
The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

Yi Yao · Chan-Feng Hsu · Jhe-Hao Lin · Hongxia Xie · Terence Lin · Yi-Ning Huang · Hong-Han Shuai · Wen-Huang Cheng

In spite of recent advancements in text-to-image generation, it still has limitations when it comes to complex, imaginative text prompts. Due to the limited exposure to diverse and complex data in their training sets, text-to-image models often struggle to comprehend the semantics of these difficult prompts, leading to the generation of irrelevant images. This work explores how diffusion models can process and generate images based on prompts requiring artistic creativity or specialized knowledge. Recognizing the absence of a dedicated evaluation framework for such tasks, we introduce a new benchmark, the Realistic-Fantasy Benchmark (RFBench), which blends scenarios from both realistic and fantastical realms. Accordingly, for reality and fantasy scene generation, we propose an innovative training-free approach, Realistic-Fantasy Network (RFNet), that integrates diffusion models with LLMs. Through our proposed RFBench, extensive human evaluations coupled with GPT-based compositional assessments have demonstrated our approach's superiority over other state-of-the-art methods.


# 185
Strong Double Blind
DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution

Shrey Singh · Prateek Keserwani · Masakazu Iwamura · Partha Pratim Roy

Severe blurring of scene text images, resulting in the loss of critical strokes and textual information, has a profound impact on text readability and recognizability. Therefore, scene text image super-resolution, aiming to enhance text resolution and legibility in low-resolution images, is a crucial task. In this paper, we introduce a novel generative model for scene text super-resolution called ``\textit{Diffusion-Conditioned-Diffusion Model} (DCDM).'' The model is designed to learn the distribution of high-resolution images via two conditions: 1) the low-resolution image and 2) the character-level text embedding generated by a latent diffusion text model. The latent diffusion text module is specifically designed to generate character-level text embedding space from the latent space of low-resolution images. Additionally, the character-level CLIP module has been used to align the high-resolution character-level text embeddings with low-resolution embeddings. This ensures visual alignment with the semantics of scene text image characters. Our experiments on the TextZoom dataset demonstrate the superiority of the proposed method to state-of-the-art methods.


# 184
MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

Nithin Gopalakrishnan Nair · Jeya Maria Jose Valanarasu · Vishal Patel

Large diffusion-based Text-to-Image (T2I) has shown impressive generative powers for text-to-image generation and spatially conditioned generation which makes the generated image follow the semantics of the condition. For most applications, we can train the model end-to-end with paired data to obtain photorealistic generation quality. However to add an additional task, one often needs to retrain the model from scratch using paired data across all modalities to retrain good generation performance. In this paper, we tackle this issue and propose a novel strategy to scale a generative model across new tasks with minimal compute. During our experiments, we discovered that the variance maps of intermediate feature maps of diffusion models capture the intensity of conditioning. utilizing this prior information, we propose MaxFusion, an efficient strategy to scale up text-to-image generation models. Specifically, we combine aligned features of multiple models hence bringing a compositional effect. Our fusion strategy can be integrated into off-the-shelf models to enhance their generative prowess. We show the effectiveness of our method by utilizing off-the-shelf models for multi-modal generation to show the effectiveness of our method. To facilitate further research, we will make the code public.


# 188
Strong Double Blind
ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion

Yan Hong · Yuxuan Duan · Bo Zhang · Haoxing Chen · Jun Lan · Huijia Zhu · Weiqiang Wang · Jianfu Zhang

Recent advancements in personalizing text-to-image (T2I) diffusion models have showcased their ability to generate images grounded in personalized visual concepts with just a few user-provided examples. However, these models often face challenges in preserving high visual fidelity, especially when adjusting scenes based on textual descriptions. To tackle this issue, we present ComFusion, an innovative strategy that utilizes pretrained models to create compositions of user-supplied subject images and predefined text scenes. ComFusion incorporates a class-scene prior preservation regularization, utilizing composites of subject class and scene-specific knowledge from pretrained models to boost generation fidelity. Moreover, ComFusion employs coarse-generated images to ensure they are in harmony with both the instance images and scene texts. Consequently, ComFusion maintains a delicate balance between capturing the subject's essence and ensuring scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization have demonstrated its qualitative and quantitative superiority.


# 179
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

Jian Ma · Chen Chen · Qingsong Xie · Haonan Lu

Text-to-image diffusion models are well-known for their ability to generate realistic images based on textual prompts. However, the existing works have predominantly focused on English, lacking support for non-English text-to-image models. The most commonly used translation methods cannot solve the generation problem related to language culture, while training from scratch on a specific language dataset is prohibitively expensive. In this paper, we are inspired to propose a simple plug-and-play language transfer method based on knowledge distillation. All we need to do is train a lightweight MLP-like parameter-efficient adapter (PEA) with only 6M parameters under teacher knowledge distillation along with a small parallel data corpus. We are surprised to find that freezing the parameters of UNet can still achieve remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. Additionally, it closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin to achieve significant results in downstream tasks in cross-lingual text-to-image generation.


# 180
Strong Double Blind
Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

Juntu Zhao · Junyu Deng · Yixin Ye · Chongxuan Li · Zhijie Deng · Dequan Wang

Advancements in text-to-image diffusion models have broadened extensive downstream practical applications, but such models often encounter misalignment issues between text and image. Taking the generation of a combination of two disentangled concepts as an example, say given the prompt “a tea cup of iced coke”, existing models usually generate a glass cup of iced coke because the iced coke usually co-occurs with the glass cup instead of the tea one during model training. The root of such misalignment is attributed to the confusion in the latent semantic space of text-to-image diffusion models, and hence we refer to the “a tea cup of iced coke” phenomenon as Latent Concept Misalignment (LC-Mis). We leverage large language models (LLMs) to thoroughly investigate the scope of LC-Mis, and develop an automated pipeline for aligning the latent semantics of diffusion models to text prompts. Empirical assessments confirm the effectiveness of our approach, substantially reducing LC-Mis errors and enhancing the robustness and versatility of text-to-image diffusion models. Our code and dataset have been available online for reference.


# 183
Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models

Siao Tang · Xin Wang · Hong Chen · Chaoyu Guan · Zewen Wu · Yansong Tang · Wenwu Zhu

High computational overhead is a troublesome problem for diffusion models. Recent studies have leveraged post-training quantization (PTQ) to compress diffusion models. However, most of them only focus on unconditional models, leaving the quantization of widely-used pretrained text-to-image models, e.g., Stable Diffusion, largely unexplored. In this paper, we propose a novel post-training quantization method PCR (Progressive Calibration and Relaxing) for text-to-image diffusion models, which consists of a progressive calibration strategy that considers the accumulated quantization error across timesteps, and an activation relaxing strategy that improves the performance with negligible cost. Additionally, we demonstrate the previous metrics for text-to-image diffusion model quantization are not accurate due to the distribution gap. To tackle the problem, we propose a novel QDiffBench benchmark, which utilizes data in the same domain for more accurate evaluation. Besides, QDiffBench also considers the generalization performance of the quantized model outside the calibration dataset. Extensive experiments on Stable Diffusion and Stable Diffusion XL demonstrate the superiority of our method and benchmark. Moreover, we are the first to achieve quantization for Stable Diffusion XL while maintaining the performance.


# 182
Strong Double Blind
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

Chao Gong · Kai Chen · Zhipeng Wei · Jingjing Chen · Yu-Gang Jiang

Text-to-image models encounter safety issues, including concerns related to copyright and Not-Safe-For-Work (NSFW) content. Despite several methods have been proposed for erasing inappropriate concepts from diffusion models, they often exhibit incomplete erasure, lack robustness against red-teaming tools, and inadvertently damage generation ability. In this work, we introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model without necessitating additional training. Specifically, RECE efficiently leverages a closed-form solution to compute new embeddings capable of regenerating erased concepts on the model subjected to concept erasure. To mitigate inappropriate content potentially represented by derived embeddings, RECE further projects them onto harmless concepts in cross-attention layers. The generation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts. Besides, to preserve the model's generation ability, RECE introduces an additional regularization term to the closed-form solution, resulting in minimizing the impact on unrelated concepts during the erasure process. Benchmarking against previous approaches, our method achieves more efficient and thorough erasure with small damage to generation ability and demonstrates enhanced robustness against red-teaming tools.


# 181
Distilling Diffusion Models into Conditional GANs

Minguk Kang · Richard Zhang · Connelly Barnes · Sylvain Paris · Suha Kwak · Jaesik Park · Eli Shechtman · Jun-Yan Zhu · Taesung Park

We propose a method to distill a complex multistep diffusion model into a single-step student model, dramatically accelerating inference while preserving image quality. Our approach interprets the diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model’s ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a Perceptual loss in the latent space of the diffusion model with an ensemble of augmentations. Despite dataset construction costs, E-LatentLPIPS converges more efficiently than many existing distillation methods. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. We demonstrate that our one-step generator outperforms all published diffusion distillation models on the zero-shot COCO2014 benchmark.


# 203
Responsible Visual Editing

Minheng Ni · Yeli Shen · Yabin Zhang · Wangmeng Zuo

With the recent advancements in visual synthesis, there is a growing risk of encountering synthesized images with detrimental effects, such as hate, discrimination, and privacy violations. Unfortunately, it remains unexplored on how to avoid synthesizing harmful images and convert them into responsible ones. In this paper, we present responsible visual editing, which modifies risky concepts within an image to more responsible ones with minimal content changes. However, the concepts that need to be edited are often abstract, making them hard to located and modified. To tackle these challenges, we propose a Cognitive Editor (CoEditor) by harnessing the large multimodal models through a two-stage cognitive process: (1) a perceptual cognitive process to locate what to be modified and (2) a behavioral cognitive process to strategize how to modify. To mitigate the negative implications of harmful images on research, we build a transparent and public dataset, namely AltBear, which expresses harmful information using teddy bears instead of humans. Experiments demonstrate that CoEditor can effectively comprehend abstract concepts in complex scenes, significantly surpassing the baseline models for responsible visual editing. Moreover, we find that the AltBear dataset corresponds well to the harmful content found in real images, providing a safe and effective benchmark for future research. Our source code and dataset can be found at https://github.com/kodenii/Responsible-Visual-Editing.


# 198
Strong Double Blind
HiEI: A Universal Framework for Generating High-quality Emerging Images from Natural Images

Jingmeng Li · Lukang Fu · Surun Yang · Hui Wei

Emerging images (EIs) are a type of stylized image that consists of discrete speckles with irregular shapes and sizes, colored only in black and white. EIs have significant applications that can contribute to the study of perceptual organization in cognitive psychology and serve as a CAPTCHA mechanism. However, generating high-quality EIs from natural images faces the following challenges: 1) color quantization--how to minimize perceptual loss when reducing the color space of a natural image to 1-bit; 2) perceived difficulty adjustment--how to adjust the perceived difficulty for object discovery and recognition. This paper proposes a universal framework HiEI to generate high-quality EIs from natural images, which contains three modules: the human-centered color quantification module (TTNet), the perceived difficulty control (PDC) module, and the template vectorization (TV) module. TTNet and PDC modules are specifically designed to address the aforementioned challenges. Experimental results show that compared to the existing EI generation methods, HiEI can generate EIs with superior content and style quality while offering more flexibility in controlling perceived difficulty. In particular, we experimently demonstrate that EIs generated by HiEI can effectively defend against attacks from deep network-based visual models, confirming their viability as a CAPTCHA mechanism.


# 199
Strong Double Blind
MagicEraser: Erasing Any Objects via Semantics-Aware Control

FAN LI · Zixiao Zhang · Yi Huang · Jianzhuang Liu · Renjing Pei · Bin Shao · Xu Songcen

The traditional image inpainting task aims to restore corrupted regions by referencing surrounding background and foreground. However, the object erasure task, which is in increasing demand, aims to erase objects and generate harmonious background. Previous GAN-based inpainting methods struggle with intricate texture generation. Emerging diffusion model-based algorithms, such as Stable Diffusion Inpainting, exhibit the capability to generate novel content, but they often produce incongruent results at the locations of the erased objects and require high-quality text prompt inputs. To address these challenges, we introduce MagicEraser, a diffusion model-based framework tailored for the object erasure task. It consists of two phases: content initialization and controllable generation. In the latter phase, we develop two plug-and-play modules called prompt tuning and semantics-aware attention refocus. Additionally, we propose a data construction strategy that generates training data specially suitable for this task. MagicEraser achieves fine and effective control of content generation while mitigating undesired artifacts. Experimental results highlight a valuable advancement of our approach in the object erasure task.


# 43
GenQ: Quantization in Low Data Regimes with Generative Synthetic Data

YUHANG LI · Youngeun Kim · Donghyun Lee · Souvik Kundu · Priyadarshini Panda

In the realm of deep neural network deployment, low-bit quantization presents a promising avenue for enhancing computational efficiency. However, it often hinges on the availability of training data to mitigate quantization errors, a significant challenge when data availability is scarce or restricted due to privacy or copyright concerns. Addressing this, we introduce GenQ, a novel approach employing an advanced Generative AI model to generate photorealistic, high-resolution synthetic data, overcoming the limitations of traditional methods that struggle to accurately mimic complex objects in extensive datasets like ImageNet. Our methodology is underscored by two robust filtering mechanisms designed to ensure the synthetic data closely aligns with the intrinsic characteristics of the actual training data. In case of limited data availability, the actual data is used to guide the synthetic data generation process, enhancing fidelity through the inversion of learnable token embeddings. Through rigorous experimentation, GenQ establishes new benchmarks in data-free and data-scarce quantization, significantly outperforming existing methods in accuracy and efficiency, thereby setting a new standard for quantization in low data regimes.


# 168
DiffiT: Diffusion Vision Transformers for Image Generation

Ali Hatamizadeh · Jiaming Song · Guilin Liu · Jan Kautz · Arash Vahdat

Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities and scalability, especially for recognition tasks. In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). Specifically, we propose a methodology for finegrained control of the denoising process and introduce the Time-dependant Multihead Self Attention (TMSA) mechanism. DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency. We also propose latent and image space DiffiT models and show SOTA performance on a variety of class-conditional and unconditional synthesis tasks at different resolutions. The Latent DiffiT model achieves a new SOTA FID score of 1.73 on ImageNet-256 dataset while having 139 M and 114 M less parameters than other Transformer-based diffusion models such as MDT and DiT, respectively.


# 169
Strong Double Blind
DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation

Wenliang Zhao · Haolin Wang · Jie Zhou · Jiwen Lu

Diffusion probabilistic models (DPMs) have shown remarkable performance in visual synthesis but are computationally expensive due to the need for multiple evaluations during the sampling. Recent advancements in fast samplers of DPMs have significantly reduced the required number of function evaluations (NFE), and the samplers based on predictor-corrector substantially improve the sampling quality in extremely few NFE (5$\sim$10). However, the predictor-corrector framework inherently suffers from a misalignment issue caused by the extra corrector step, which will harm the sampling performance, especially with a large classifier-free guidance scale (CFG). In this paper, we introduce a new fast DPM sampler called DC-Solver, which leverages dynamic compensation (DC) to mitigate the misalignment issue of the predictor-corrector samplers. The dynamic compensation is controlled by compensation ratios that are adaptive to the sampling steps and can be optimized on only 10 datapoints by pushing the sampling trajectory toward a ground truth trajectory. We further propose a cascade polynomial regression (CPR) which can instantly predict the compensation ratios on unseen sampling configurations. Additionally, we find that the proposed dynamic compensation can also serve as a plug-and-play module to boost the performance of predictor-only samplers. Extensive experiments on both unconditional sampling and conditional sampling demonstrate that our DC-Solver can consistently improve the sampling quality over previous methods on different DPMs with a wide range of resolutions up to 1024$\times$1024. Notably, we achieve 10.38 FID (NFE=5) on unconditional FFHQ and 0.394 MSE (NFE=5, CFG=7.5) on Stable-Diffusion-2.1.


# 173
Strong Double Blind
∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions

Minh Quan Le · Alexandros Graikos · Srikar Yellapragada · Rajarsi Gupta · Joel Saltz · Dimitris Samaras

Synthesizing high-resolution images from intricate, domain-specific information remains a significant challenge in generative modeling, particularly for applications in large-image domains such as digital histopathology and remote sensing. Existing methods face critical limitations: conditional diffusion models in pixel or latent space cannot exceed the resolution on which they were trained without losing fidelity, and computational demands increase significantly for larger image sizes. Patch-based methods offer computational efficiency but fail to capture long-range spatial relationships due to their overreliance on local information. In this paper, we introduce a novel conditional diffusion model in infinite dimensions, \texttt{$\infty$-Brush} for controllable large image synthesis. We propose a cross-attention neural operator to enable conditioning in function space. Our model overcomes the constraints of traditional finite-dimensional diffusion models and patch-based methods, offering scalability and superior capability in preserving global image structures while maintaining fine details. To the best of our knowledge, \texttt{$\infty$-Brush} is the first conditional diffusion model in function space, that can controllably synthesize images at arbitrary resolutions of up to $4096\times4096$ pixels. The code will be released to the public.


# 167
Unmasking Bias in Diffusion Model Training

Hu Yu · Li Shen · Jie Huang · Hongsheng LI · Feng Zhao

Denoising diffusion models have emerged as a dominant approach for image generation, however they still suffer from slow convergence in training and color shift issues in sampling. In this paper, we identify that these obstacles can be largely attributed to bias and suboptimality inherent in the default training paradigm of diffusion models. Specifically, we offer theoretical insights that the prevailing constant loss weight strategy in $\epsilon$-prediction of diffusion models leads to biased estimation during the training phase, hindering accurate estimations of original images. To address the issue, we propose a simple but effective weighting strategy derived from the unlocked biased part. Furthermore, we conduct a comprehensive and systematic exploration, unraveling the inherent bias problem in terms of its existence, impact and underlying reasons. These analyses contribute to advancing the understanding of diffusion models. Empirical results demonstrate that our method remarkably elevates sample quality and displays improved efficiency in both training and sampling processes, by only adjusting loss weighting strategy.


# 166
Compensation Sampling for Improved Convergence in Diffusion Models

Hui Lu · Albert Ali Salah · Ronald Poppe

Diffusion models achieve remarkable quality in image generation, but at a cost. Iterative denoising requires many time steps to produce high fidelity images. We argue that the denoising process is crucially limited by an accumulation of the reconstruction error due to an initial inaccurate reconstruction of the target data. This leads to lower quality outputs, and slower convergence. To address these issues, we propose compensation sampling to guide the generation towards the target domain. We introduce a compensation term, implemented as a U-Net, which adds negligible computation overhead during training. Our approach is flexible and we demonstrate its application in unconditional generation, face inpainting, and face de-occlusion on benchmark datasets CIFAR-10, CelebA, CelebA-HQ, FFHQ-256, and FSG. Our approach consistently yields state-of-the-art results in terms of image quality, while accelerating the denoising process to converge during training by up to an order of magnitude.


# 297
Strong Double Blind
Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks

Jiawei Wu · Zhi Jin

Recent research tries to extend image restoration capabilities from human perception to machine perception, thereby enhancing the performance of high-level vision tasks in degraded environments. These methods, primarily based on supervised learning, typically involve the retraining of restoration networks or high-level vision networks. However, collecting paired data in real-world scenarios poses a challenge, and with the increasing attention of large-scale models, retraining becomes an inefficient and cost-prohibitive solution. To this end, we propose an unsupervised learning method called Variational Translator (VaT), which does not require retraining existing restoration and high-level vision networks. Instead, it establishes a lightweight network that serves as an intermediate bridge between them. By variational inference, VaT approximates the joint distribution of restoration output and high-level vision input, dividing the optimization objective into preserving content and maximizing marginal likelihood associated with high-level vision tasks. By cleverly leveraging self-training paradigms, VaT achieves the above optimization objective without requiring labels. As a result, the translated images maintain a close resemblance to their original content while also demonstrating exceptional performance on high-level vision tasks. Extensive experiments in dehazing and low-light enhancement for detection and classification show the superiority of our method over other state-of-the-art unsupervised counterparts, even significantly surpassing supervised methods in some complex real-world scenarios.


# 186
Strong Double Blind
Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint

Sixiang Chen · Tian Ye · Kai Zhang · Zhaohu Xing · Yunlong Lin · Lei Zhu

Recent advancements in adverse weather restoration have shown potential, yet the unpredictable and varied combinations of weather degradations in the real world pose significant challenges. Previous methods typically struggle with dynamically handling intricate degradation combinations and carrying on background reconstruction precisely, leading to performance and generalization limitations. Drawing inspiration from prompt learning and the "Teaching Tailored to Talent" concept, we introduce a novel pipeline, T3-DiffWeather. Specifically, we employ a prompt pool that allows the network to autonomously combine sub-prompts to construct weather-prompts, harnessing the necessary attributes to adaptively tackle unforeseen weather input. Moreover, from a scene modeling perspective, we incorporate general prompts constrained by Depth-Anything feature to provide the scene-specific condition for the diffusion process. Furthermore, by incorporating contrastive prompt loss, we ensures distinctive representations for both types of prompts by a mutual pushing strategy. Experimental results demonstrate that our method achieves state-of-the-art performance across various synthetic and real-world datasets, markedly outperforming existing diffusion techniques in terms of computational efficiency.


# 177
Strong Double Blind
Dual-Rain: Video Rain Removal using Assertive and Gentle Teachers

Tingting Chen · Beibei Lin · Yeying Jin · Wending Yan · WEI YE · Yuan Yuan · Robby T. Tan

Existing video deraining methods addressing both rain accumulation and rain streaks rely on synthetic data for training as clear ground-truths are unavailable. Hence, they struggle to handle real-world rain videos due to domain gaps. In this paper, we present Dual-Rain, a novel video deraining method with a two-teacher process. Our novelty lies in our two-teacher framework, featuring an assertive and a gentle teacher. The novel two-teacher removes rain streaks and rain accumulation by learning from real rainy videos without the need for ground-truths. The basic idea of our assertive teacher is to rapidly accumulate knowledge from our student, accelerating deraining capabilities. The key idea of our gentle teacher is to slowly gather knowledge, preventing over-suppression of pixel intensity caused by the assertive teacher. Learning the predictions from both teachers allows the student to effectively learn from less challenging regions and gradually address more challenging regions in real-world rain videos, without requiring their corresponding ground-truths. Once high-confidence rain-free regions from our two-teacher are obtained, we augment their corresponding inputs to generate challenging inputs. Our student is then trained on these inputs to iteratively address more challenging regions. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real-world videos quantitatively and qualitatively, outperforming the state-of-the-art by 11% of PSNR on the SynHeavyRain dataset.


# 165
A Comparative Study of Image Restoration Networks for General Backbone Network Design

Xiangyu Chen · Zheyuan Li · Yuandong Pu · Yihao Liu · Jiantao Zhou · Yu Qiao · Chao Dong

Despite the significant progress made by deep models in various image restoration tasks, existing image restoration networks still face challenges in terms of task generality. An intuitive manifestation is that networks which excel in certain tasks often fail to deliver satisfactory results in others. To illustrate this point, we select five representative networks and conduct a comparative study on five classic image restoration tasks. First, we provide a detailed explanation of the characteristics of different image restoration tasks and backbone networks. Following this, we present the benchmark results and analyze the reasons behind the performance disparity of different models across various tasks. Drawing from this comparative study, we propose that a general image restoration backbone network needs to meet the functional requirements of diverse tasks. Based on this principle, we design a new general image restoration backbone network, X-Restormer. Extensive experiments demonstrate that X-Restormer possesses good task generality and achieves state-of-the-art performance across a variety of tasks.


# 171
Strong Double Blind
OAPT: Offset-Aware Partition Transformer for Double JPEG Artifacts Removal

Qiao Mo · Yukang Ding · Jinhua Hao · Qiang Zhu · Ming Sun · Chao Zhou · Feiyu Chen · Shuyuan Zhu

Deep learning-based methods have shown remarkable performance in single JPEG artifacts removal task. However, existing methods tend to degrade on double JPEG images, which are prevalent in real-world scenarios. To address this issue, we propose Offset-Aware Partition Transformer for double JPEG artifacts removal, termed as OAPT. We conduct an analysis of double JPEG compression that results in up to four patterns within each 8x8 block and design our model to cluster the similar patterns to reduce the difficulty of restoration. Our OAPT consists of two components: compression offsets predictor and image reconstructor. Specifically, the predictor estimates pixel offsets between the first and second compression, which are then utilized to divide different patterns. The reconstructor is mainly based on several Hybrid Partition Attention Blocks (HPAB), combining vanilla window-based self-attention and sparse attention for clustered pattern features. Extensive experiments demonstrate that OAPT outperforms state-of-the-art methods by more than 0.16dB in double JPEG image restoration task. Moreover, without increasing any computation cost, the pattern clustering module in HPAB can serve as a plugin to enhance other transformer-based image restoration methods. The code will be available at https://github.com/OAPT/OAPT.git.


# 174
Strong Double Blind
Domain-adaptive Video Deblurring via Test-time Blurring

Jin-Ting He · Fu-Jen Tsai · Jia-Hao Wu · Yan-Tsung Peng · Chung-Chi Tsai · Chia-Wen Lin · Yen-Yu Lin

Dynamic scene video deblurring aims to remove undesirable blurry artifacts captured during the exposure process. Although previous video deblurring methods have achieved impressive results, they still suffer from significant performance drops due to the domain gap between training and testing videos, especially for those captured in real-world scenarios. To address this issue, we propose a domain adaptation scheme based on a reblurring model to achieve test-time fine-tuning for deblurring models in unseen domains. Since blurred and sharp pairs are unavailable for fine-tuning during inference, our scheme can generate domain-adaptive training pairs to calibrate a deblurring model for the target domain. First, a Relatively-Sharp Detection Module is proposed to identify relatively sharp regions from the blurry input images and regard them as pseudo-sharp images. Next, we utilize a reblurring model to produce blurred images based on the pseudo-sharp images extracted during testing. To synthesize blurred images in compliance with the target data distribution, we propose a Domain-adaptive Blur Condition Generation Module to create domain-specific blur conditions for the reblurring model. Finally, the generated pseudo-sharp and blurred pairs are used to fine-tune a deblurring model for better performance. Extensive experimental results demonstrate that our approach can significantly improve state-of-the-art video deblurring methods, providing performance gains of up to 7.54dB on various real-world video deblurring datasets.


# 163
Kernel Diffusion: An Alternate Approach to Blind Deconvolution

Yash Sanghvi · Yiheng Chi · Stanley Chan

Blind deconvolution problems are severely ill-posed because neither the underlying signal nor the forward operator are not known exactly. Conventionally, these problems are solved by alternating between estimation of the image and kernel while keeping the other fixed. In this paper, we show that this framework is flawed because of its tendency to get trapped in local minima and, instead, suggest the use of a kernel estimation strategy with a non-blind solver. This framework is employed by a diffusion method which is trained to sample the blur kernel from the conditional distribution with guidance from a pre-trained non-blind solver. The proposed diffusion method leads to state-of-the-art results on both synthetic and real blur datasets.


# 172
Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models

Claudio Rota · Marco Buzzelli · Joost Van de Weijer

In this paper, we address the problem of enhancing perceptual quality in video super-resolution (VSR) using Diffusion Models (DMs) while ensuring temporal consistency among frames. We present StableVSR, a VSR method based on DMs that can significantly enhance the perceptual quality of upscaled videos by synthesizing realistic and temporally-consistent details. We introduce the Temporal Conditioning Module (TCM) into a pre-trained DM for single image super-resolution to turn it into a VSR method. TCM uses the novel Temporal Texture Guidance, which provides it with spatially-aligned and detail-rich texture information synthesized in adjacent frames. This guides the generative process of the current frame toward high-quality and temporally-consistent results. In addition, we introduce the novel Frame-wise Bidirectional Sampling strategy to encourage the use of information from past to future and vice-versa. This strategy improves the perceptual quality of the results and the temporal consistency across frames. We demonstrate the effectiveness of StableVSR in enhancing the perceptual quality of upscaled videos while achieving better temporal consistency compared to existing state-of-the-art methods for VSR.


# 162
Strong Double Blind
Kalman-Inspired Feature Propagation for Video Face Super-Resolution

Ruicheng Feng · Chongyi Li · Chen Change Loy

Despite the promising progress of face image super-resolution, video face super-resolution remains relatively under-explored. Existing approaches either adapt general video super-resolution networks to face datasets or apply established face image super-resolution models independently on individual video frames. These paradigms encounter challenges either in reconstructing facial details or maintaining temporal consistency. To address these issues, we introduce a novel framework called Kalman-inspired Feature Propagation (KEEP), designed to maintain a stable face prior over time. The Kalman filtering principles offer our method a recurrent ability to use the information from previously restored frames to guide and regulate the restoration process of the current frame. Extensive experiments demonstrate the effectiveness of our method in capturing facial details consistently across video frames. Both code and model will be released.


# 170
Strong Double Blind
RealViformer: Investigating Attention for Real-World Video Super-Resolution

Yuehan Zhang · Angela Yao

In real-world video super-resolution (VSR), videos suffer from in-the-wild degradations and artifacts. VSR methods, especially recurrent ones, tend to propagate artifacts over time in the real-world setting and are more vulnerable than image super-resolution. This paper investigates the influence of artifacts on commonly used covariance-based attention mechanisms in VSR. Comparing the widely-used spatial attention, which computes covariance over space, versus the less common channel attention, we observe that the latter is less sensitive to artifacts. However, channel attention leads to feature redundancy - evidenced by the higher covariance among output channels. As such, we explore simple techniques such as the squeeze-excite mechanism and covariance-based rescaling to counter the effects of high channel covariance. Based on our findings, we propose RealViformer. This channel-attention-based real-world VSR framework surpasses state-of-the-art on two real-world VSR datasets with fewer parameters and faster runtimes.


# 160
Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence

Hongyuan Wang · Lizhi Wang · Jiang Xu · Chang Chen · Xue Hu · Fenglong Song · Youliang Yan

Spectral super-resolution that aims to recover hyperspectral image (HSI) from easily obtainable RGB image has drawn increasing interest in the field of computational photography. The crucial aspect of spectral super-resolution lies in exploiting the correlation within HSIs. However, two types of bottlenecks in existing Transformers limit performance improvement and practical applications. First, existing Transformers often separately emphasize either spatial-wise or spectral-wise correlation, disrupting the 3D features of HSI and hindering the exploitation of unified spatial-spectral correlation. Second, existing self-attention mechanism always establishes full-rank correlation matrix by learning the correlation between pairs of tokens, leading to its inability to describe linear dependence widely existing in HSI among multiple tokens. To address these issues, we propose a novel Exhaustive Correlation Transformer (ECT) for spectral super-resolution. First, we propose a Spectral-wise Discontinuous 3D (SD3D) splitting strategy, which models unified spatial-spectral correlation by integrating spatial-wise continuous splitting strategy and spectral-wise discontinuous splitting strategy. Second, we propose a Dynamic Low-Rank Mapping (DLRM) model, which captures linear dependence among multiple tokens through a dynamically calculated low-rank dependence map. By integrating unified spatial-spectral attention and linear dependence, our ECT can model exhaustive correlation within HSI. The experimental results on both simulated and real data indicate that our method achieves state-of-the-art performance. Codes and pretrained models are available at https://anonymous.4open.science/r/ECT-FC89.


# 164
Strong Double Blind
Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems

Yasar Utku Alcalar · Mehmet Akcakaya

Diffusion models have emerged as powerful generative techniques for solving inverse problems. Despite their success in a variety of inverse problems in imaging, these models require many steps to converge, leading to slow inference time. Recently, there has been a trend in diffusion models for employing sophisticated noise schedules that involve more frequent iterations of timesteps at lower noise levels, thereby improving image generation and convergence speed. However, application of these ideas to solving inverse problems with diffusion models remain challenging, as these noise schedules do not perform well when using empirical tuning for the forward model log-likelihood term weights. To tackle these challenges, we propose zero-shot approximate diffusion posterior sampling (ZAPS) that leverages connections to zero-shot physics-driven deep learning. ZAPS uses the recently proposed diffusion posterior sampling (DPS), and fixes the number of sampling steps. Subsequently it uses zero-shot training with a physics-guided loss function to learn log-likelihood weights at each irregular timestep. We further approximate the Hessian of the logarithm of the prior using a diagonalization approach with learnable diagonal entries for computational efficiency. These parameters are optimized over a fixed number of epochs with a given computational budget. Our results for various noisy inverse problems, including Gaussian and motion deblurring, inpainting, and super-resolution show that ZAPS reduces inference time, provides robustness to irregular noise schedules and improves reconstruction quality.


# 288
Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction

Jeffrey Wen · Rizwan Ahmad · Phillip Schniter

In imaging inverse problems, one seeks to recover an image from missing/corrupted measurements. Because such problems are ill-posed, there is great motivation to quantify the uncertainty induced by the measurement-and-recovery process. Motivated by applications where the recovered image is used for a downstream task, such as soft-output classification, we propose a task-centered approach to uncertainty quantification. In particular, we use conformal prediction to construct an interval that is guaranteed to contain the task output from the true image up to a user-specified probability, and we use the width of that interval to quantify the uncertainty contributed by measurement-and-recovery. For posterior-sampling-based image recovery, we construct locally adaptive prediction intervals. Furthermore, we propose to collect measurements over multiple rounds, stopping as soon as the task uncertainty falls below an acceptable level. We demonstrate our methodology on accelerated magnetic resonance imaging (MRI).


# 150
Strong Double Blind
Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction

Bingyu Xin · Meng Ye · Leon Axel · Dimitris N. Metaxas

Magnetic Resonance Imaging (MRI) is a widely used imaging modality for clinical diagnostics and the planning of surgical interventions. Accelerated MRI seeks to mitigate the inherent limitation of long scanning time by reducing the amount of raw $k$-space data required for image reconstruction. Recently, the deep unrolled model (DUM) has demonstrated significant effectiveness and improved interpretability for MRI reconstruction, by truncating and unrolling the conventional iterative reconstruction algorithms with deep neural networks. However, the potential of DUM for MRI reconstruction has not been fully exploited. In this paper, we first enhance the gradient and information flow within and across iteration stages of DUM, then we highlight the importance of using various adjacent information for accurate and memory-efficient sensitivity map estimation and improved multi-coil MRI reconstruction. Extensive experiments on several public MRI reconstruction datasets show that our method outperforms existing MRI reconstruction methods by a large margin. The code will be released publicly.


# 140
Strong Double Blind
Wavelet Convolutions for Large Receptive Fields

Shahaf Finder · Roy Amoyal · Eran Treister · Oren Freifeld

In recent years, there have been attempts to increase the kernel size of Convolutional Neural Nets (CNNs) to mimic the global receptive field of Vision Transformers' (ViTs) self-attention blocks. That approach, however, quickly hit an upper bound and saturated way before achieving a global receptive field. In this work, we demonstrate that by leveraging the Wavelet Transform (WT), it is, in fact, possible to obtain very large receptive fields without suffering from over-parameterization, e.g., for a k times k receptive field, the number of trainable parameters in the proposed method grows only logarithmically with k. The proposed layer, named WTConv, can be used as a drop-in replacement in existing architectures, results in an effective multi-frequency response, and scales gracefully with the size of the receptive field. We demonstrate the effectiveness of the WTConv layer within ConvNeXt and MobileNetV2 architectures for image classification, as well as backbones for downstream tasks, and show it yields additional properties such as robustness to image corruption and increased response to shapes over textures. Our code will be released upon acceptance.


# 234
Strong Double Blind
Long-term Temporal Context Gathering for Neural Video Compression

Linfeng Qi · Zhaoyang Jia · Jiahao Li · Bin Li · Houqiang Li · Yan Lu

Most existing neural video codecs (NVCs) only extract short-term temporal context by optical flow-based motion compensation. However, such short-term temporal context suffers from error propagation and lacks awareness of long-term relevant information. This limits their performance, particularly in a long prediction chain. In this paper, we address the issue by facilitating the synergy of both long-term and short-term temporal contexts during feature propagation. Specifically, we introduce a Long-term Temporal Context Gathering (LTCG) module to search the diverse and relevant context from the long-term reference feature. The searched long-term context is leveraged to refine the feature propagation by integrating into the short-term reference feature, which can enhance the reconstruction quality and mitigate the propagation errors. During the search process, how to distinguish the helpful context and filter the irrelevant information is challenging and vital. To this end, we cluster the reference feature and perform the searching process in an intra-cluster fashion to improve the context mining. This synergistic integration of long-term and short-term temporal contexts can significantly enhance the temporal correlation modeling. Additionally, to improve the probability estimation in variable-bitrate coding, we introduce the quantization parameter as an extra prior to the entropy model. Comprehensive evaluations demonstrate the effectiveness of our method, which offers an average 11.3\% bitrate saving over the ECM on 1080p video datasets, using the single intra-frame setting.


# 142
Strong Double Blind
Implicit Neural Models to Extract Heart Rate from Video

Pradyumna Chari · Anirudh Bindiganavale Harish · Adnan Armouti · Alexander Vilesov · Sanjit Sarda · Laleh Jalilian · Achuta Kadambi

Scene representation networks, or implicit neural representations (INR) have seen a range of success in numerous image and video applications. However, being universal function fitters, they fit all variations in videos without any selectivity. This is particularly a problem for tasks such as remote plethysmography, the extraction of heart rate information from face videos. As a result of low native signal to noise ratio, previous signal processing techniques suffer from poor performance, while previous learning-based methods have improved performance but suffer from hallucinations that mitigate generalizability. Directly applying prior INRs cannot remedy this signal strength deficit, since they fit to both the signal as well as interfering factors. In this work, we introduce an INR framework that increases this plethysmograph signal strength. Specifically, we leverage architectures to have selective representation capabilities. We are able to decompose the face video into a blood plethysmograph component, and a face appearance component. By inferring the plethysmograph signal from this blood component, we show state-of-the-art performance on out-of-distribution samples without sacrificing performance for in-distribution samples. We implement our framework on a custom-built multiresolution hash encoding backbone to enable practical dataset-scale representations through a 50x speed-up over traditional INRs. We also present a dataset of optically challenging out-of-distribution scenes to test generalization to real-world scenarios. Code and data may be found at https://implicitppg.github.io/.


# 176
A Watermark-Conditioned Diffusion Model for IP Protection

Rui Min · Sen Li · Hongyang Chen · Minhao Cheng

The ethical need to protect AI-generated content has been a significant concern in recent years. While existing watermarking strategies have demonstrated success in detecting synthetic content (detection), there has been limited exploration in identifying the users responsible for generating these outputs from a single model (owner identification). In this paper, we focus on both practical scenarios and propose a unified watermarking framework for content copyright protection within the context of diffusion models. Specifically, we consider two parties: the model provider, who grants public access to a diffusion model via an API, and the users, who can solely query the model API and generate images in a black-box manner. Our task is to embed hidden information into the generated contents, which facilitates further detection and owner identification. To tackle this challenge, we propose a Watermark-conditioned Diffusion model called WaDiff, which manipulates the watermark as a conditioned input and incorporates fingerprinting into the generation process. All the generative outputs from our WaDiff carry user-specific information, which can be recovered by an image extractor and further facilitate forensic identification. Extensive experiments are conducted on two popular diffusion models, and we demonstrate that our method is effective and robust in both the detection and owner identification tasks. Meanwhile, our watermarking framework only exerts a negligible impact on the original generation and is more stealthy and efficient in comparison to existing watermarking strategies.


# 284
Strong Double Blind
Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures

Jiaxing Huang · Yanfeng Zhou · Yaoru Luo · Guole Liu · Heng Guo · Ge Yang

Accurate segmentation of long and thin tubular structures is required in a wide variety of areas such as biology, medicine, and remote sensing. The complex topology and geometry of such structures often pose significant technical challenges. A fundamental property of such structures is their topological self-similarity, which can be quantified by fractal features such as fractal dimension (FD). In this study, we incorporate fractal features into a deep learning model by extending FD to the pixel-level using a sliding window technique. The resulting fractal feature maps (FFMs) are then incorporated as additional input to the model and additional weight in the loss function to enhance segmentation performance by utilizing the topological self-similarity. Moreover, we extend the U-Net architecture by incorporating an edge decoder and a skeleton decoder to improve boundary accuracy and skeletal continuity of segmentation, respectively. Extensive experiments on five tubular structure datasets validate the effectiveness and robustness of our approach. Furthermore, the integration of FFMs with other popular segmentation models such as HR-Net also yields performance enhancement, suggesting FFM can be incorporated as a plug-in module with different model architectures. Code and data are openly accessible at https://xxx.


# 54
Strong Double Blind
Image Manipulation Detection With Implicit Neural Representation and Limited Supervision

Zhenfei Zhang · Mingyang Li · Xin Li · Ming-Ching Chang · Jun-Wei Hsieh

Image Manipulation Detection (IMD) is becoming increasingly important as tampering technologies advance. However, most state-of-the-art (SoTA) methods require high-quality training datasets featuring image- and pixel-level annotations. The effectiveness of these methods suffers when applied to manipulated or noisy samples that differ from the training data. To address these challenges, we present a unified framework that combines unsupervised and weakly supervised approaches for IMD. Our approach introduces a novel pre-processing stage based on a controllable fitting function from Implicit Neural Representation (INR). Additionally, we introduce a new selective pixel-level contrastive learning approach, which concentrates exclusively on high-confidence regions, thereby mitigating uncertainty associated with the absence of pixel-level labels. In weakly supervised mode, we utilize ground-truth image-level labels to guide predictions from an adaptive pooling method, facilitating comprehensive exploration of manipulation regions for image-level detection. The unsupervised model is trained using a self-distillation training method with selected high-confidence pseudo-labels obtained from the deepest layers via different sources. Extensive experiments demonstrate that our proposed method outperforms existing unsupervised and weakly supervised methods. Moreover, it competes effectively against fully supervised methods on novel manipulation detection tasks.


# 42
DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks

Caixin Kang · Yinpeng Dong · Zhengyi Wang · Shouwei Ruan · Yubo Chen · Hang Su · Xingxing Wei

Adversarial attacks, particularly patch attacks, pose significant threats to the robustness and reliability of deep learning models. Developing reliable defenses against patch attacks is crucial for real-world applications. This paper introduces DIFFender, a novel defense framework that harnesses the capabilities of a text-guided diffusion model to combat patch attacks. Central to our approach is the discovery of the Adversarial Anomaly Perception (AAP) phenomenon, which empowers the diffusion model to detect and localize adversarial patches through the analysis of distributional discrepancies. DIFFender integrates dual tasks of patch localization and restoration within a single diffusion model framework, utilizing their close interaction to enhance defense efficacy. Moreover, DIFFender utilizes vision-language pre-training coupled with an efficient few-shot prompt-tuning algorithm, which streamlines the adaptation of the pre-trained diffusion model to defense tasks, thus eliminating the need for extensive retraining. Our comprehensive evaluation spans image classification and face recognition tasks, extending to real-world scenarios, where DIFFender shows good robustness against adversarial attacks. The versatility and generalizability of DIFFender are evident across a variety of settings, classifiers, and attack methodologies, marking an advancement in adversarial patch defense strategies.


# 143
Strong Double Blind
Learning Natural Consistency Representation for Face Forgery Video Detection

Daichi Zhang · Zihao Xiao · Shikun Li · Fanzhao Lin · Jianmin Li · Shiming Ge

Face Forgery videos have elicited critical social public concerns and various detectors have been proposed. However, fully-supervised detectors may lead to easily overfitting to specific forgery methods or videos, and existing self-supervised detectors are strict on auxiliary tasks, such as requiring audio or multi-modalities, leading to limited generalization and robustness. In this paper, we examine whether we can solve this issue by leveraging visual-only real face videos. To this end, we propose to learn the Natural Consistency representation (NACO) of real face videos in a self-supervised manner, which is inspired by the observation that fake videos struggle to maintain the natural spatiotemporal consistency even under unknown forgery methods and different perturbations. Our NACO first extracts spatial features of each frame by CNNs then integrates them into Transformer to learn the long-range spatiotemporal representation, leveraging the advantages of CNNs and Transformer on local spatial receptive field and long-term memory respectively. Furthermore, a Spatial Predictive Module (SPM) and a Temporal Contrastive Module (TCM) are introduced to enhance the natural consistency representation learning. The SPM aims to predict random masked spatial features from spatiotemporal representation, and the TCM regularizes the latent distance of spatiotemporal representation by shuffling the natural order to disturb the consistency, which could both force our NACO more sensitive to the natural spatiotemporal consistency. After the representation learning stage, a MLP head is fine-tuned to perform the usual forgery video classification task. Extensive experiments show that our method outperforms other state-of-the-art competitors with impressive generalization and robustness.


# 259
Strong Double Blind
ARoFace: Alignment Robustness to Improve Low-quality Face Recognition

Mohammad Saeed Ebrahimi Saadabadi · Sahar Rahimi Malakshan · Ali Dabouei · Nasser Nasrabadi

Aiming to enhance Face Recognition (FR) performance on Low-Quality (LQ) inputs, recent studies suggest incorporating synthetic LQ samples into training. Although promising, the quality factors that are considered in these works are general rather than FR-specific, \eg, atmospheric turbulence, blur, resolution, \etc. Motivated by our observation of the vulnerability of current FR models to even mild Face Alignment Errors (FAE) in LQ images, we present a simple yet effective method that considers FAE as another quality factor that is tailored to FR. We seek to improve LQ FR by enhancing FR models' robustness to FAE. To this aim, we formalize the problem as a combination of differentiable spatial transformations and adversarial data augmentation in FR. We perturb the alignment of training samples using a controllable spatial transformation and enrich the training with samples expressing FAE. We demonstrate the benefits of the proposed method by conducting evaluations on IJB-B, IJB-C, IJB-S (+4.3\% Rank1), and TinyFace (+2.63\%).


# 141
AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

Kaishen Yuan · Zitong Yu · Xin Liu · Weicheng Xie · Huanjing Yue · Jingyu Yang

Facial Action Units (AU) is a vital concept in the realm of affective computing, and AU detection has always been a hot research topic. Existing methods suffer from overfitting issues due to the utilization of a large number of learnable parameters on scarce AU-annotated datasets or heavy reliance on substantial additional relevant data. Parameter-Efficient Transfer Learning (PETL) provides a promising paradigm to address these challenges, whereas its existing methods lack design for AU characteristics. Therefore, we innovatively investigate PETL paradigm to AU detection, introducing AUFormer and proposing a novel Mixture-of-Knowledge Expert (MoKE) collaboration mechanism. An individual MoKE specific to a certain AU with minimal learnable parameters first integrates personalized multi-scale and correlation knowledge. Then the MoKE collaborates with other MoKEs in the expert group to obtain aggregated information and inject it into the frozen Vision Transformer (ViT) to achieve parameter-efficient AU detection. Additionally, we design a Margin-truncated Difficulty-aware Weighted Asymmetric Loss (MDWA-Loss), which can encourage the model to focus more on activated AUs, differentiate the difficulty of unactivated AUs, and discard potential mislabeled samples. Extensive experiments from various perspectives, including within-domain, cross-domain, data efficiency, and micro-expression domain, demonstrate AUFormer's state-of-the-art performance and robust generalization abilities without relying on additional relevant data. The code will be available on GitHub.


# 256
Strong Double Blind
PetFace: A Large-Scale Dataset and Benchmark for Animal Identification

Risa Shinoda · Kaede Shiohara

Automated animal face identification plays a crucial role in the monitoring of behaviors, conducting of surveys, and finding of lost animals. Despite the advancements in human face identification, the lack of datasets and benchmarks in the animal domain has impeded progress. In this paper, we introduce the PetFace dataset, a comprehensive resource for animal face identification encompassing 257,484 unique individuals across 13 animal families and 319 breed categories, including both experimental and pet animals. This large-scale collection of individuals facilitates the investigation of unseen animal face verification, an area that has not been sufficiently explored in existing datasets due to the limited number of individuals. Moreover, PetFace also has fine-grained annotations such as sex, breed, color, and pattern. We provide multiple benchmarks including re-identification for seen individuals and verification for unseen individuals. The models trained on our dataset outperform those trained on prior datasets, even for detailed breed variations and unseen animal families. Our result also indicates that there is some room to improve the performance of integrated identification on multiple animal families. We hope the PetFace dataset will facilitate animal face identification and encourage the development of non-invasive animal automatic identification methods.


# 137
Strong Double Blind
Enhancing Cross-Subject fMRI-to-Video Decoding with Global-Local Functional Alignment

Chong Li · Xuelin Qian · Yun Wang · Jingyang Huo · Xiangyang Xue · Yanwei Fu · Jianfeng Feng

Advancements in brain imaging enable the decoding of thoughts and intentions from neural activities. However, the fMRI-to-video decoding of brain signals across multiple subjects encounters challenges arising from structural and coding disparities among individual brains, further compounded by the scarcity of paired fMRI-stimulus data. Addressing this issue, this paper introduces the fMRI Global-Local Functional Alignment (GLFA) projection, a novel approach that aligns fMRIs from diverse subjects into a unified brain space, thereby enhancing cross-subject decoding. Additionally, we present a meticulously curated fMRI-video paired dataset comprising a total of 75k fMRI-stimulus paired samples from 8 individuals. This dataset is approximately 4.5 times larger than the previous benchmark dataset. Building on this, we augment a transformer-based fMRI encoder with a diffusion video generator, delving into the realm of cross-subject fMRI-based video reconstruction. This innovative methodology faithfully captures semantic information from diverse brain signals, resulting in the generation of vivid videos and achieving an impressive average accuracy of 84.7\% in cross-subject semantic classification tasks.


# 230
Strong Double Blind
Occlusion-Aware Seamless Segmentation

Yihong Cao · Jiaming Zhang · Hao Shi · Kunyu Peng · Yuhongxuan Zhang · Hui Zhang · Rainer Stiefelhagen · Kailun Yang

Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all these three challenges. For benchmarking OASS, we establish a new human-annotated dataset for Blending Panoramic Amodal Seamless Segmentation, ie, BlendPASS. Besides, we propose the first solution UnmaskFormer, aiming at unmasking the narrow FoV, occlusions, and domain gaps all at once. Specifically, UnmaskFormer includes the crucial designs of Unmasking Attention (UA) and Amodal-oriented Mix (AoMix). Our method achieves state-of-the-art performance on the BlendPASS dataset, reaching a remarkable mAPQ of 26.58% and mIoU of 43.66%. On public panoramic semantic segmentation datasets, ie, SynPASS and DensePASS, our method outperforms previous methods and obtains 45.34% and 48.08% in mIoU, respectively. The fresh BlendPASS dataset and our source code will be made publicly available.


# 257
Strong Double Blind
Keypoint Promptable Re-Identification

Vladimir Somers · Alexandre ALahi · Christophe De Vleeschouwer

Occluded Person Re-Identification (ReID) is a metric learning task that involves matching occluded individuals based on their appearance. While many studies have tackled occlusions caused by objects, multi-person occlusions remain less explored. In this work, we identify and address a critical challenge overlooked by previous occluded ReID methods: the Multi-Person Ambiguity (MPA) arising when multiple individuals are visible in the same bounding box, making it impossible to determine the intended ReID target among the candidates. Inspired by recent work on prompting in vision, we introduce Keypoint Promptable ReID (KPR), a novel formulation of the ReID problem that explicitly complements the input bounding box with a set of semantic keypoints indicating the intended target. Since promptable re-identification is an unexplored paradigm, existing ReID datasets lack the pixel-level annotations necessary for prompting. To bridge this gap and foster further research on this topic, we introduce Occluded PoseTrack-ReID, a novel ReID dataset with keypoints labels, that features strong inter-person occlusions. Furthermore, we release custom keypoint labels for four popular ReID benchmarks. Experiments on person retrieval, but also on pose tracking, demonstrate that our method systematically surpasses previous state-of-the-art approaches on various occluded scenarios.


# 253
CoTracker: It is Better to Track Together

Nikita Karaev · Ignacio Rocco · Ben Graham · Natalia Neverova · Andrea Vedaldi · Christian Rupprecht

We introduce CoTracker, a transformer-based model that tracks dense points in a frame jointly across a video sequence. This differs from most existing state-of-the-art approaches that track points independently, ignoring their correlation. We show that joint tracking results in a significantly higher tracking accuracy and robustness. We also provide several technical innovations, including the concept of virtual tracks, which allows CoTracker to track 70k points jointly and simultaneously. Furthermore, CoTracker operates causally on short windows (hence, it is suitable for online tasks), but is trained by unrolling the windows across longer video sequences, which enables and significantly improves long-term tracking. We demonstrate qualitatively impressive tracking results, where points can be tracked for a long time even when they are occluded or leave the field of view. Quantitatively, CoTracker outperforms all recent trackers on standard bench- marks, often by a substantial margin.


# 246
Free Lunch for Gait Recognition: A Novel Relation Descriptor

Jilong Wang · Saihui Hou · Yan Huang · Chunshui Cao · Xu Liu · Yongzhen Huang · Tianzhu Zhang · Liang Wang

Gait recognition is to seek correct matches for query individuals by their unique walking patterns. However, current methods focus solely on extracting individual-specific features, overlooking ``interpersonal" relationships. In this paper, we propose a novel Relation Descriptor that captures not only individual features but also relations between test gaits and pre-selected gait anchors. Specifically, we reinterpret classifier weights as gait anchors and compute similarity scores between test features and these anchors, which re-expresses individual gait features into a similarity relation distribution. In essence, the relation descriptor offers a holistic perspective that leverages the collective knowledge stored within the classifier's weights, emphasizing meaningful patterns and enhancing robustness. Despite its potential, relation descriptor poses dimensionality challenges since its dimension depends on the training set's identity count. To address this, we propose Farthest gait-Anchor Selection to identify the most discriminative gait anchors and an Orthogonal Regularization Loss to increase diversity within gait anchors. Compared to individual-specific features extracted from the backbone, our relation descriptor can boost the performance nearly without any extra costs. We evaluate the effectiveness of our method on the popular GREW, Gait3D, OU-MVLP, CASIA-B, and CCPG, showing that our method consistently outperforms the baselines and achieves state-of-the-art performance.


# 236
Strong Double Blind
S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition

Mohamed Abdelfattah · Alexandre ALahi

Masked self-reconstruction of joints has been shown to be a promising pretext task for self-supervised skeletal action recognition. However, this task focuses on predicting isolated, potentially noisy, joint coordinates, which results in inefficient utilization of the model capacity. In this paper, we introduce S-JEPA, Skeleton Joint Embedding Predictive Architecture, which uses a novel pretext task: Given a partial 2D skeleton sequence, our objective is to predict the latent representations of its 3D counterparts. Such representations serve as abstract prediction targets that direct the modelling power towards learning the high-level context and depth information, instead of unnecessary low-level details. To tackle the potential non-uniformity in these representations, we propose a simple centering operation that is found to benefit training stability, effectively leading to strong off-the-shelf action representations. Extensive experiments show that S-JEPA, combined with the vanilla transformer, outperforms previous state-of-the-art results on NTU60, NTU120, and PKU-MMD datasets. Codes will be available upon publication.


# 238
SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition

Jeonghyeok Do · Munchurl Kim

Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.


# 248
Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition

Sumin Lee · Yooseung Wang · Sangmin Woo · Changick Kim

Panoramic Activity Recognition (PAR) seeks to identify diverse human activities across different scales, from individual actions to social group and global activities in crowded panoramic scenes. PAR presents two major challenges: 1) recognizing the nuanced interactions among numerous individuals and 2) understanding multi-granular human activities. To address these, we propose Social Proximity-aware Dual-Path Network (SPDP-Net) based on two key design principles. First, while previous works often focus on spatial distance among individuals within an image, we argue to consider the spatio-temporal proximity. It is crucial for individual relation encoding to correctly understand social dynamics. Secondly, deviating from existing hierarchical approaches (individual-to-social-to-global activity), we introduce a dual-path architecture for multi-granular activity recognition. This architecture comprises individual-to-global and individual-to-social paths, mutually reinforcing each other's task with global-local context through multiple layers. Through extensive experiments, we validate the effectiveness of the spatio-temporal proximity among individuals and the dual-path architecture in PAR. Furthermore, SPDP-Net achieves new state-of-the-art performance with 46.5% of overall F1 score on JRDB-PAR dataset.


# 31
Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition

Masashi Hatano · Ryo Hachiuma · Ryo Fujii · Hideo Saito

We address a novel cross-domain few-shot learning task (CD-FSL) with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges associated with egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (e.g., daily life vs. industrial domain) and (2) the computational cost for real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and improve inference speed. To address the first challenge, we propose the incorporation of multimodal distillation into the student RGB model using teacher models. Each teacher model is trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model's adaptability to the target domain. We further introduce ensemble masked inference, a technique that reduces the number of input tokens through masking. In this approach, ensemble prediction mitigates the performance degradation caused by masking, effectively addressing the second issue. Our approach outperformed the state-of-the-art CD-FSL approaches with a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points for 1-shot/5-shot settings while achieving 2.2 times faster inference speed.


# 240
Strong Double Blind
Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment

Zhanzhong Pang · Fadime Sener · Shrinivas Ramasubramanian · Angela Yao

Temporal action segmentation assigns an action label to each frame in untrimmed procedural activity videos. Such videos often exhibit a long-tailed action distribution due to the varying action frequencies and durations. However, state-of-the-art temporal action segmentation methods typically overlook the long-tail problem and fail to recognize tail actions. Existing long-tail methods make class-independent assumptions and fall short in identifying tail classes when applied to temporal action segmentation frameworks. To address these, we propose a novel group-wise temporal logit adjustment~(G-TLA) framework that combines a group-wise softmax formulation leveraging activity information with temporal logit adjustment which utilizes action order. We evaluate our approach across five temporal segmentation benchmarks, where our framework demonstrates significant improvements in tail recognition, highlighting its effectiveness.


# 61
Look Around and Learn: Self-Training Object Detection by Exploration

Gianluca Scarpellini · Stefano Rosa · Pietro Morerio · Lorenzo Natale · Alessio Del Bue

When an object detector is deployed in a novel setting it often experiences a drop in performance. This paper studies how an embodied agent can automatically fine-tune a pre-existing object detector while exploring and acquiring images in a new environment without relying on human intervention, i.e., a fully self-supervised approach. In our setting, an agent initially learns to explore the environment using a pre-trained off-the-shelf detector to locate objects and associate pseudo-labels. By assuming that pseudo-labels for the same object must be consistent across different views, we learn the exploration policy "Look Around" to mine hard samples, and we devise a novel mechanism called "Disagreement Reconciliation" for producing refined pseudo-labels from the consensus among observations. We implement a unified benchmark of the current state-of-the-art and compare our approach with pre-existing exploration policies and perception mechanisms. Our method is shown to outperform existing approaches, improving the object detector by 6.2% in a simulated scenario, a 3.59% advancement over other state-of-the-art methods, and by 9.97% in the real robotic test without relying on ground-truth. Code for the proposed approach and implemented baselines together with the dataset will be made available upon paper acceptance.


# 247
Strong Double Blind
Interaction-centric Spatio-Temporal Context Reasoning for Multi-Person Video HOI Recognition

Yisong Wang · Nan Xi · Jingjing Meng · Junsong Yuan

Understanding human-object interaction (HOI) in videos represents a fundamental yet intricate challenge in computer vision, requiring perception and reasoning across both spatial and temporal domains. Despite previous success of object detection and tracking, multi-person video HOI recognition still faces two major challenges: (1) the three facets of HOI (human, objects, and the interactions that bind them) exhibit interconnectedness and exert mutual influence upon one another. (2) the complexity of multi-person multi-object combinations in spatio-temporal interaction. To address them, we design a spatio-temporal context fuser to better model the interactions among persons and objects in videos. Furthermore, to equip the model with temporal reasoning capacity, we propose an interaction state reasoner module on top of context fuser. Considering the interaction is a key element to bind human and object, we propose an interaction-centric hypersphere in the feature embedding space to model each category of interaction. It helps to learn the distribution of HOI samples belonging to the same interactions on the hypersphere. After training, each interaction prototype sphere will fit the testing HOI sample to determine the HOI classification result. Empirical results on multi-person video HOI dataset MPHOI-72 indicate that our method remarkably surpasses state-of-the-art (SOTA) method by more than 22% F_1 score. At the same time, on single-person datasets Bimanual Actions (single-human two-hand HOI) and CAD-120 (single-human HOI), our method achieves on par or even better results compared with SOTA methods.


# 239
Strong Double Blind
Self-Supervised Video Copy Localization with Regional Token Representation

Minlong Lu · Yichen Lu · Siwei Nie · Xudong Yang · Xiaobo Zhang

The task of video copy localization aims at finding the start and end timestamps of all copied segments within a pair of untrimmed videos. Recent approaches usually extract frame-level features and generate a frame-to-frame similarity map for the video pair. Learned detectors are used to identify distinctive patterns in the similarity map to localize the copied segments. There are two major limitations associated with these methods. First, they often rely on a single feature for each frame, which is inadequate in capturing local information for typical scenarios in video copy editing, such as picture-in-picture cases. Second, the training of the detectors requires a significant amount of human annotated data, which is highly expensive and time-consuming to acquire. In this paper, we propose a self-supervised video copy localization framework to tackle these issues. We incorporate a Regional Token into the Vision Transformer, which learns to focus on local regions within each frame using an asymmetric training procedure. A novel strategy that leverages the Transitivity Property is proposed to generate copied video pairs automatically, which facilitates the training of the detector. Extensive experiments and visualizations demonstrate the effectiveness of the proposed approach, which is able to outperform the state-of-the-art without using any human annotated data.


# 232
Strong Double Blind
General and Task-Oriented Video Segmentation

Mu Chen · Liulei Li · Wenguan Wang · Ruijie Quan · Yi Yang

We present GvSeg, a generalist video segmentation framework for addressing four different video segmentation tasks (i.e., instance, semantic, panoptic, and exemplar-guided) while maintaining an identical architectural design. Currently, there is a trend towards developing general video segmentation solutions that can be applied across multiple tasks. This streamlines research endeavors and simplifies deployment. However, such a highly homogenized framework in current design, where each element maintains uniformity, could overlook the inherent diversity among different tasks and lead to suboptimal performance. To tackle this, GvSeg: i) provides a holistic disentanglement and modeling for segment targets, thoroughly examining them from the perspective of appearance, position, and shape, and on this basis, ii) reformulates the query initialization, matching and sampling strategies in alignment with the task-specific requirement. These architecture-agnostic innovations empower GvSeg to effectively address each unique task by accommodating the specific properties that characterize them. Extensive experiments on seven gold-standard benchmark datasets demonstrate that GvSeg surpasses all existing specialized/general solutions by a significant margin on four different video segmentation tasks. Our code shall be released.


# 78
Strong Double Blind
Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

Hao Fang · Peng Wu · Yawei Li · Xinxin Zhang · Xiankai Lu

Open-Vocabulary Video Instance Segmentation(VIS) is attracting increasing attention due to its ability to segment and track arbitrary objects. However, the recent Open-Vocabulary VIS attempts obtained unsatisfactory results, especially in terms of generalization ability of novel categories. We discover that the domain gap between the VLM features and the instance queries and the underutilization of temporal consistency are two central causes. To mitigate these issues, we design and train a novel Open-Vocabulary VIS baseline called OVFormer. OVFormer utilizes a lightweight module for unified embedding alignment between query embeddings and CLIP image embeddings to remedy the domain gap. Unlike previous image-based training methods, we conduct video-based model training and deploy a semi-online inference scheme to fully mine the temporal consistency in the video. Without bells and whistles, OVFormer achieves 21.9 mAP with a ResNet-50 backbone on LV-VIS, exceeding the previous state-of-the-art performance by +7.7(an improvement of 54% over OV2Seg). Extensive experiments on some Close-Vocabulary VIS datasets also demonstrate the strong zero-shot generalization ability of OVFormer (+7.6 mAP on YouTube-VIS 2019, +3.9 mAP on OVIS). Code is available at https://github.com/Anonymous668/OVFormer.


# 46
Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Alexandre Eymaël · Renaud Vandeghen · Anthony Cioppa · Silvio Giancola · Bernard Ghanem · Marc Van Droogenbroeck

Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method specifically differs by exclusively considering pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performances and drastically reducing pre-training time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn objects from motion, but rather thanks to the Siamese architecture. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches.


# 241
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Tanveer Hannan · Mohaiminul Islam · Thomas Seidl · Gedas Bertasius

Locating specific moments within long videos (20–120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short video (5–30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, crucial for specific moment detection. We propose RGNet which deeply integrates clip retrieval and grounding into a single network capable of processing long videos into multiple granular levels, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularity jointly. Moreover, we introduce a contrastive clip sampling technique to mimic the long video paradigm closely during training. RGNet surpasses prior methods, showcasing state-of-the-art performance on long video temporal grounding (LVTG) datasets MAD and Ego4D. The code is available in the supplementary materials.


# 134
Strong Double Blind
Referring Atomic Video Action Recognition

Kunyu Peng · Jia Fu · Kailun Yang · Di Wen · Yufan Chen · Ruiping Liu · Junwei Zheng · Jiaming Zhang · Saquib Sarfraz · Rainer Stiefelhagen · Alina Roitberg

We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet -- a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization and harvest the prediction of the atomic actions for the referring person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion which amplify the most relevant information across these streams and consistently surpasses standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available.


# 132
Elysium: Exploring Object-level Perception in Videos through Semantic Integration Using MLLMs

Han Wang · Yanjie Wang · Ye Yongjie · Yuxiang Nie · Can Huang

Multimodal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains relatively unexplored. This lack of exploration is primarily due to two key challenges. Firstly, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Secondly, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with novel tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we conduct training of MLLMs and propose a token-compression model T-Selector to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that makes the first attempt to conduct object-level tasks in videos without requiring any additional plug-in or expert models. All codes and datasets will be released soon.


# 131
VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Xiaohan Wang · Yuhui Zhang · Orr Zohar · Serena Yeung-Levy

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.


# 133
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

Shicheng Li · Lei Li · Yi Liu · Shuhuai Ren · Yuanxin Liu · Rundong Gao · Xu Sun · Lu Hou

The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research. Our dataset is publicly available at https://github.com/lscpku/VITATECS.


# 127
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

Xiuyuan Chen · Yuan Lin · Yuchen Zhang · Weiran Huang

We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video-questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. By using instance-specific rules as prompt, GPT-4, as an automatic evaluator, can achieve a stable evaluation accuracy of around 97.0%, comparable to the 94.9% - 97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eight large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to human accuracy of 72.8%. By conducting an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension, and overly general responses.


# 128
Strong Double Blind
Learning Video Context as Interleaved Multimodal Sequences

Qinghong Lin · Pengchuan Zhang · Difei Gao · Xide Xia · Joya Chen · Ziteng Gao · Jinheng Xie · Xuhong Xiao · Mike Zheng Shou

Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who~\cite{autoad2}, relationship~\cite{lfvu}, and reason~\cite{movieqa}). In this paper, we introduce~{\our}, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or using offline models (such as whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of solely relying on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate \our's performance on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA) across five settings (video classifcation, audio description, video-text retrieval, video captioning, and video question-answering).


# 130
Strong Double Blind
Multi-Modal Video Dialog State Tracking in the Wild

Mohamed Adnen Abdessaied · Lei Shi · Andreas Bulling

We present MSTMIXER – a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short of two major aspects: (1) They either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect the complexity of real-world in the wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new SOTA results on five challenging benchmarks.


# 123
Towards Multimodal Sentiment Analysis Debiasing via Bias Purification

Dingkang Yang · Mingcheng Li · Dongling Xiao · Yang Liu · Kun Yang · Zhaoyu Chen · Yuzheng Wang · Peng Zhai · Ke Li · Lihua Zhang

Multimodal Sentiment Analysis (MSA) aims to understand human intentions by integrating emotion-related clues from diverse modalities, such as visual, language, and audio. Unfortunately, the current MSA task invariably suffers from unplanned dataset biases, particularly multimodal utterance-level label bias and word-level context bias. These harmful biases potentially mislead models to focus on statistical shortcuts and spurious correlations, causing severe performance bottlenecks. To alleviate these issues, we present a Multimodal Counterfactual Inference Sentiment (MCIS) analysis framework based on causality rather than conventional likelihood. Concretely, we first formulate a causal graph to discover harmful biases from already-trained vanilla models. In the inference phase, given a factual multimodal input, MCIS imagines two counterfactual scenarios to purify and mitigate these biases. Then, MCIS can make unbiased decisions from biased observations by comparing factual and counterfactual outcomes. We conduct extensive experiments on several standard MSA benchmarks. Qualitative and quantitative results show the effectiveness of the proposed framework.


# 206
Strong Double Blind
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Jian Ma · Wenguan Wang · Yi Yang · Feng Zheng

Visual acoustic matching (VAM) is pivotal for enhancing the immersive experience, and the task of dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenge ing to acquire, impeding the utilization of extensive unpaired data. In this paper, we introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks and overcome data scarcity. Furthermore, we employ the diffusion model as foundational conditional converters to circumvent the training instability and over-smoothing drawbacks of conventional GAN architectures. Specifically, MVSD employs two converters: one for VAM called reverberator and one for dereverberation called dereverberator. The dereverberator judges whether the reverberation audio generated by reverberator sounds like being in the conditional visual scenario, and vice versa. By forming a closed loop, these two converters can generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-way unpaired data. Extensive experiments on two standard benchmarks, i.e., SoundSpaces-Speech and Acoustic AVSpeech, exhibit that our framework can improve the performance of the reverberator and dereverberator and better match specified visual scenarios. Remarkably, the performance of the models can be further enhanced by adding more unpaired data.


# 1
Strong Double Blind
Rethinking Normalization Layers for Domain Generalizable Person Re-identification

Ren Nie · Jin Ding · Xue Zhou · Xi Li

Domain Generalizable Person Re-Identification (DG-ReID) strives to transfer learned feature representation from source domains to unseen target domains, despite significant distribution shifts. While most existing methods enhance model generalization and discriminative feature extraction capability by introducing Instance Normalization (IN) in combination with Batch Normalization (BN), these approaches still struggle with the overfitting of normalization layers to the source domains, posing challenges in domain generalization. To address this issue, we propose ReNorm, a purely normalization-based framework that integrates two complementary normalization layers through two forward propagations for the same weight matrix. In the first forward propagation, Remix Normalization (RN) combines IN and BN in a concise manner to ensure the feature extraction capability. As an effective complement to RN, Emulation Normalization (EN) simulates the testing process of RN, implicitly mitigating the domain shifts caused by the absence of target domain information and actively guiding the model in learning how to generalize the feature extraction capability to unseen target domains. Meanwhile, we propose Domain Frozen (DF), freezing updates to affine parameters to reduce the impact of these less robust parameters on overfitting to the source domains. Extensive experiments show that our framework achieves state-of-the-art performance among all the popular benchmarks. Code will be released upon publication.


# 59
Strong Double Blind
Dual-stage Hyperspectral Image Classification Model with Spectral Supertoken

Peifu Liu · Tingfa Xu · Jie Wang · Huan Chen · Huiyan Bai · Jianan Li

Hyperspectral image classification, a task that assigns pre-defined classes to each pixel in a hyperspectral image of remote sensing scenes, often faces challenges due to the neglect of correlations between spectrally similar pixels. This oversight can lead to inaccurate edge definitions and difficulties in managing minor spectral variations in contiguous areas. To address these issues, we introduce the novel Dual-stage Spectral Supertoken Classifier (DSTC), inspired by superpixel concepts. DSTC employs spectrum-derivative-based pixel clustering to group pixels with similar spectral characteristics into spectral supertokens. By projecting the classification of these tokens onto the image space, we achieve pixel-level results that maintain regional classification consistency and precise boundary. Moreover, recognizing the diversity within tokens, we propose a class-proportion-based soft label. This label adaptively assigns weights to different categories based on their prevalence, effectively managing data distribution imbalances and enhancing classification performance. Comprehensive experiments on WHU-OHS, IP, KSC, and UP datasets corroborate the robust classification capabilities of DSTC and the effectiveness of its individual components. The code will be made publicly available.


# 50
Strong Double Blind
Learning Representations of Satellite Images From Metadata Supervision

Jules Bourcier · Gohar Dashyan · Karteek Alahari · Jocelyn Chanussot

Self-supervised learning is increasingly applied to Earth observation problems that leverage satellite and other remotely sensed data. Within satellite imagery, metadata such as time and location often hold significant semantic information that improves scene understanding. In this paper, we introduce Satellite Metadata-Image Pretraining (SatMIP), a new approach for harnessing metadata in the pretraining phase through a flexible and unified multimodal learning objective. SatMIP represents metadata as textual captions and aligns images with metadata in a shared embedding space by solving a metadata-image contrastive task. Our model learns a non-trivial image representation that can effectively handle recognition tasks. We further enhance this model by combining image self-supervision and metadata supervision, introducing SatMIPS. As a result, SatMIPS improves over its image-image pretraining baseline, SimCLR, and accelerates convergence. Comparison against four recent contrastive and masked autoencoding-based methods for remote sensing also highlight the efficacy of our approach. Furthermore, we find that metadata supervision yields better scalability to larger backbones, and more robust hierarchical pretraining.


# 48
Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring

Sizhuo Li · Dimitri Gominski · Martin Brandt · Xiaoye Tong · Philippe Ciais

Image-level regression is an important task in Earth observation, where visual domain and label shifts are a core challenge hampering generalization. However, cross-domain regression with remote sensing data remains understudied due to the absence of suited datasets. We introduce a new dataset with aerial and satellite imagery in five countries with three forest-related regression tasks. To match real-world applicative interests, we compare methods through a restrictive setup where no prior on the target domain is available during training, and models are adapted with limited information during testing. Building on the assumption that ordered relationships generalize better, we propose manifold diffusion for regression as a strong baseline for transduction in low-data regimes. Our comparison highlights the comparative advantages of inductive and transductive methods in cross-domain regression.


# 260
Strong Double Blind
Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition

Sergio Izquierdo · Javier Civera

Visual Place Recognition (VPR) plays a critical role in many localization and mapping pipelines. It consists of retrieving the closest sample to a query image, in a certain embedding space, from a database of geotagged references. The image embedding is learned to effectively describe a place despite variations in visual appearance, viewpoint, and geometric changes. In this work, we formulate how limitations in the Geographic Distance Sensitivity of current VPR embeddings result in a high probability of incorrectly sorting the top-k retrievals, negatively impacting the recall. In order to address this issue in single-stage VPR, we propose a novel mining strategy, CliqueMining, that selects positive and negative examples by sampling cliques from a graph of visually similar images. Our approach boosts the sensitivity of VPR embeddings at small distance ranges, significantly improving the state of the art on relevant benchmarks. In particular, we raise recall@1 from 75% to 82% in MSLS Challenge, and from 76% to 90% in Nordland. Our models and code will be released upon acceptance and are provided as supplemental material.


# 138
AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale

Adam Pardyl · Michal Wronka · Maciej Wołczyk · Kamil Adamczewski · Tomasz Trzcinski · Bartosz Zielinski

Active Visual Exploration (AVE) is a task that involves dynamically selecting observations (glimpses), which is critical to facilitate comprehension and navigation within an environment. While modern AVE methods have demonstrated impressive performance, they are constrained to fixed-scale glimpses from rigid grids. In contrast, existing mobile platforms equipped with optical zoom capabilities can capture glimpses of arbitrary positions and scales. To address this gap between software and hardware capabilities, we introduce AdaGlimpse. It uses Soft Actor-Critic, a reinforcement learning algorithm tailored for exploration tasks, to select glimpses of arbitrary position and scale. This approach enables our model to rapidly establish a general awareness of the environment before zooming in for detailed analysis. Experimental results demonstrate that AdaGlimpse surpasses previous methods across various visual tasks while maintaining greater applicability in realistic AVE scenarios.


# 120
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

Pengxiang Ding · Han Zhao · Wenjie Zhang · Wenxuan Song · Min Zhang · Siteng Huang · Ningxi Yang · Donglin Wang

The important manifestation of robot intelligence is the ability to naturally interact and autonomously make decisions. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, simplifying system design but limiting the synergy between different information streams. This compartmentalization poses challenges in achieving seamless autonomous reasoning, decision-making, and action execution. To address these limitations, a novel paradigm, named Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA), has been introduced in this paper. This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making. The central idea is to elevate the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained instructions with visual perception information. This emphasizes the complexity involved in ensuring that the robot accurately interprets and acts upon detailed instructions in harmony with its visual observations. Consequently, we propose QUAdruped Robotic Transformer (QUART), a VLA model to integrate visual information and instructions from diverse modalities as input and generates executable actions for real-world robots and present QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset including navigation, complex terrain locomotion, and whole-body manipulation tasks for training QUART models. Our extensive evaluation shows that our approach leads to performant robotic policies and enables QUART to obtain a range of generalization capabilities.


# 121
Strong Double Blind
Navigation Instruction Generation with BEV Perception and Large Language Models

Sheng Fan · Rui Liu · Wenguan Wang · Yi Yang

Navigation instruction generation, which requires embodied agents to describe the navigation routes, has been of great interest in robotics and human-computer interaction. Existing studies directly map the sequence of 2D perspective observations to route descriptions. Though straightforward, they overlook the geometric information and object semantics of the 3D environment. To address these challenges, we propose BEVInstructor that incorporates Bird’s Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for embodied instruction generation. Specifically, BEVInstructor constructs a Perspective-BEV Visual Encoder to boost the comprehension of 3D environments by fusing the BEV and perspective features. The fused embeddings are served as visual prompts for MLLMs. To leverage the powerful language capabilities of MLLMs, perspective-BEV prompt tuning is proposed for parameter-efficient updating. Based on the perspective-BEV prompts, we further devise an instance-guided iterative refinement pipeline, which improves the instructions in a progressive manner. BEVInstructor achieves impressive performance across diverse datasets (i.e., R2R, REVERIE, and UrbanWalk). Our code will be released.


# 124
V-IRL: Grounding Virtual Intelligence in Real Life

Jihan YANG · Runyu Ding · Ellis L Brown · Qi Xiaojuan · Saining Xie

There is a sensory gulf between the Earth that humans inhabit and the digital realms in which modern AI agents are created. To develop AI agents that can sense, think, and act as flexibly as humans in real-world settings, it is imperative to bridge the realism gap between the digital and physical worlds. How can we embody agents in an environment as rich and diverse as the one we inhabit, without the constraints imposed by real hardware and control? Towards this end, we introduce \virl: a platform that enables agents to scalably interact with the real world in a virtual yet realistic environment. Our platform serves as a playground for developing agents that can accomplish various practical tasks and as a vast testbed for measuring progress in capabilities spanning perception, decision-making, and interaction with real-world data across the entire globe.


# 118
M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions

Mingsheng Li · Xin Chen · Chi Zhang · Sijin Chen · HONGYUAN ZHU · Fukun Yin · Zhuoyuan Li · Gang Yu · Tao Chen

Recently, the understanding of the 3D world has garnered increased attention, facilitating autonomous agents to perform further decision-making. However, the majority of existing 3D vision-language datasets and methods are often limited to specific tasks, limiting their applicability in diverse scenarios. The recent advance of Large Language Models (LLMs) and Multi-modal Language Models (MLMs) has shown mighty capability in solving various language and image tasks. Therefore, it is interesting to unlock MLM’s potential to be an omni 3D assistant for wider tasks. However, current MLMs’ research has been less focused on 3D due to the scarcity of large-scale visual-language datasets. In this work, we introduce M3DBench, a comprehensive multi-modal instruction dataset for complex 3D environments with over 320k instruction-response pairs that: 1) supports general interleaved multi-modal instructions with text, user clicks, images, and other visual prompts, 2) unifies diverse region- and scene-level 3D tasks, composing various fundamental abilities in real-world 3D environments. Furthermore, we establish a new benchmark for assessing the performance of large models in understanding interleaved multi-modal instructions. With extensive quantitative and qualitative experiments, we show the effectiveness of our dataset and baseline model in understanding complex human-environment interactions and accomplishing general 3D-centric tasks. We will release the data and code to accelerate future research on developing 3D MLMs.


# 122
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Raghav Kapoor · Yash Parag Butala · Melisa A Russak · Jing Yu Koh · Kiran Kamble · Waseem AlShikh · Ruslan Salakhutdinov

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.


# 112
Unifying 3D Vision-Language Understanding via Promptable Queries

ziyu zhu · Zhuofan Zhang · Xiaojian Ma · Xuesong Niu · Yixin Chen · Baoxiong Jia · Zhidong Deng · Siyuan Huang · Qing Li

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying different 3D scene representations (\ie, voxels, point clouds, multi-view images) into a common 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates superior performance on most tasks, setting new records on most benchmarks. Particularly, PQ3D boosts the state-of-the-art on ScanNet200 by 1.8% (AP), ScanRefer by 5.4% (acc@0.5), Multi3DRef by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with whatever 3D representations are available, e.g., solely relying on voxels.


# 119
UMBRAE: Unified Multimodal Brain Decoding

Weihao Xia · Raoul de Charette · Cengiz Oztireli · Jing-Hao Xue

In this paper, we aim to tackle the two prevailing challenges in brain-powered research. To extract instance-level conceptual and spatial details from neural signals, we introduce an efficient universal brain encoder for multimodal-brain alignment and recover object descriptions at multiple levels of granularity from subsequent multimodal large language models. To overcome unique brain patterns of different individuals, we introduce a cross-subject training strategy. This allows neural signals from multiple subjects to be trained within the same model without additional training resources or time, and benefits from user diversity, yielding better results than focusing on a single subject. To better assess our method, we have introduced a comprehensive brain understanding benchmark BrainHub. Experiments demonstrate that our proposed method UMBRAE not only achieves superior results in the newly introduced tasks but also outperforms models in established tasks with recognized metrics. Code and data will be made publicly available to facilitate further research.


# 126
Strong Double Blind
BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

Hee Suk Yoon · Eunseop Yoon · Joshua Tian Jin Tee · Kang Zhang · Yu-Jung Heo · Du-Seong Chang · Chang Yoo

Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both based on the dialogue context. Due to the lack of a large-scale dataset specifically for this task and the benefits of leveraging powerful pre-trained models, previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach. However, this approach can overlook crucial information about the image, hindering 1) image-grounded text response and 2) consistency of objects in the image response. In this paper, we propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content and the consistency of objects in sequential image responses. Through extensive experiments on the multimodal dialogue benchmark dataset, we show that BI-MDRG can effectively increase the quality of multimodal dialogue. Additionally, recognizing the gap in benchmark datasets for evaluating the image consistency in multimodal dialogue, we have created a curated set of 300 dialogues annotated to track object consistency across conversations.


# 115
CoReS: Orchestrating the Dance of Reasoning and Segmentation

Xiaoyi Bao · Siyang Sun · Shuailei Ma · Kecheng Zheng · Yuxin Guo · Guosheng Zhao · Yun Zheng · Xingang Wang

The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLM) often find it difficult to accurately localize the objects described in complex reasoning contexts. We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search, where each step is a progressive refinement of thought toward the final object. Thus we introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM's outputs into this intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 7.1% on the ReasonSeg dataset. The code will be released soon.


# 107
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Tianhe Wu · Kede Ma · Jie Liang · Yujiu Yang · Yabin Zhang

While Multimodal Large Language Models (MLLMs) have experienced significant advancement on visual understanding and reasoning, the potential they hold as powerful, flexible and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive study of prompting MLLMs for IQA at the system level. Specifically, we first investigate nine system-level prompting methods for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting tricks in natural language processing (i.e., standard, in-context, and chain-of-thought prompting). We then propose a difficult sample selection procedure, taking into account sample diversity and human uncertainty, to further challenge MLLMs coupled with the respective optimal prompting methods identified in the previous step. In our experiments, we assess three open-source and one close-source MLLMs on several visual attributes of image quality (e.g., structural and textural distortions, color differences, and geometric transformations) under both full-reference and no-reference settings, and gain valuable insights into the development of better MLLMs for IQA.


# 113
Grounding Language Models for Visual Entity Recognition

Zilin Xiao · Ming Gong · Paola Cascante-Bonilla · Xingyao Zhang · Jie Wu · Vicente Ordonez

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visual reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits in the recently proposed Oven-Wiki benchmark with accuracy on the Entity seen split rising from 32.7% to 61.5%. It demonstrates superior performance on the unseen and query splits by a substantial double-digit margin, while also preserving the ability to effectively transfer to other generic visual question answering benchmarks without further training.


# 109
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Chuofan Ma · Yi Jiang · Jiannan Wu · Zehuan Yuan · Qi Xiaojuan

We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: https://groma-mllm.github.io/.


# 111
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?

Qinyu Zhao · Ming Xu · Kartik Gupta · Akshay Asthana · Liang Zheng · Stephen Gould

Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content due to inappropriate instructions. This study uses linear probing to shed light on the hidden knowledge at the output layer of LVLMs. We demonstrate that the logit distributions of the first tokens contain sufficient information to determine whether to respond to the instructions, including recognizing unanswerable visual questions, defending against multi-modal jailbreaking attack, and identifying deceptive questions. Such hidden knowledge is gradually lost in logits of subsequent tokens during response generation. Then, we illustrate a simple decoding strategy at the generation of the first token, effectively improving the generated content. In experiments, we find a few interesting insights: First, the CLIP model already contains a strong signal for solving these tasks, indicating potential bias in the existing datasets. Second, we observe performance improvement by utilizing the first logit distributions on three additional tasks, including indicting uncertainty in math solving, mitigating hallucination, and image classification. Last, with the same training data, simply finetuning LVLMs improve models' performance but is still inferior to linear probing on these tasks.


# 114
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

Yu Wang · Xiaogeng Liu · Yu Li · Muhao Chen · Chaowei Xiao

With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), the imperative to ensure their safety has become increasingly pronounced. However, with the integration of additional modalities, MLLMs are exposed to new vulnerabilities, rendering them prone to structured-based jailbreak attacks, where semantic content (e.g., ``harmful text'') has been injected into the images to mislead MLLMs. In this work, we aim to defend against such threats. Specifically, we propose Adaptive Shield Prompting (AdaShield), which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks without fine-tuning MLLMs or training additional modules (e.g., post-stage content detector). Initially, we present a manually designed static defense prompt, which thoroughly examines the image and instruction content step by step and specifies response methods to malicious queries. Furthermore, we introduce an adaptive auto-refinement framework, consisting of a target MLLM and a LLM-based defense prompt generator (Defender). These components collaboratively and iteratively communicate to generate a defense prompt. Extensive experiments on the popular structure-based jailbreak attacks and benign datasets show that our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks without compromising the model's general capabilities evaluated on standard benign tasks.


# 102
UniCode : Learning a Unified Codebook for Multimodal Large Language Models

Sipeng Zheng · Bohan Zhou · Yicheng Feng · Ye Wang · Zongqing Lu

In this paper, we propose \textbf{UniCode}, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts MLLM's ability to generate images and texts in a multimodal context. Towards this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term ``image decompression'', enabling our model to interpret compressed visual data and generate high-quality images. The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches in order to compress visual signals into a more compact token representation. Despite using significantly fewer parameters and less data during training, Unicode demonstrates exceptional capabilities in visual reconstruction and generation. It also achieves performances comparable to leading MLLMs across a spectrum of VQA benchmarks. Our codes are available at \url{https://anonymous.4open.science/r/UniCode}.


# 99
Strong Double Blind
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Swetha Sirnam · Jinyu Yang · Tal Neiman · Mamshad Nayeem Rizve · Son Tran · Benjamin Yao · Trishul A Chilimbi · Shah Mubarak

Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by incorporating visual perceptioning capabilities into Large Language Models (LLMs). The prevailing trend in this field involves the utilization of a vision encoder derived from vision-language contrastive learning (CL), showing expertise in capturing overall representations while facing difficulties in capturing detailed local patterns. In this work, we focus on enhancing the visual representations for MLLMs by combining high-frequency and fine-grained representations, obtained through masked image modeling (MIM), with semantically-enriched low-frequency representations captured by CL. To achieve this goal, we introduce X-Former which is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM through an innovative interaction mechanism. Specifically, X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders, i.e., CLIP-ViT (CL-based) \cite{radford2021clip} and MAE-ViT (MIM-based) \cite{he2022mae}. It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM. To demonstrate the effectiveness of our approach, we assess its performance on tasks demanding fine-grained visual understanding. Our extensive empirical evaluations indicate that X-Former excels in visual reasoning tasks encompassing both structural and semantic categories within the GQA dataset. Assessment on a fine-grained visual perception benchmark further confirms its superior capabilities in visual understanding.


# 88
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding

jiazhou zhou · Xu Zheng · Yuanhuiyi Lyu · LIN WANG

In this paper, we propose EventBind, a novel and effective framework that unleashes the potential of vision-language models (VLMs) for event-based recognition to compensate for the lack of large-scale event-based datasets. In particular, due to the distinct modality gap with the image-text data and the lack of large-scale datasets, learning a common representation space for images, texts, and events is non-trivial. Intuitively, we need to address two key challenges: 1) how to generalize CLIP's visual encoder to event data while fully leveraging events' unique properties, e.g., sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, i.e., image, text, and events. Accordingly, we first introduce a novel event encoder that subtly models the temporal information from events and meanwhile generates event prompts for modality bridging. We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability across diverse datasets. With the proposed event encoder, text encoder, and image encoder, a novel Hierarchical Triple Contrastive Alignment (HTCA}) module is introduced to jointly optimize the correlation and enable efficient knowledge transfer among the three modalities. We evaluate various settings, including fine-tuning and few-shot on three benchmarks and our EventBind achieves new state-of-art accuracy compared with the previous methods, such as N-Caltech101 (+5.34% and +1.70%) and N-Imagenet (+5.65% and +1.99%) with fine-tuning and 20-shot settings respectively. Moreover, our EventBind can be flexibly extended to the event retrieval task using text or image queries, showing plausible performance. Our project code will be made publicly available.


# 245
Strong Double Blind
EDformer: Transformer-Based Event Denoising Across Varied Noise Levels

Bin Jiang · Bo Xiong · Bohan Qu · M. Salman Asif · You Zhou · Zhan Ma

Currently, there is relatively limited research on the non-informative background activity noise of event cameras in different brightness conditions, and the relevant real-world datasets is extremely scarce. This limitation contributes to the lack of robustness in existing event denoising algorithms when applied in practical scenarios. This paper addresses this gap by collecting and analyzing background activity noise from the DAVIS346 event camera under different illumination conditions. We introduce the first real-world event denoising dataset, ED24, encompassing 21 noise levels and noise annotations. Furthermore, we propose EDformer, an innovative event-by-event denoising model based on transformer. This model excels in event denoising by learning the spatiotemporal correlations among events across varied noise levels. In comparison to existing denoising algorithms, the proposed EDformer achieves state-of-the-art performance in denoising accuracy, including open-source datasets and datasets captured in practical scenarios with low-light intensity requirements such as zebrafish blood vessels imaging.


# 90
Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

Kaiwen Cai · ZheKai Duan · Gaowen Liu · Charles Fleming · Chris Xiaoxuan Lu

Recent advancements in Vision-Language (VL) models have sparked interest in their deployment on edge devices, yet challenges in handling diverse visual modalities, manual annotation, and computational constraints remain. We introduce EdgeVL, a novel framework that bridges this gap by seamlessly integrating dual-modality knowledge distillation and quantization-aware contrastive learning. This approach enables the adaptation of large VL models, like CLIP, for efficient use with both RGB and non-RGB images on resource-limited devices without the need for manual annotations. EdgeVL not only transfers visual language alignment capabilities to compact models but also maintains feature quality post-quantization, significantly enhancing open-vocabulary classification performance across various visual modalities. Our work represents the first systematic effort to adapt large VL models for edge deployment, showcasing up to 15.4% accuracy improvements on multiple datasets and up to 93-fold reduction in model size.


# 100
Strong Double Blind
The Hard Positive Truth about Vision-Language Compositionality

Amita Kamath · Cheng-Yu Hsieh · Kai-Wei Chang · Ranjay Krishna

Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have been overstated --- because existing benchmarks do not probe whether finetuned models remain invariant to hard positives. By curating an evaluation dataset with 112,382 both hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 training set with both hard negatives and hard positives captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating an improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related ``positive'' concepts.


# 101
Strong Double Blind
HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs

Ziwei Yao · Ruiping Wang · Xilin CHEN

With the advancements of vision-language models, the growing demand for generating customized image descriptions under length, target regions, and other various control conditions brings new challenges for evaluation. Most existing metrics, designed primarily for single-sentence image captioning with an overall matching score, struggle to accommodate complex description requirements, resulting in insufficient accuracy and interpretability of evaluation. Therefore, we propose HiFi-Score, a hierarchical parsing graph-based fine-grained evaluation metric. Specifically, we model both text and images as parsing graphs, which organize multi-granular instances into a hierarchical structure according to their inclusion relationships, which provides a comprehensive scene analysis for both modalities from global to local. Based on the fine-grained matching between the graphs, we can evaluate the fidelity to ensure text contents are related to image and the adequacy to ensure the image is covered by text at multiple levels. Furthermore, we employ the large language model to evaluate fluency of the language expression. Human correlation experiments on four caption-level benchmarks show that the proposed metric outperforms existing metrics. At the paragraph-level, we construct a novel dataset ParaEval and demonstrate the accuracy of the HiFi-Score in evaluating long texts. We further show its superiority in assessing vision-language models and its flexibility when applied to various image description tasks.


# 104
Strong Double Blind
LLMCO4MR: LLMs-aided Neural Combinatorial Optimization for Ancient Manuscript Restoration from Fragments with Case Studies on Dunhuang

Yuqing Zhang · Hangqi Li · Shengyu Zhang · Runzhong Wang · Baoyi He · Huaiyong Dou · Junchi Yan · Yongquan Zhang · Fei Wu

Restoring ancient manuscripts fragments, such as those from Dunhuang, is crucial for preserving human historical culture. However, their worldwide dispersal and the shifts in cultural and historical contexts pose significant restoration challenges. Traditional archaeological efforts primarily focus on manually piecing major fragments together, yet the vast majority of small, more intricate pieces remain largely unexplored, which is technically due to their irregular shapes, sparse textual content, and extensive combinatorial space for reassembly. In this paper, we formalize the task of restoring the ancient manuscript from fragments as a cardinality-constrained combinatorial optimization problem, and propose a general framework named LLMCO4MS: (Multimodal) Large Language Model-aided Combinatorial Optimization Neural Networks for Ancient Manuscript Restoration. Specifically, LLMCO4MS encapsulates a neural combinatorial solver equipped with a differentiable optimal transport (OT) layer, to efficiently predict the Top-K likely reassembly candidates. Innovatively, the Multimodal Large Language Model (MLLM) is then adopted and prompted to yield pairwise matching confidence and relative directions for final restoration. Extensive experiments on both synthetic data and real-world famous Dunhuang fragments demonstrate superior performance of our approach. In particular, LLMCO4MS has facilitated the discovery of previously unknown civilian economic documents from the 10-th century in real-world applications.


# 97
Language-Image Pre-training with Long Captions

Kecheng Zheng · Yifei Zhang · Wei Wu · Fan Lu · Shuailei Ma · Xin Jin · Wei Chen · Yujun Shen

Language-image pre-training largely relies on how precisely and thoroughly a text describes its paired image. In practice, however, the contents of an image can be so rich that well describing them requires lengthy captions (\textit{e.g.}, with 10 sentences), which are usually missing in existing datasets. Consequently, there are currently no clear evidences on whether and how language-image pre-training could benefit from long captions. To figure this out, we first re-caption 30M images with detailed descriptions using a pre-trained Multi-modality Large Language Model (MLLM), and then study the usage of the resulting captions under a contrastive learning framework. We observe that, each sentence within a long caption is very likely to describe the image partially (\textit{e.g.}, an object). Motivated by this, we propose to dynamically sample sub-captions from the text label to construct multiple positive pairs, and introduce a grouping loss to match the embeddings of each sub-caption with its corresponding local image patches in a self-supervised manner. Experimental results on a wide rage of downstream tasks demonstrate the consistent superiority of our method, termed DreamLIP, over previous alternatives, highlighting its fine-grained representational capacity. It is noteworthy that, on the tasks of image-text retrieval and semantic segmentation, our model trained with 30M image-text pairs achieves on par or even better performance than CLIP trained with 400M pairs. The code and generated captions will be made publicly available.


# 96
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers

Chenglin Yang · Siyuan Qiao · Yuan Cao · Yu Zhang · Tao Zhu · Alan Yuille · Jiahui Yu

Generative training has been demonstrated to be powerful for building visual-language models. However, on zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives. In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks, without any finetuning processes or additional modules. Specifically, we focus on narrowing the gap between the generative captioner and the CLIP classifier. We begin by analysing the predictions made by the captioner and classifier and observe that the caption generation inherits the distribution bias from the language model trained with pure text modality, making it less grounded on the visual signal. To tackle this problem, we redesign the scoring objective for the captioner to alleviate the distributional bias and focus on measuring the gain of information brought by the visual inputs. We further design a generative training objective to match the evaluation objective. We name our model trained and evaluated from the novel procedures as Information Gain (IG) captioner. We pretrain the models on the public Laion-5B dataset and perform a series of discriminative evaluations. For the zero-shot classification on ImageNet, IG captioner achieves > 18\% improvements over the standard captioner, achieving comparable performances with the CLIP classifier. IG captioner also demonstrated strong performance on zero-shot image-text retrieval tasks on MSCOCO and Flickr30K. We hope this paper inspires further research towards unifying generative and discriminative training procedures for visual-language models.


# 98
Strong Double Blind
CIC-BART-SSA: : Controllable Image Captioning with Structured Semantic Augmentation

Kalliopi Basioti · Mohamed A Abdelsalam · Federico Fancellu · Vladimir Pavlovic · Afsaneh Fazly

Controllable Image Captioning (CIC) aims at generating natural language descriptions for an image, conditioned on information provided by end users, e.g., regions, entities or events of interest. However, available image--language datasets mainly contain captions that describe the entirety of an image, making them ineffective for training CIC models that can potentially attend to any subset of regions or relationships. To tackle this challenge, we propose a novel, fully automatic method to sample additional focused and visually grounded captions using a unified structured semantic representation built on top of the existing set of captions associated with an image. We leverage Abstract Meaning Representation (AMR), a cross-lingual graph-based semantic formalism, to encode all possible spatio-semantic relations between entities, beyond the typical spatial-relations-only focus of current methods. We use this Structured Semantic Augmentation (SSA) framework to augment existing image--caption datasets with the grounded controlled captions, increasing their spatial and semantic diversity and focal coverage. We then develop a new model, CIC-BART-SSA, specifically tailored for the CIC task, that sources its control signals from SSA-diversified datasets. We empirically show that, compared to SOTA CIC models, CIC-BART-SSA generates captions that are superior in diversity and text quality, are competitive in controllability, and, importantly, minimize the gap between broad and highly focused controlled captioning performance by efficiently generalizing to the challenging highly focused scenarios.


# 93
Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective

Fangzhou Song · Bin Zhu · Yanbin Hao · Shuo Wang

Learning recipe and food image representation in common embedding space is non-trivial but crucial for cross-modal recipe retrieval. In this paper, we propose a new perspective for this problem by utilizing foundation models for data augmentation. Leveraging on the remarkable capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipe and food image by extracting alignable information related to the counterpart. Specifically, Llama2 is employed to generate a textual description from the recipe, aiming to capture the visual cues of a food image, and SAM is used to produce image segments that correspond to key ingredients in the recipe. To make full use of the augmented data, we introduce Data Augmented Retrieval framework (DAR) to enhance recipe and image representation learning for cross-modal retrieval. We first inject adapter layers to pre-trained CLIP model to reduce computation cost rather than fully fine-tuning all the parameters. In addition, multi-level circle loss is proposed to align the original and augmented data pairs, which assigns different penalties for positive and negative pairs. On the Recipe1M dataset, our DAR outperforms all existing methods by a large margin. Extensive ablation studies validate the effectiveness of each component of DAR. We will make our code and models publicly available.


# 94
Strong Double Blind
Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval

Naoya Sogi · Takashi Shibata · Makoto Terao

The pre-trained vision and language (V\&L) models have substantially improved the performance of cross-modal image-text retrieval. In general, however, these V\&L models have limited retrieval performance for small objects because of the rough alignment between words and the small objects in the image. In contrast, it is known that human cognition is object-centric, and we pay more attention to important objects, even if they are small. To bridge this gap between the human cognition and the V\&L model's capability, we propose a cross-modal image-text retrieval framework based on ``object-aware query perturbation.'' The proposed method generates a key feature subspace of the detected objects and perturbs the corresponding queries using this subspace to improve the object awareness in the image. In our proposed method, object-aware cross-modal image-text retrieval is possible while keeping the rich expressive power and retrieval performance of existing V\&L models without additional fine-tuning. Comprehensive experiments on four public datasets show that our method outperforms conventional algorithms.


# 83
Strong Double Blind
Cascade Prompt Learning for Visual-Language Model Adaptation

Ge Wu · Xin Zhang · Zheng Li · Zhaowei Chen · Jiajun Liang · Jian Yang · Xiang Li

Prompt learning has surfaced as an effective approach to enhance the performance of Vision-Language Models (VLMs) like CLIP when applied to downstream tasks. However, current learnable prompt tokens are primarily used for the single phase of adapting to tasks (i.e., adapting prompt), easily leading to overfitting risks. In this work, we propose a novel Cascade Prompt Learning (CasPL) framework to enable prompt learning to serve both generic and specific expertise (i.e., boosting and adapting prompt) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first boosting prompt is crafted to extract domain-general knowledge from a senior larger CLIP teacher model by aligning their predicted logits using extensive unlabeled domain images. The second adapting prompt is then cascaded with the frozen first set to fine-tune the downstream tasks, following the approaches employed in prior research. In this manner, CasPL can effectively capture both domain-general and task-specific representations into explicitly different gradual groups of prompts, thus potentially alleviating overfitting issues in the target domain. It's worth noting that CasPL serves as a plug-and-play module that can seamlessly integrate into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLM models in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets.


# 175
Strong Double Blind
Gaze Target Detection Based on Head-Local-Global Coordination

Yaokun Yang · Feng Lu

This paper introduces a novel approach to gaze target detection leveraging a head-local-global coordination framework. Unlike traditional methods that rely heavily on estimating gaze direction and identifying salient objects in global view images, our method incorporates a FOV-based local view to more accurately predict gaze targets. We also propose a unique global-local position and representation consistency mechanism to integrate the features from head view, local view, and global view, significantly improving prediction accuracy. Through extensive experiments, our approach demonstrates state-of-the-art performance on multiple significant gaze target detection benchmarks, showcasing its scalability and the effectiveness of the local view and view-coordination mechanisms. The method's scalability is further evidenced by enhancing the performance of existing gaze target detection methods within our proposed head-local-global coordination framework.


# 73
Strong Double Blind
Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model

Yang Jin · Lei Zhang · Shi Yan · Bin Fan · Binglu Wang

Gaze object prediction (GOP) aims to predict the category and location of the object that a human is looking at. Previous methods utilized box-level supervision to identify the object that a person is looking at, but struggled with semantic ambiguity, \ie, a single box may contain several items since objects are close together. The Vision foundation model (VFM) has improved in object segmentation using box prompts, which can reduce confusion by more precisely locating objects, offering advantages for fine-grained prediction of gaze objects. This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask corresponding to the object captured by human gaze behavior. In particular, we propose that the pixel-level supervision provided by VFM can be integrated into gaze object prediction to mitigate semantic ambiguity. This leads to our gaze object detection and segmentation framework that enables accurate pixel-level predictions. Different from previous methods that require additional head input or ignore head features, we propose to automatically obtain head features from scene features to ensure the model’s inference efficiency and flexibility in the real world. Moreover, rather than directly fuse features to predict gaze heatmap as in existing methods, which may overlook spatial location and subtle details of the object, we develop a space-to-object gaze regression method to facilitate human-object gaze interaction. Specifically, it first constructs an initial human-object spatial connection, then refines this connection by interacting with semantically clear features in the segmentation branch, ultimately predicting a gaze heatmap for precise localization. Extensive experiments on GOO-Synth and GOO-Real datasets demonstrate the effectiveness of our method.


# 91
Strong Double Blind
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

William Zhu · Keren Ye · Junjie Ke · Jiahui Yu · Leonidas Guibas · Peyman Milanfar · Feng Yang

Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zeroshot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP’s contrastively-learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) on this reformulation and naturally distilling its knowledge of image-object-attribute relations to use towards attribute recognition. Specifically, for each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute’s relation to objects on the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attribute in the Wild (VAW), and our newly-proposed Visual Genome Attribute Ranking (VGARank).


# 87
Towards Open-Ended Visual Recognition with Large Language Models

Qihang Yu · Xiaohui Shen · Liang-Chieh Chen

Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception. Recent methods have endeavored to address the issue by employing a class-agnostic mask (or box) proposal model, complemented by an open-vocabulary classifier (e.g., CLIP) using pre-extracted text embeddings. However, it is worth noting that these open-vocabulary recognition models still exhibit limitations in practical applications. On one hand, they rely on the provision of class names during testing, where the recognition performance heavily depends on this predefined set of semantic classes by users. On the other hand, when training with multiple datasets, human intervention is required to alleviate the label definition conflict between them. In this paper, we introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a straightforward and effective solution to the aforementioned challenges. Specifically, OSM predicts class labels in a generative manner, thus removing the supply of class names during both training and testing. It also enables cross-dataset training without any human interference, exhibiting robust generalization capabilities due to the world knowledge acquired from the LLM. By combining OSM with an off-the-shelf mask proposal model, we present promising results on various benchmarks, and demonstrate its effectiveness in handling novel concepts. Code will be made available.


# 57
AFreeCA: Annotation-Free Counting for All

Adriano DAlessandro · Ali Mahdavi-Amiri · Ghassan Hamarneh

Object counting methods typically rely on manually annotated datasets. The cost of creating such datasets has restricted the versatility of these networks to count objects from specific classes (such as humans or penguins), and counting objects from diverse categories remains a challenge. The availability of robust text-to-image latent diffusion models (LDMs) raises the question of whether these models can be utilized to generate counting datasets. However, LDMs struggle to create images with an exact number of objects based solely on text prompts but they can be used to offer a dependable \textit{sorting} signal by adding and removing objects within an image. Leveraging this data, we initially introduce an unsupervised sorting methodology to learn object-related features that are subsequently refined and anchored for counting purposes using counting data generated by LDMs. Further, we present a density classifier-guided method for dividing an image into patches containing objects that can be reliably counted. Consequently, we can generate counting data for any type of object and count them in an unsupervised manner. Our approach outperforms other unsupervised and few-shot alternatives and is not restricted to specific object classes for which counting data is available. Code to be released upon acceptance.


# 64
Strong Double Blind
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models

Zijian Zhou · Zheng Zhu · Holger Caesar · Miaojing SHI

Panoptic Scene Graph Generation (PSG) aims to segment objects and recognize their relations, enabling the structured understanding of an image. Previous methods focus on predicting predefined object and relation categories, hence limiting their applications in the open world scenarios. With the rapid development of large multimodal models (LMMs), significant progress has been made in open-set object detection and segmentation, yet open-set object relation prediction in PSG remains unexplored. In this paper, we focus on the task of open-set object relation prediction integrated with a pretrained open-set panoptic segmentation model to achieve true open-set panoptic scene graph generation. To this end, we propose an Open-set Panoptic Scene Graph Generation method (OpenPSG), which leverages LMMs to achieve open-set relation prediction in an autoregressive manner. We introduce a relation query transformer to efficiently extract visual features of object pairs and estimate the existence of relations between them. The latter can enhance the prediction efficiency by filtering irrelevant pairs. Finally, we design the generation and judgement instructions to perform open-set relation prediction in PSG autoregressively. To our knowledge, we are the first to propose the open-set PSG task. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-set relation prediction and panoptic scene graph generation.


# 74
Strong Double Blind
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

Kuo Wang · Lechao Cheng · Weikai Chen · Pingping Zhang · Liang Lin · Fan Zhou · Guanbin Li

Learning from pseudo-labels that generated with VLMs (Vision Language Models) has been shown as a promising solution to assist open vocabulary detection (OVD) in recent studies. However, due to the domain gap between VLM and vision-detection tasks, pseudo-labels produced by the VLMs are prone to be noisy, while the training design of the detector further amplifies the bias. In this work, we investigate the root cause of VLMs' biased prediction under the OVD context. Our observations lead to a simple yet effective paradigm, coded MarvelOVD, that generates significantly better training targets and optimizes the learning procedure in an online manner by marrying the capability of the detector with the vision-language model. Our key insight is that the detector itself can act as a strong auxiliary guidance to accommodate VLM's inability of understanding both the `background'' and the context of a proposal within the image. Based on it, we greatly purify the noisy pseudo-labels via Online Mining and propose Adaptive Reweighting to effectively suppress the biased training boxes that are not well aligned with the target object. In addition, we also identify a neglectedbase-novel-conflict' problem and introduce stratified label assignments to prevent it. Extensive experiments on COCO and LVIS datasets demonstrate that our method outperforms the other state-of-the-arts by significant margins. Code will be released.


# 85
Strong Double Blind
Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

Ruihuang Li · Zhengqiang ZHANG · Chenhang He · Zhiyuan MA · Vishal Patel · Yabin Zhang

Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. However, their applications to 3D dense prediction tasks often encounter the difficulties of limited high-quality and densely-annotated 3D data. Previous open-vocabulary 3D scene understanding methods mostly focus on training 3D models using either image or text supervision while neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space for maximizing their synergistic benefits. Instead of extracting coarse view- or region-level text prompts, we leverage large vision-language models to extract complete category information and scalable scene descriptions to build the text modality, and take image modality as the bridge to build dense point-pixel-text associations. Besides, in order to enhance the generalization ability of the 2D model for downstream 3D tasks without compromising the open-vocabulary capability, we employ a dual-path integration approach to combine frozen CLIP visual features and learnable mask features. Extensive experiments show that our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.


# 103
Strong Double Blind
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

Yi-Chia Chen · WeiHua Li · Cheng Sun · Yu-Chiang Frank Wang · Chu-Song Chen

We introduce SAM4MLLM, an innovative approach which integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach that can effectively find prompt points for SAM to perform segmentation based on MLLM. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified language-based manner without additional computational overhead in learning. Experimental results on pubic benchmarks demonstrate the effectiveness of our approach.


# 84
Strong Double Blind
Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining

Diwei Su · cheng fei · Jianxu Luo

In recent years, vision transformers based on self-attention mechanisms have demonstrated remarkable abilities in various tasks such as natural language processing, computer vision (CV), and multimodal applications. However, due to the high computational costs and the structural nature of images, Transformers in CV face challenges in handling ultra-high-resolution images. Recently, several token reduction methods have been proposed to improve the computational efficiency of Transformers by reducing the number of tokens without the need for retraining. These methods primarily involve fusion based on matching or clustering. The former exhibits faster speed but suffers more accuracy loss compared to the latter. In this work, we propose a simple matching-based fusion method called Token Adapter, which achieves comparable accuracy to the clustering-based fusion method with faster speed and demonstrates higher potential in terms of robustness. Our method was applied to Segmenter and MaskDINO, exhibiting promising performance on three tasks, including semantic segmentation, instance segmentation, and panoptic segmentation. Specifically, our method can be applied to Segmenter on ADE20k, providing 41% frames per second (FPS) acceleration while maintaining full performance without retraining or fine-tuning off-the-shelf weights.


# 86
Strong Double Blind
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Mengcheng Lan · Chaofeng Chen · Yiping Ke · Xinjiang Wang · Litong Feng · Wayne Zhang

Despite the success of large-scale pretrained Vision-Language Models (VLMs) especially CLIP in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality. With a comparative analysis of statistical properties in the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, implementing the self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our discoveries.


# 80
Strong Double Blind
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

Tong Shao · Zhuotao Tian · Hang Zhao · Jingyong Su

CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods.


# 77
Strong Double Blind
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation

Soojin Jang · JungMin Yun · JuneHyoung Kwon · Eunju Lee · YoungBin Kim

Weakly supervised semantic segmentation (WSSS) approaches typically rely on class activation maps (CAMs) for initial seed generation, which often fail to capture global context due to limited supervision from image-level labels. To address this issue, we introduce DALNet, Dense Alignment Learning Network that leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. Our key insight is to employ a dual-level alignment strategy: (1) Global Implicit Alignment (GIA) to capture global semantics by maximizing the similarity between the class token and the corresponding text embeddings while minimizing the similarity with background embeddings, and (2) Local Explicit Alignment (LEA) to improve object localization by utilizing spatial information from patch tokens. Moreover, we propose a cross-contrastive learning approach that aligns foreground features between image and text modalities while separating them from the background, encouraging activation in missing regions and suppressing distractions. Through extensive experiments on the PASCAL VOC and MS COCO datasets, we demonstrate that DALNet significantly outperforms state-of-the-art WSSS methods. Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.


# 135
N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

Yash Bhalgat · Iro Laina · Joao F Henriques · Andrew ZISSERMAN · Andrea Vedaldi

Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to either the physical dimensions or semantics or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image-space, and query the CLIP vision-encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such as such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.


# 75
Prioritized Semantic Learning for Zero-shot Instance Navigation

Xinyu Sun · Lizhao Liu · Hongyan Zhi · Ronghe Qiu · Junwei Liang

We study zero-shot instance navigation, in which the agent navigates to a specific object without using object annotations for training. Previous object navigation approaches apply the image-goal navigation (ImageNav) task (go to the location of an image) for pretraining, and transfer the agent to achieve object goals using a vision-language model. However, these approaches lead to issues of semantic neglect, where the model fails to learn meaningful semantic alignments. In this paper, we propose a Prioritized Semantic Learning (PSL) method to improve the semantic understanding ability of navigation agents. Specifically, a semantic-enhanced PSL agent is proposed and a prioritized semantic training strategy is introduced to select goal images that exhibit clear semantic supervision and relax the reward function from strict exact view matching. At inference time, a semantic expansion inference scheme is designed to preserve the same granularity level of the goal-semantic as training. Furthermore, for the popular HM3D environment, we present an Instance Navigation (InstanceNav) task that requires going to a specific object instance with detailed descriptions, as opposed to the Object Navigation (ObjectNav) task where the goal is defined merely by the object category. Our PSL agent outperforms the previous state-of-the-art by 66% on zero-shot ObjectNav in terms of success rate and is also superior on the new InstanceNav task.


# 117
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model

Amrin Kareem · Jean Lahoud · Hisham Cholakkal

Recent advancements in 3D perception systems have significantly improved their ability to perform visual recognition tasks such as segmentation. However, these systems still heavily rely on explicit human instruction to identify target objects or categories, lacking the capability to actively reason and comprehend implicit user intentions. We introduce a novel segmentation task known as reasoning part segmentation for 3D objects, aiming to output a segmentation mask based on complex and implicit textual queries about specific parts of a 3D object. To facilitate evaluation and benchmarking, we present a large 3D dataset comprising over 60k instructions paired with corresponding ground-truth part segmentation annotations specifically curated for reasoning-based 3D part segmentation. We propose a model that is capable of segmenting parts of 3D objects based on implicit textual queries and generating natural language explanations corresponding to 3D object segmentation requests. Experiments show that our method achieves competitive performance to models that use explicit queries, with the additional abilities to identify part concepts, reason about them, and complement them with world knowledge. Our source code, dataset, and trained models are publicly available.


# 82
SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

Lukas Hoyer · David Tan · Muhammad Ferjad Naeem · Luc Van Gool · Federico Tombari

In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs) are able to learn diverse semantic knowledge from image-caption datasets but produce noisy segmentation due to the image-level training. In SemiVL, we newly propose to integrate rich priors from VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries. To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. Further, we design a language-guided decoder to jointly reason over vision and language. Finally, we propose to handle inherent ambiguities in class labels by instructing the model with language guidance in the form of class definitions. We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods. For instance, SemiVL improves the state of the art by +13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC with 92 annotated images. The source code will be released with the paper.


# 65
Strong Double Blind
Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

Tao Chen · Xiruo Jiang · Gensheng Pei · Zeren Sun · yucheng wang · Yazhou Yao

Though adversarial erasing has prevailed in weakly supervised semantic segmentation to help activate integral object regions, existing approaches still suffer from the dilemma of under-activation and over-expansion due to the difficulty in determining when to stop erasing. In this paper, we propose a \textbf{K}nowledge \textbf{T}ransfer with \textbf{S}imulated Inter-Image \textbf{E}rasing (KTSE) approach for weakly supervised semantic segmentation to alleviate the above problem. In contrast to existing erasing-based methods that remove the discriminative part for more object discovery, we propose a simulated inter-image erasing scenario to weaken the original activation by introducing extra object information. Then, object knowledge is transferred from the anchor image to the consequent less activated localization map to strengthen network localization ability. Considering the adopted bidirectional alignment will also weaken the anchor image activation if appropriate constraints are missing, we propose a self-supervised regularization module to maintain the reliable activation in discriminative regions and improve the inter-class object boundary recognition for complex images with multiple categories of objects. In addition, we resort to intra-image erasing and propose a multi-granularity alignment module to gently enlarge the object activation to boost the object knowledge transfer. Extensive experiments and ablation studies on PASCAL VOC 2012 and COCO datasets demonstrate the superiority of our proposed approach.


# 66
Strong Double Blind
ProMerge: Prompt and Merge for Unsupervised Instance Segmentation

Dylan Li · Gyungin Shin

Unsupervised instance segmentation aims to segment distinct object instances in an image without relying on human-labeled data. This field has recently seen significant advancements, partly due to the strong local correspondences afforded by rich visual feature representations from self-supervised models (e.g., DINO). Recent state-of-the-art approaches tackle this challenge by framing instance segmentation as a graph partitioning problem, solved via a generalized eigenvalue system (i.e., normalized-cut) using the self-supervised features. While effective, this strategy is limited by its computational demands, leading to slow inference speeds. In our work, we propose Prompt and Merge (ProMerge), a computationally efficient yet competitive method. We begin by leveraging self-supervised visual features to obtain initial groupings of patches and apply a strategic merging to these segments, aided by a sophisticated background-based mask pruning technique. ProMerge not only yields competitive results but also offers a significant reduction in inference time compared to state-of-the-art normalized-cut-based approaches. Furthermore, by training an object detector (i.e., Cascade Mask R-CNN) using our mask predictions as pseudo-labels, our results reveal that this detector surpasses current leading unsupervised methods. The code will be made publicly available.


# 282
Strong Double Blind
Part2Object: Hierarchical Unsupervised 3D Instance Segmentation

cheng Shi · Yulin zhang · Bin Yang · Jiajin Tang · Yuexin Ma · Sibei Yang

Unsupervised 3D instance segmentation aims to segment objects from a 3D point cloud without any annotations. Existing methods face the challenge of either too loose or too tight clustering, leading to under-segmentation or over-segmentation. To address this issue, we propose Part2Object, hierarchical clustering with object guidance. Part2Object employs multi-layer clustering from points to object parts and objects, allowing objects to manifest at any layer. Additionally, it extracts and utilizes 3D objectness priors from temporally consecutive 2D RGB frames to guide the clustering process. Moreover, we propose Hi-Mask3D to support hierarchical 3D object part and instance segmentation. By training Hi-Mask3D on the objects and object parts extracted from Part2Object, we achieve consistent and superior performance compared to state-of-the-art models in various settings, including unsupervised instance segmentation, data-efficient fine-tuning, and cross-dataset generalization. The code will be made publicly available.


# 278
Strong Double Blind
Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation

Ruijie Xu · Chuyu Zhang · Hui Ren · Xuming He

We tackle the novel class discovery in point cloud segmentation, which discovers novel classes based on the semantic knowledge of seen classes. Existing work proposes an online point-wise clustering method with a simplified equal class-size constraint on the novel classes to avoid degenerate solutions. However, the inherent imbalanced distribution of novel classes in point clouds contradicts the equal class-size constraint, and point-wise clustering ignores the rich spatial context information of objects, which results in less expressive representation for semantic segmentation. To address the above challenges, we propose a novel self-labeling strategy that adaptively generates high-quality pseudo-labels for imbalanced classes during model training. In addition, we develop a dual-level representation that incorporates regional consistency into the point-level classifier learning, reducing noise in generated segmentation. Finally, we conduct extensive experiments on two widely used datasets, SemanticKITTI and SemanticPOSS, and the results show our method outperforms the state-of-the-art by a large margin.


# 76
Strong Double Blind
Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond

Silvio Galesso · Philipp Schröppel · Hssan Driss · Thomas Brox

In recent years, research on out-of-distribution (OoD) detection for semantic segmentation has mainly focused on road scenes -- a domain with a constrained amount of semantic diversity. In this work, we challenge this constraint and extend the domain of this task to general natural images. To this end, we introduce: 1. the ADE-OoD benchmark, which is based on the ADE20k dataset and includes images from diverse domains with a high semantic diversity, and 2. a novel approach that uses Diffusion score matching for OoD detection (DOoD) and is robust to the increased semantic diversity. ADE-OoD features indoor and outdoor images, defines 150 semantic categories as in-distribution, and contains a variety of OoD objects. For DOoD, we train a diffusion model with an MLP architecture on semantic in-distribution embeddings and build on the score matching interpretation to compute pixel-wise OoD scores at inference time. On common road scene OoD benchmarks, DOoD performs on par or better than the state of the art, without using outliers for training or making assumptions about the data domain. On ADE-OoD, DOoD outperforms previous approaches, but leaves much room for future improvements.


# 71
UniFS: Universal Few-shot Instance Perception with Point Representations

Sheng Jin · Ruijie Yao · Lumin Xu · Wentao Liu · Chen Qian · Ji Wu · Ping Luo

Instance perception tasks (object detection, instance segmentation, pose estimation, counting) play a key role in industrial applications of visual models. As supervised learning methods suffer from high labeling cost, few-shot learning methods which effectively learn from a limited number of labeled examples are desired. Existing few-shot learning methods primarily focus on a restricted set of tasks, presumably due to the challenges involved in designing a generic model capable of representing diverse tasks in a unified manner. In this paper, we propose UniFS, a universal few-shot instance perception model that unifies a wide range of instance perception tasks by reformulating them into a dynamic point representation learning framework. Additionally, we propose Structure-Aware Point Learning (SAPL) to exploit the higher-order structural relationship among points to further enhance representation learning. Our approach makes minimal assumptions about the tasks, yet it achieves competitive results compared to highly specialized and well optimized specialist models. Codes and data are available at \url{https://github.com/jin-s13/UniFS}.


# 68
Strong Double Blind
Crowd-SAM:SAM as a smart annotator for object detection in crowded scenes

Zhi Cai · Yingjie Gao · Yaoyan Zheng · Nan Zhou · Di Huang

In computer vision, pedestrian detection is an important task that finds its application in many scenarios. However, obtaining extensive labels can be challenging, especially in crowded scenes. Current solutions to this problem often rely on massive unlabeled data and extensive training to achieve promising results. In this paper, we propose Crowd-SAM, a novel few-shot object detection/segmentation framework that utilizes SAM for pedestrian detection. CrowdSAM combines the advantage of DINO and SAM, to localize the foreground first and then generate dense prompts conditioned on the heatmap. Specifically, we design an efficient prompt sampler to deal with the dense point prompts and a Part-Whole Discrimination Network that re-scores the decoded masks and selects the proper ones. Our experiments are conducted on the CrowdHuman and CityScape benchmarks. On CrowdHuman, CrowdSAM achieves 78.4 A, which is comparable to fully supervised object detectors, and SOTA in few-shot object detectors, with 10-shot labeled images and less than 1M learnable parameters.


# 67
Strong Double Blind
Adaptive Multi-task Learning for Few-shot Object Detection

Yan Ren · Yanling Li · Wai-Kin Adams Kong

The majority of few-shot object detection methods use a shared feature map for both classification and localization, despite the conflicting requirements of these two tasks. Localization needs scale and positional sensitive features, whereas classification requires features that are robust to scale and positional variations. Although few methods have recognized this challenge and attempted to address it, they may not provide a comprehensive resolution to the issue. To overcome the contradictory preferences between classification and localization in few-shot object detection, an adaptive multi-task learning method, featuring a novel precision-driven gradient balancer, is proposed. This balancer effectively mitigates the conflicts by dynamically adjusting the backward gradient ratios for both tasks. Furthermore, a knowledge distillation and classification refinement scheme based on CLIP is introduced, aiming to enhance individual tasks by leveraging the capabilities of large vision-language models. Experimental results of the proposed method consistently show improvements over strong few-shot detection baselines on benchmark datasets.


# 220
Strong Double Blind
FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection

Jianwei Zhao · Xin Li · Fan Yang · Qiang Zhai · Ao Luo · Zhicheng Jiao · Hong Cheng

Detecting objects seamlessly blended into their surroundings represents a complex task for both human cognitive capabilities and advanced artificial intelligence algorithms. Currently, the majority of methodologies for detecting camouflaged objects mainly focus on utilizing discriminative models with various unique designs. However, it has been observed that generative models, such as Stable Diffusion, possess stronger capabilities for understanding various objects in complex environments; Yet their potential for the cognition and detection of camouflaged objects has not been extensively explored. In this study, we present a novel denoising diffusion model, namely FocusDiffuser, to investigate how generative models can enhance the detection and interpretation of camouflaged objects. We believe that the secret to spotting camouflaged objects lies in catching the subtle nuances in details. Consequently, our FocusDiffuser innovatively integrates specialized enhancements, notably the Boundary-Driven LookUp (BDLU) module and Cyclic Positioning (CP) module, to elevate standard diffusion models, significantly boosting the detail-oriented analytical capabilities. Our experiments demonstrate that FocusDiffuser, from a generative perspective, effectively addresses the challenge of camouflaged object detection, surpassing leading models on benchmarks like CAMO, COD10K and NC4K.


# 62
Strong Double Blind
Distilling Knowledge from Large-Scale Image Models for Object Detection

Gang Li · Wenhai Wang · Xiang Li · Ziheng Li · Jian Yang · Jifeng Dai · Yu Qiao · Shanshan Zhang

Large-scale image models have made great progress in recent years, pushing the boundaries of many vision tasks, e.g., object detection. Considering that deploying large models is impractical in many scenes due to expensive computation overhead, this paper presents a new knowledge distillation method, which Distills knowledge from Large-scale Image Models for object detection (dubbed DLIM-Det). To this end, we make the following two efforts: (1) To bridge the gap between the teacher and student, we present a frozen teacher approach. Specifically, to create the teacher model via fine-tuning large models on a specific task, we freeze the pretrained backbone and only optimize the task head. This preserves the generalization capability of large models and gives rise to distinctive characteristics in the teacher. In particular, when equipped with DEtection TRansformers (DETRs), the frozen teacher exhibits sparse query locations, thereby facilitating the distillation process. (2) Considering that large-scale detectors are mainly based on DETRs, we propose a Query Distillation (QD) method specifically tailored for DETRs. The QD performs knowledge distillation by leveraging the spatial positions and pair-wise relations of teacher's queries as knowledge to guide the learning of object queries of the student model. Extensive experiments are conducted on various large-scale image models with parameter sizes ranging from 200M to 1B. Our DLIM-Det improves the student with Swin-Tiny by 3.1 mAP when the DINO detector with Swin-Large is used as the teacher. Besides, even when the teacher has 30 times more parameters than the student, DLIM-Det still attains a +2.9 distillation gain.


# 69
Strong Double Blind
Revisiting Domain-Adaptive Object Detection in Adverse Weather by the Generation and Composition of High-Quality Pseudo-Labels

Rui Zhao · Huibin Yan · Shuoyao Wang

Due to data collection challenges, the mean-teacher learning paradigm has emerged as a critical approach for cross-domain object detection, especially in adverse weather conditions. Despite making significant progress, existing methods are still plagued by low-quality pseudo-labels in the degraded images. In this paper, we propose a generation-composition paradigm training framework that includes the tiny-object-friendly loss, i.e., IAoU loss with a joint-filtering and student-aware strategy to improve pseudo-labels generation quality and refine the filtering scheme. Specifically, in the generation phase of pseudo-labels, we observe that bounding box regression is essential for feature alignment and develop the IAoU loss to improve the accuracy of bounding box regression, further facilitating subsequent feature alignment. We also find that selecting bounding boxes based solely on classification confidence does not perform well in cross-domain noisy image scenes. Moreover, relying exclusively on predictions from the teacher model could cause the student model to collapse. Accordingly, in the composition phase, we introduce the mean-teacher model with a joint-filtering and student-aware strategy combining classification and regression thresholds from both the student and the teacher models. Our extensive experiments, conducted on synthetic and real-world adverse weather datasets, clearly demonstrate that the proposed method surpasses state-of-the-art benchmarks across all scenarios, particularly achieving a 12.4% improvement of mAP i.e., cityscapes to RTTS.


# 22
Strong Double Blind
Operational Open-Set Recognition and PostMax Refinement

Steve Cruz · Ryan Rabinowitz · Manuel Günther · Terrance E. Boult

Open-Set Recognition (OSR) is a problem with mainly practical applications. However, recent evaluations have been dominated by small-scale data and ignore the real-world operational needs of parameter selection. Thus, we revisit the original OSR goals and showcase an improved large-scale evaluation. For modern large networks, we show how deep feature magnitudes of unknown samples are surprisingly larger than for known samples. Therefore, we introduce a novel PreMax algorithm that prenormalizes the logits with the deep feature magnitude and use an extreme-value-based generalized Pareto distribution to map the scores into proper probabilities. Additionally, our proposed Operational Open-Set Accuracy (OOSA) measure can be used to predict an operationally relevant threshold and compare different algorithms' real-world performances. For various pre-trained ImageNet classifiers, including both leading transformer and convolution-based architectures, our experiments demonstrate that PreMax advances the state of the art in open-set recognition with statistically significant improvements on traditional metrics such as AUROC, FPR95, and also on our novel OOSA metric.


# 53
InfMAE: A Foundation Model in The Infrared Modality

Fangcen liu · Chenqiang Gao · Yaming Zhang · Junjie Guo · Jinghao Wang · Deyu Meng

In recent years, foundation models have swept the computer vision field, facilitating the advancement of various tasks within different modalities. However, effectively designing an infrared foundation model remains an open question. In this paper, we introduce InfMAE, a foundation model tailored specifically for the infrared modality. Initially, we present Inf30, an infrared dataset developed to mitigate the scarcity of large-scale data for self-supervised learning within the infrared vision community. Moreover, considering the intrinsic characteristics of infrared images, we design an information-aware masking strategy. It allows for a greater emphasis on the regions with richer information in infrared images during the self-supervised learning process, which is conducive to learning strong representations. Additionally, to enhance generalization capabilities in downstream tasks, we employ a multi-scale encoder for latent representation learning. Finally, we develop an infrared decoder to reconstruct images. Extensive experiments show that our proposed method InfMAE outperforms other supervised and self-supervised learning methods in three key downstream tasks: infrared image semantic segmentation, object detection, and small target detection.


# 283
Strong Double Blind
AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking

Yuheng Li · Tianyu Luan · Yizhou Wu · Shaoyan Pan · Yenho Chen · Xiaofeng Yang

Due to the scarcity of labeled data, self-supervised learning (SSL) has gained much attention in 3D medical image segmentation, by extracting semantic representations from unlabeled data. Among SSL strategies, Masked image modeling (MIM) has shown effectiveness by reconstructing randomly masked images to learn detailed representations. However, conventional MIM methods require extensive training data to achieve good performance, which still poses a challenge for medical imaging. Since random masking uniformly samples all regions within medical images, it may overlook crucial anatomical regions and thus degrade the pretraining efficiency. We propose AnatoMask, a novel MIM method that leverages reconstruction loss to dynamically identify and mask out anatomically significant regions to improve pretraining efficacy. AnatoMask takes a self-distillation approach, where the model learns both how to find more significant regions to mask and how to reconstruct these masked regions. To avoid suboptimal learning, Anatomask adjusts the pretraining difficulty progressively using a masking dynamics function. We have evaluated our method on 4 public datasets with multiple imaging modalities (CT, MRI and PET). AnatoMask demonstrates superior performance and scalability compared to existing SSL methods.


# 261
Strong Double Blind
Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-frequency Fusion and Uncertainty Correction

Wanting Zhang · Huisi Wu · Jing Qin

Breast ultrasound image segmentation is a very challenging task due to the low contrast and blurred boundary between the breast mass and the background. Our goal is to utilize the powerful feature extraction capability of SAM and make out-of-domain tuning to help SAM distinguish breast masses from background. To this end, we propose a novel model called SFRecSAM, which inherits the model architecture of SAM but makes improvements to adapt to breast ultrasound image segmentation. First, we propose a spatial-frequency feature fusion module, which utilizes the fused spatial-frequency features to obtain a more comprehensive feature representation. This fusion feature is used to make up for the shortcomings of SAM's ViT image encoder in extracting low-level feature of masses. It complements the texture details and boundary structure information of masses to better segment targets in low contrast ultrasound images. Second, we propose a dual false corrector, which identifies and corrects false positive and false negative regions using uncertainty estimation, to further improve the segmentation accuracy. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art methods on two representative public breast ultrasound datasets: BUSI and UDIAT. Codes will be released upon publication.


# 144
Effective Lymph Nodes Detection in CT Scans Using Location Debiased Query Selection and Contrastive Query Representation in Transformer

Qinji Yu · Yirui Wang · Ke Yan · Haoshen Li · Dazhou Guo · Li Zhang · Na Shen · Qifeng Wang · Xiaowei Ding · Le Lu · Xianghua Ye · Dakai Jin

Lymph node (LN) assessment is a critical, indispensable yet very challenging task in the routine clinical workflow of radiology and oncology. Accurate LN analysis is essential for cancer diagnosis, staging, and treatment planning. Finding scatteredly distributed, low-contrast clinically relevant LNs in 3D CT is difficult even for experienced physicians under high inter-observer variations. Previous automatic LN detection works typically yield limited recall and high false positives (FPs) due to adjacent anatomies with similar image intensities, shapes, or textures (vessels, muscles, esophagus, etc). In this work, we propose a new LN DEtection TRansformer, named LN-DETR, to achieve more accurate performance. By enhancing the 2D backbone with a multi-scale 2.5D feature fusion to incorporate 3D context explicitly, more importantly, we make two main contributions to improve the representation quality of LN queries. 1) Considering that LN boundaries are often unclear, an IoU prediction head and a location debiased query selection are proposed to select LN queries of higher localization accuracy as the decoder query's initialization. 2) To reduce FPs, query contrastive learning is employed to explicitly reinforce LN queries towards their best-matched ground-truth queries over unmatched query predictions. Trained and tested on 3D CT scans of 1067 patients (with 10,000+ labeled LNs) via combining seven LN datasets from different body parts (neck, chest, and abdomen) and pathologies/cancers, our method significantly improves the performance of previous leading methods by > 4-5% average recall at the same FP rates in both internal and external testing. We further evaluate on the universal lesion detection task using NIH DeepLesion benchmark, and our method achieves the top performance of 88.46% averaged recall across 0.5 to 4 FPs per image, compared with other leading reported results.


# 136
Strong Double Blind
Snuffy: Efficient Whole Slide Image Classifier

Hossein Jafarinia · Alireza Alipanah · Saeed Razavi · Nahal Mirzaie · Mohammad Rohban

Whole Slide Image (WSI) classification has recently gained much attention in digital pathology, facing the researchers with unique challenges. We tackle two main challenges in existing (WSI) classification approaches. Initially, these methods invest a considerable amount of time and computational resources into pre-training a vision backbone on domain-specific training datasets for embedding generation. This limits the development of such models to certain groups and institutions with computational budgets, and hence slowing their development progress. Furthermore, they typically employ architectures with limited approximation capabilities for the Multiple Instance Learning (MIL), which are inadequate for the intricate characteristics of WSIs, resulting in a sub-optimal accuracy. Our research proposes novel solutions to these issues and balances efficiency and performance. Firstly, we present the novel approach of the continual self-supervised pretraining of ImageNet-1K Vision Transformers (ViTs) equipped with Adapters on pathology domain datasets, achieving a level of efficiency orders of magnitude better than prior techniques. Secondly, we introduce an innovative Sparse Transformer architecture and theoretically prove its universal approximability, featuring a new upper bound for the layer count. We additionally evaluate our method on both pathology and MIL datasets, showcasing its superiority on image- and patch-level accuracies compared to the previous methods. Our code is available at \url{https://github.com/jafarinia/snuffy}.


# 92
Unified Medical Image Pre-training in Language-Guided Common Semantic Space

Xiaoxuan He · Yifan Yang · Xinyang Jiang · Xufang Luo · Haoji Hu · Siyun Zhao · Dongsheng Li · Yuqing Yang · Lili Qiu

Vision-Language Pre-training (VLP) has shown the merits of analysing medical images. It efficiently learns visual representations by leveraging supervisions in their corresponding reports, and in turn facilitates analysis and interpretation of intricate imaging data. However, such observation is predominantly justified on single-modality data (mostly 2D images like X-rays), adapting VLP to learning unified representations for medical images in real scenario remains an open challenge. This arises from medical images often encompass a variety of modalities, especially modalities with different dimensions (e.g., 3D images like Computed Tomography), and there are almost no paired multi-dimension data here. To overcome the aforementioned challenges, we propose an \textbf{U}nified \textbf{Med}ical \textbf{I}mage Pre-training framework, namely UniMedI, which utilizes diagnostic reports as common semantic space to create unified representations for diverse modalities of medical images (especially for 2D and 3D images). Under the text's guidance, UniMedI effectively select text-related 2D slices from sophisticated 3D volume, which acts as pseudo-pairs to bridge 2D and 3D data, ultimately enhancing the consistency across various medical imaging modalities. To demonstrate the effectiveness and versatility of UniMedI, we evaluate its performance on both 2D and 3D images across several different datasets, covering a wide range of medical image tasks such as classification, segmentation, and retrieval. UniMedI has demonstrated superior performance in downstream tasks, showcasing its effectiveness in establishing a universal medical visual representation.


# 279
Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging

Peirong Liu · Oula Puonti · Xiaoling Hu · Daniel Alexander · Juan E. Iglesias

Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography (CT), yet they struggle to generalize in uncalibrated modalities -- notably magnetic resonance (MR) imaging, where performance is highly sensitive to the differences in MR contrast, resolution, and orientation. This prevents broad applicability to diverse real-world clinical protocols. We introduce Brain-ID, an anatomical representation learning model for brain imaging. With the proposed "mild-to-severe" intra-subject generation, Brain-ID is robust to the subject-specific brain anatomy regardless of the appearance of acquired images (e.g., contrast, deformation, resolution, artifacts). Trained entirely on synthetic data, Brain-ID readily adapts to various downstream tasks through only one layer. We present new metrics to validate the intra- and inter-subject robustness of Brain-ID features, and evaluate their performance on four downstream applications, covering contrast-independent (anatomy reconstruction/contrast synthesis, brain segmentation), and contrast-dependent (super-resolution, bias field estimation) tasks. Extensive experiments on six public datasets demonstrate that Brain-ID achieves state-of-the-art performance in all tasks on different MRI modalities and CT, and more importantly, preserves its performance on low-resolution and small datasets.


# 89
Strong Double Blind
TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data

Siyi Du · Shaoming Zheng · Yinsong Wang · Wenjia Bai · Declan P. O&#x27;Regan · Chen Qin

Images and structured tables are essential parts of real-world databases. Though tabular-image representation learning is promising to create new insights, it remains a challenging task, as tabular data is typically heterogeneous and incomplete, presenting significant modality disparities with images. Earlier works have mainly focused on simple modality fusion strategies in complete data scenarios, without considering the missing data issue, and thus are limited in practice. In this paper, we propose TIP, a novel tabular-image pre-training framework for learning multimodal representations robust to incomplete tabular data. Specifically, TIP investigates a novel self-supervised learning (SSL) strategy, including a masked tabular reconstruction task for tackling data missingness, and image-tabular matching and contrastive learning objectives to capture multimodal information. Moreover, TIP proposes a versatile tabular encoder tailored for incomplete, heterogeneous tabular data and a multimodal interaction module for inter-modality representation learning. Experiments are performed on downstream multimodal classification tasks using both natural and medical image datasets. The results show that TIP outperforms state-of-the-art supervised/SSL image/multimodal algorithms in both complete and incomplete data scenarios. Our code will be available at https://github.com/anonymous.


# 56
TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection

Matic Fučka · Vitjan Zavrtanik · Danijel Skocaj

Surface anomaly detection is a vital component in manufacturing inspection. Current discriminative methods follow a two-stage architecture composed of a reconstructive network followed by a discriminative network that relies on the reconstruction output. Currently used reconstructive networks often produce poor reconstructions that either still contain anomalies or lack details in anomaly-free regions. Discriminative methods are robust to some reconstructive network failures, suggesting that the discriminative network learns a strong normal appearance signal that the reconstructive networks miss. We reformulate the two-stage architecture into a single-stage iterative process that allows the exchange of information between the reconstruction and localization information. We propose a novel transparency-based diffusion process where the transparency of anomalous regions is progressively increased, restoring their normal appearance accurately while maintaining the appearance of anomaly-free regions using localization cues of previous steps. We implement the proposed process as TRANSparency DifFUSION (TransFusion), a novel discriminative anomaly detection method that achieves state-of-the-art performance on both the VisA and the MVTec AD datasets, with an image-level AUROC of 98.5% and 99.2%, respectively. Code: Added upon acceptance


# 70
Strong Double Blind
VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

Zhen Qu · Xian Tao · Mukesh Prasad · Fei Shen · Zhengtao Zhang · Xinyi Gong · Guiguang Ding

Recently, large-scale vision-language models such as CLIP have demonstrated immense potential in zero-shot anomaly segmentation (ZSAS) task, utilizing a unified model to directly detect anomalies on any unseen product given painstakingly crafted text prompts. However, existing methods often assume that the product category to be inspected is known, thus setting product-specific text prompts, which is difficult to achieve in the data privacy scenario. Moreover, even the same type of product exhibit significant differences due to specific components and variations in the production process, posing significant challenges to the design of text prompts. In this end, we propose a visual context prompting model (VCP-CLIP) for ZSAS task based on CLIP. The insight behind VCP-CLIP is to employ visual context prompting to activate CLIP’s anomalous semantic perception ability. In specific, we first design a Pre-VCP module to embed global visual information into the text prompt, thus eliminating the necessity for product-specific prompts. Then, we propose a novel Post-VCP module, that adjust the text embeddings utilizing the fine-grained features of the images. In extensive experiments conducted on 10 real-world industrial anomaly segmentation datasets, VCP-CLIP achieved state-of-the-art performance in ZSAS task. The implementation of this method will be published upon acceptance.


# 47
Strong Double Blind
Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt

Bin-Bin Gao

Unsupervised reconstruction networks using self-attention transformers have achieved state-of-the-art results in multi-class (unified) anomaly detection using a single model. However, these self-attention reconstruction models primarily operate on target features themselves, resulting in perfect reconstruction for both normal and anomaly features due to high consistency with context, leading to failure in anomaly detection. Additionally, these models often result in inaccurate anomaly segmentation due to performing reconstruction in low spatial resolution latent space. To enable reconstruction models enjoying high efficiency while enhancing their generalization for anomaly detection, we propose a simple yet effective method that reconstructs normal features and restores anomaly features with just One Normal Image Prompt (OneNIP). In contrast to previous work, OneNIP allows for the first time to reconstruct or restore anomalies with just one normal image prompt, effectively boosting anomaly detection performance. Furthermore, we propose a supervised refiner that regresses reconstruction errors by using both real normal and synthesized anomalous images, which significantly improves pixel-level anomaly segmentation. OneNIP outperforms previous methods on three industry anomaly detection benchmarks, MVTec, BTAD, and ViSA. We will make the code for OneNIP available.


# 34
Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection

Yongwei Nie · Hao Huang · Chengjiang Long · Qing Zhang · Pradipta Maji · Hongmin Cai

Video Anomaly Detection (VAD) has been extensively studied under the settings of One-Class Classification (OCC) and Weakly-Supervised learning (WS), which however both require laborious human-annotated normal/abnormal labels. In this paper, we study Unsupervised VAD (UVAD) that does not need any label by combining OCC and WS into a unified training framework. Specifically, we extend OCC to weighted OCC (wOCC) and propose a wOCC-WS interleaving training module, where the two models automatically generate pseudo-labels for each other. We face two challenges to make the combination effective: (1) Models' performance fluctuates occasionally during the training process due to the inevitable randomness of the pseudo labels. (2) Thresholds are needed to divide pseudo labels, making the training depend on the accuracy of user intervention. For the first problem, we propose to use wOCC requiring soft labels instead of OCC trained with hard zero/one labels, as soft labels exhibit high consistency throughout different training cycles while hard labels are prone to sudden changes. For the second problem, we repeat the interleaving training module multiple times, during which we propose an adaptive thresholding strategy that can progressively refine a rough threshold to a relatively optimal threshold, which reduces the influence of user interaction. A benefit of employing OCC and WS methods to compose a UVAD method is that we can incorporate the most recent OCC or WS model into our framework. We test three OCC models and two WS models. Experiments demonstrate the effectiveness of the proposed UVAD framework.


# 139
Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision

Hussain Sajwani · Dimitrios Makris · Yahya Zweiri · Fariborz Baghaei Naeini · Sanket Mr Kachole

Spiking Neural Networks (SNNs) offer a biologically inspired approach to computer vision that can lead to more efficient processing of visual data with reduced energy consumption. However, maintaining homeostasis within SNNs is challenging, as it requires continuous adjustment of neural responses to preserve equilibrium and optimal processing efficiency amidst diverse and often unpredictable input signals. In response to these challenges, we propose the Asynchronous Bioplausible Neuron (ABN), a dynamic spike firing mechanism that offers a simple yet potent auto-adjustment to variations in input signals. Its parameters, Membrane Gradient (MG), Threshold Retrospective Gradient (TRG), and Spike Efficiency (SE), make it stand out for its easy implementation, significant effectiveness, and proven reduction in power consumption, a key innovation demonstrated in our experiments. Comprehensive evaluation across various datasets demonstrates ABN's enhanced performance in image classification and segmentation, maintenance of neural equilibrium, and energy efficiency. The code will be publicly available on the GitHub Project Page.


# 58
SAIR: Learning Semantic-aware Implicit Representation

Canyu Zhang · Xiaoguang Li · Qing Guo · Song Wang

Implicit representation of an image can map arbitrary coordinates in the continuous domain to their corresponding color values, presenting a powerful capability for image reconstruction. Nevertheless, existing implicit representation approaches only focus on building continuous appearance mapping, ignoring the continuities of the semantic information across pixels. As a result, they can hardly achieve desired reconstruction results when the semantic information within input images is corrupted, for example, a large region misses. To address the issue, we propose to learn semantic-aware implicit representation (SAIR), that is, we make the implicit representation of each pixel rely on both its appearance and semantic information (\eg, which object does the pixel belong to).To this end, we propose a framework with two modules: (1) building a semantic implicit representation (SIR) for a corrupted image whose large regions miss. Given an arbitrary coordinate in the continuous domain, we can obtain its respective text-aligned embedding indicating the object the pixel belongs. (2) building an appearance implicit representation (AIR) based on the SIR. Given an arbitrary coordinate in the continuous domain, we can reconstruct its color whether or not the pixel is missed in the input. We validate the novel semantic-aware implicit representation method on the image inpainting task, and the extensive experiments demonstrate that our method surpasses state-of-the-art approaches by a significant margin.


# 55
Strong Double Blind
Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

Yibing Wei · Abhinav Gupta · Pedro Morgado

Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images. It excels in region-aware learning and provides strong initializations for various tasks, but struggles to capture high-level semantics without further supervised fine-tuning, likely due to the low-level nature of its pixel reconstruction objective. A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with the high-level targets. However, this approach poses significant training challenges as the reconstruction targets are learned in conjunction with the model, potentially leading to trivial or suboptimal solutions. Our study is among the first to thoroughly analyze and address the challenges of such framework, which we refer to as Latent MIM. Through a series of carefully designed experiments and extensive analysis, we identify the source of these challenges, including representation collapsing for joint onlin/target optimization, learning objectives, the high region correlation in latent space and decoding conditioning. By sequentially addressing these issues, we demonstrate that latent MIM can indeed learn high-level representations while retaining the benefits of MIM models.


# 52
Learning with Unmasked Tokens Drives Stronger Vision Learners

Taekyung Kim · Sanghyuk Chun · Byeongho Heo · Dongyoon Han

Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIMs such as Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens for the encoder to process, with the decoder reconstructing the masked tokens to the input. However, MIM pre-trained encoders often exhibit a limited attention span, attributed to MIM's sole focus on regressing masked tokens only, which may impede the encoder's broader context learning. To tackle the limitation, we improve MIM by explicitly incorporating unmasked tokens into the training process. Specifically, our method enables the encoder to learn from broader context supervision, allowing unmasked tokens to experience broader contexts while the decoder reconstructs masked tokens. Thus, the encoded unmasked tokens are equipped with extensive contextual information, empowering masked tokens to leverage the enhanced unmasked tokens for MIM. As a result, our simple remedy trains more discriminative representations revealed by achieving 84.2% top-1 accuracy with ViT-B on ImageNet-1K with 0.6%p gain. We attribute the success to the enhanced pre-training method, as evidenced by the singular value spectrum and attention analyses. Finally, our models achieve significant performance gains at the downstream semantic segmentation and fine-grained visual classification tasks; and on diverse robust evaluation metrics. Our code will be publicly available.


# 81
Emerging Property of Masked Token for Effective Pre-training

Hyesong Choi · Hunsang Lee · Seyoung Joung · Hyejin Park · Jiyeong Kim · Dongbo Min

Driven by the success of Masked Language Modeling (MLM), the realm of self-supervised learning for computer vision has been invigorated by the central role of Masked Image Modeling (MIM) in driving recent breakthroughs. Notwithstanding the achievements of MIM across various downstream tasks, its overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase. This paper presents a perspective that the optimization of masked tokens as a means of addressing the prevailing issue. Initially, we delve into an exploration of the inherent properties that a masked token ought to possess. Within the properties, we principally dedicated to articulating and emphasizing the `data singularity' attribute inherent in masked tokens. Through a comprehensive analysis of the heterogeneity between masked tokens and visible tokens within pre-trained models, we propose a novel approach termed \textbf{masked token optimization (MTO)}, specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens. The proposed method serves as an adaptable solution that seamlessly integrates into any MIM approach that leverages masked tokens. As a result, MTO achieves a considerable improvement in pre-training efficiency, resulting in an approximately 50% reduction in pre-training epochs required to attain converged performance of the recent approaches.


# 51
Strong Double Blind
Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

Danish Nazir · Timo Bartels · Jan Piewek · Thorsten Bagdonat · Tim Fingscheidt

Distributed computing in the context of deep neural networks (DNNs) implies the execution of one part of the network on edge devices and the other part typically on a large-scale cloud platform. Conventional methods propose to employ a serial concatenation of a learned image and source encoder, the latter projecting the image encoder output (bottleneck features) into a quantized representation for bitrate-efficient transmission. In the cloud, a respective source decoder reprojects the quantized representation to the original feature representation, serving as an input for the downstream task decoder performing, e.g., semantic segmentation. In this work, we propose joint source and task decoding, as it allows for a smaller network size in the cloud. This further enables the scalability of such services in large numbers without asking for extensive computational load on the cloud per channel. We demonstrate the effectiveness of our method by achieving a distributed semantic segmentation SOTA over a wide range of bitrates on the mean intersection over union metric, while using only 9.8% ... 11.59% of cloud DNN parameters used in previous SOTA on the COCO and Cityscapes datasets.


# 60
The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers

Seungwoo Son · Jegwang Ryu · Namhoon Lee · Jaeho Lee

Knowledge distillation is an effective method for training lightweight vision models. However, acquiring teacher supervision for training samples is often costly, especially from large-scale models like vision transformers (ViTs). In this paper, we develop a simple framework to reduce the supervision cost of ViT distillation: masking out a fraction of input tokens given to the teacher. By masking input tokens, one can skip the computations associated with the masked tokens without requiring any change to teacher parameters or architecture. We find that masking patches with the lowest student attention scores is highly effective, saving up to 50% of teacher FLOPs without any drop in student accuracy, while other masking criterion leads to suboptimal efficiency gains. Through in-depth analyses, we reveal that the student-guided masking provides a good curriculum to the student, making teacher supervision easier to follow during the early stage and challenging in the later stage.


# 63
Strong Double Blind
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

Haiwen Diao · Bo Wan · XU JIA · Yunzhi Zhuge · Ying Zhang · Huchuan Lu · Long Chen

Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for its ability to adapt large pre-trained models to downstream tasks, greatly reducing trainable parameters while grappling with memory challenges during fine-tuning. To further alleviate memory burden, memory-efficient series (METL) avoid backpropagating gradients through the large backbone. However, they compromise by exclusively relying on frozen intermediate outputs and limiting the exhaustive exploration of prior knowledge from pre-trained models. Moreover, the dependency and redundancy between cross-layer features are frequently overlooked, thereby submerging more discriminative representations and causing an inherent performance gap (vs. conventional PETL methods). Hence, we propose an innovative strategy called SHERL, which stands for Synthesizing High-accuracy and Efficient-memory under Resource-Limited scenarios. Concretely, SHERL introduces a Multi-Tiered Sensing Adapter (MTSA) to decouple the entire adaptation into two successive and complementary processes. In the early route, intermediate outputs are consolidated via an anti-redundancy operation, enhancing their compatibility for subsequent interactions; thereby in the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead and regulate these fairly flexible features into more adaptive and powerful representations for new domains. We validate SHERL on 18 datasets across various vision-and-language and language-only NLP tasks. Extensive ablations demonstrate that SHERL combines the strengths of both parameter and memory-efficient techniques, performing on-par or better across diverse architectures with lower memory during fine-tuning.


# 27
Strong Double Blind
Tight and Efficient Upper Bound on Spectral Norm of Convolutional Layers

Ekaterina Grishina · Mikhail Gorbunov · Maxim Rakhuba

Controlling the spectral norm of convolution layers has been shown to enhance generalization and robustness in CNNs, as well as training stability and the quality of generated samples in GANs. Existing methods for computing singular values either result in loose bounds or lack scalability with input and kernel sizes. In this paper, we obtain a new upper bound that is independent of input size, differentiable and can be efficiently computed during training. Through experiments, we demonstrate how this new bound can be used to improve the performance of convolutional architectures.


# 44
Strong Double Blind
FYI: Flip Your Images for Dataset Distillation

Byunggwan Son · Youngmin Oh · Donghyeon Baek · BUMSUB HAM

Dataset distillation synthesizes a small set of images from a large-scale real dataset such that synthetic and real images share similar behavioral properties (e.g, distributions of gradients or features) during a training process. Through extensive analyses on current methods and real datasets, together with empirical observations, we provide in this paper two important things to share for dataset distillation. First, object parts that appear on one side of a real image are highly likely to appear on the opposite side of another image within a dataset, which we call the bilateral equivalence. Second, the bilateral equivalence enforces synthetic images to duplicate discriminative parts of objects on both the left and right sides of the images, which prevents recognizing subtle differences between objects. To address this problem, we introduce a surprisingly simple yet effective technique for dataset distillation, dubbed FYI, that enables distilling rich semantics of real images into synthetic ones. To this end, FYI embeds a horizontal flipping technique into a distillation process, mitigating the influence of the bilateral equivalence, while capturing more details of objects. Experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet demonstrate that FYI can be seamlessly integrated into several state-of-the-art methods, without modifying training objectives and network architectures, and it improves the distillation performance remarkably. We will make our code and synthetic datasets publicly available online.


# 39
Strong Double Blind
Data-to-Model Distillation: Data-Efficient Learning Framework

Ahmad Sajedi · Samir Khaki · Lucy Z. Liu · Ehsan Amjadian · Yuri A. Lawryshyn · Konstantinos Plataniotis

Dataset distillation aims to distill the knowledge of a large-scale real dataset into small yet informative synthetic data such that a model trained on it performs as well as a model trained on the full dataset. Despite recent progress, existing dataset distillation methods often struggle with computational efficiency, scalability to complex high-resolution datasets, and generalizability to deep architectures. These approaches typically require retraining when the distillation ratio changes, as knowledge is embedded in raw pixels. In this paper, we propose a novel framework called Data-to-Model distillation (D2M) to distill the real dataset’s knowledge into the learnable parameters of a pre-trained generative model by aligning rich representations extracted from real and generated images. The learned generative model can then produce informative training images for different distillation ratios and deep architectures. Extensive experiments on 15 datasets of varying resolutions show D2M's superior performance, re-distillation efficiency, and cross-architecture generalizability. Our method effectively scales up to high-resolution 128x128 ImageNet-1K. Furthermore, we verify D2M's practical benefits for downstream applications in neural architecture search.


# 15
Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection

Yunfeng Fan · Wenchao Xu · Haozhao Wang · Fushuo Huo · Jinyu Chen · Song Guo

Selecting proper clients to participate in each federated learning (FL) round is critical to effectively harness a broad range of distributed data. Existing client selection methods simply consider to mine the distributed uni-modal data, yet, their effectiveness may diminish in multi-modal FL (MFL) as the modality imbalance problem not only impedes the collaborative local training but also lead to a severe global modality-level bias. We empirically reveal that local training with a certain single modality may contribute more to the global model than training with all local modalities. To effectively exploit the distributed multimodalities, we propose a novel Balanced Modality Selection framework for MFL (BMSFed) to overcome the modal bias. On the one hand, we introduce a modal enhancement loss during local training to alleviate local imbalance based on the aggregated global prototypes. On the other hand, we propose the modality selection aiming to select subsets of local modalities with great diversity and achieving global modal balance simultaneously. Our extensive experiments on audio-visual, colored-gray, and front-back datasets showcase the superiority of BMSFed over baselines and its effectiveness in multi-modal data exploitation.


# 40
Active Generation for Image Classification

Tao Huang · Jiaqi Liu · Shan You · Chang Xu

Recently, the growing capabilities of deep generative models have underscored their potential in enhancing image classification accuracy. However, existing methods often demand the generation of a disproportionately large number of images compared to the original dataset, while having only marginal improvements in accuracy. This computationally expensive and time-consuming process hampers the practicality of such approaches. In this paper, we propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model. With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation. It aims to create images akin to the challenging or misclassified samples encountered by the current model and incorporates these generated images into the training set to augment model performance. ActGen introduces an attentive image guidance technique, using real images as guides during the denoising process of a diffusion model. The model's attention on class prompt is leveraged to ensure the preservation of similar foreground object while diversifying the background. Furthermore, we introduce a gradient-based generation guidance method, which employs two losses to generate more challenging samples and prevent the generated images from being too similar to previously generated ones. Experimental results on the CIFAR and ImageNet datasets demonstrate that our method achieves better performance with a significantly reduced number of generated images.


# 16
Strong Double Blind
Contrastive Learning with Synthetic Positives

Dewen Zeng · Xinrong Hu · Yawen Wu · Xiaowei Xu · Yiyu Shi

Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques by utilizing the similarity of multiple instances within the same class. However, its efficacy is constrained as the nearest neighbor algorithm primarily identifies "easy" positive pairs, where the representations are already closely located in the embedding space. In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (CLSP) that utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives. Through feature interpolation in the diffusion model sampling process, we generate images with distinct backgrounds yet similar semantic content to the anchor image. These images are considered "hard" positives for the anchor image, and when included as supplementary positives in the contrastive loss, they contribute to a performance improvement of over 2% and 1% in linear evaluation compared to the previous NNCLR and All4One methods across multiple benchmark datasets such as CIFAR10, achieving state-of-the-art methods. On transfer learning benchmarks, CLSP outperforms existing SSL frameworks on 6 out of 8 downstream datasets. We believe CLSP establishes a valuable baseline for future SSL studies incorporating synthetic data in the training process.


# 72
Strong Double Blind
Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

Longxiang Tang · Zhuotao Tian · Kai Li · Chunming He · Hantao Zhou · Hengshuang ZHAO · Xiu Li · Jiaya Jia

This study addresses the Domain-Class Incremental Learning problem, a realistic but challenging continual learning scenario where both the domain distribution and target classes vary across tasks. To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability. However, this incurs a new problem: the knowledge encoded in the pre-trained VLMs may be disturbed when adapting to new tasks, compromising their inherent zero-shot ability. Existing methods tackle it by tuning VLMs with knowledge distillation on extra datasets, which demands heavy computation overhead. To address this problem efficiently, we propose the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, retaining pre-trained knowledge of VLMs from a perspective of avoiding information interference. Specifically, we design a fully residual mechanism to infuse newly learned knowledge into a frozen backbone, while introducing minimal adverse impacts on pre-trained knowledge. Besides, this residual property enables our distribution-aware integration calibration scheme, explicitly controlling the information implantation process for test data from unseen distributions. Experiments demonstrate that our DIKI surpasses the current state-of-the-art approach using only 0.86% of the trained parameters and requiring substantially less training time.


# 79
Strong Double Blind
Robust Calibration of Large Vision-Language Adapters

Balamurali Murugesan · Julio Silva-Rodríguez · Ismail Ben Ayed · Jose Dolz

This paper addresses the critical issue of miscalibration in CLIP-based adaptation models in the challenging scenario of out-of-distribution samples, which has been overlooked in the existing literature on CLIP adaptation. We empirically demonstrate that popular CLIP adaptation approaches, such as adapters, prompt learning, and test-time prompt tuning, substantially degrade the calibration capabilities of the zero-shot baseline in the presence of distributional drift. We identify the increase in logit ranges as the underlying cause of miscalibration of CLIP adaptation methods, contrasting with previous work on calibrating fully supervised models. Motivated by these observations, we present a simple and model-agnostic solution to mitigate miscalibration, by scaling the logit range of each sample based on its zero-shot prediction logits. We explore three different alternatives to achieve this, which can be either integrated during adaptation, or directly used at inference time. Comprehensive experiments on popular OOD classification benchmarks demonstrate the effectiveness of the proposed approaches in mitigating miscalibration while maintaining discriminative performance, whose improvements are consistent across the three families of these increasingly popular approaches.


# 28
Strong Double Blind
Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models

Reza Abbasi · Mohammad Rohban · Mahdieh Soleymani Baghshah

CLIP models have recently shown to exhibit Out of Distribution (OoD) generalization capabilities. However, Compositional Out of Distribution (C-OoD) generalization, which is a crucial aspect of a model's ability to understand unseen compositions of known concepts, is relatively unexplored for the CLIP models. Our goal is to address this problem and identify the factors that contribute to the C-OoD in CLIPs. We noted that previous studies regarding compositional understanding of CLIPs frequently fail to ensure that test samples are genuinely novel relative to the CLIP training data. To this end, we carefully synthesized a large and diverse dataset in the single object setting, comprising attributes for objects that are highly unlikely to be encountered in the combined training datasets of various CLIP models. This dataset enables an authentic evaluation of C-OoD generalization. Our observations reveal varying levels of C-OoD generalization across different CLIP models. We propose that the disentanglement of CLIP representations serves as a critical indicator in this context. By utilizing our synthesized datasets and other existing datasets, we assess various disentanglement metrics of text and image representations. Our study reveals that the disentanglement of image and text representations, particularly with respect to their compositional elements, plays a crucial role in improving the generalization of CLIP models in out-of-distribution settings. This finding suggests promising opportunities for advancing out-of-distribution generalization in CLIPs.


# 25
FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning

Oscar Skean · Aayush Dhakal · Nathan Jacobs · Luis G Sanchez Giraldo

Self-supervised learning (SSL) is a popular paradigm for representation learning. Recent multiview methods can be classified as sample-contrastive, dimension-contrastive, or asymmetric network-based, with each family having its own approach to avoiding informational collapse. While these families converge to solutions of similar quality, it can be empirically shown that some methods are epoch-inefficient and require longer training to reach a target performance. Two main approaches to improving efficiency are covariance eigenvalue regularization and using more views. However, these two approaches are difficult to combine due to the computational complexity of computing eigenvalues. We present the objective function FroSSL which reconciles both approaches while avoiding eigendecomposition entirely. FroSSL works by minimizing covariance Frobenius norms to avoid collapse and minimizing mean-squared error to achieve augmentation invariance. We show that FroSSL reaches competitive accuracies more quickly than any other SSL method and provide theoretical and empirical support that this faster convergence is due to how FroSSL affects the eigenvalues of the embedding covariance matrices. We also show that FroSSL learns competitive representations on linear probe evaluation when used to train a ResNet18 on several datasets, including STL-10, Tiny Imagenet, and Imagenet-100. We plan to release all code upon acceptance.


# 24
Strong Double Blind
Benchmarking Spurious Bias in Few-Shot Image Classifiers

Guangtao Zheng · Wenqian Ye · Aidong Zhang

Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in a certain set of samples and few-shot classifiers can suffer from spurious bias induced from it. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. We propose attribute-based sample selection strategies based on a pre-trained vision-language model to construct these few-shot tasks, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. The code will be public upon acceptance.


# 12
Strong Double Blind
An Information Theoretical View for Out-Of-Distribution Detection

Jinjing Hu · Wenrui Liu · Hong Chang · Bingpeng MA · Shiguang Shan · Xilin CHEN

Detecting out-of-distribution (OOD) inputs are pivotal for real-world applications. However, due to the inaccessibility of OODs during training phase, applying supervised binary classification with in-distribution (ID) and OOD labels is not feasible. Therefore, previous works typically employ the proxy ID classification task to learn feature representation for OOD detection task. In this study, we delve into the relationship between the two tasks through the lens of Information Theory. Our analysis reveals that optimizing the classification objective could inevitably cause the over-confidence and undesired compression of OOD detection-relevant information. To address these two problems, we propose OOD Entropy Regularization (OER) to regularize the information captured in classification-oriented representation learning for detecting OOD samples. Both theoretical analyses and experimental results underscore the consistent improvement of OER on OOD detection.


# 18
Strong Double Blind
ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

Erik Wallin · Lennart Svensson · Fredrik Kahl · Lars Hammarstrand

In semi-supervised learning, open-set scenarios present a challenge with the inclusion of unknown classes. Traditional methods predominantly use softmax outputs for distinguishing in-distribution (ID) from out-of-distribution (OOD) classes and often resort to arbitrary thresholds, overlooking the data's statistical properties. Our study introduces a new classification approach based on the angular relationships within feature space, alongside an iterative algorithm to probabilistically assess whether data points are ID or OOD, based on their conditional distributions. This methodology is encapsulated in ProSub, our proposed framework for open-set semi-supervised learning, which has demonstrated superior performance on several benchmarks. The accompanying source code will be released upon publication.


# 11
Strong Double Blind
Adapting to Shifting Correlations with Unlabeled Data Calibration

Minh Nguyen · Alan Q Wang · Heejong Kim · Mert Sabuncu

Distribution shifts between sites can seriously degrade model performance since models are prone to exploiting spurious correlations. Thus, many methods try to find features that are stable across sites and discard “spurious” or unstable features. However, unstable features might have complementary information that, if used appropriately, could increase accuracy. More recent methods try to adapt to unstable features at the new sites to achieve higher accuracy. However, they either make unrealistic assumptions or fail to scale to multiple confounding features. We propose Generalized Prevalence Adjustment (GPA for short), an flexible method that adjusts model prediction to the shifting correlations between prediction target and confounders to safely exploit unstable features. GPA can infer the interaction between target and confounders in new sites using unlabeled samples from those sites. We evaluate GPA on several real and synthetic datasets, and show that it outperforms competitive baselines.


# 4
Strong Double Blind
Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels

Jae Soon Baik · In Young Yoon · Kun Hoon Kim · Jun Won Choi

Deep neural networks have demonstrated remarkable advancements in various fields using large, well-annotated datasets. However, real-world data often exhibit a long-tailed distribution and label noise, significantly degrading generalization performance. Recent studies to address these issues have focused on sample selection methods that estimate the centroid of each class based on a high-confidence sample set within the corresponding target class. However, these methods are limited because they use only a small portion of the training dataset for class centroid estimation due to long-tailed distributions and noisy labels. In this study, we introduce a novel feature-based sample selection method called Distribution-aware Sample Selection (DaSS). Specifically, DaSS leverages model predictions to incorporate features from not only the target class but also various other classes, producing class centroids. By integrating this approach with temperature scaling, we can adaptively exploit informative training data, resulting in class centroids that more accurately reflect the true distribution under long-tailed noisy scenarios. Moreover, we propose confidence-aware contrastive learning to obtain a balanced and robust representation under long-tailed distributions and noisy labels. This method employs semi-supervised balanced contrastive loss for high-confidence labeled samples to mitigate class bias by leveraging trustworthy label information alongside mixup-enhanced instance discrimination loss for low-confidence labeled samples to improve their representation degraded by noisy labels. Comprehensive experimental results on CIFAR and real-world noisy-label datasets demonstrate the effectiveness and superior performance of our method over baselines.


# 17
On Pretraining Data Diversity for Self-Supervised Learning

Hasan Abed El Kader Hammoud · Tuhin Das · Fabio Pizzati · Philip Torr · Adel Bibi · Bernard Ghanem

We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the inherent distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. The code and trained models will be made available online.


# 2
De-Confusing Pseudo-Labels in Source-Free Domain Adaptation

Idit Diamant · Amir Rosenfeld · Idan Achituve · Jacob Goldberger · Arnon Netzer

Source-free domain adaptation (SFDA) aims to adapt a source-trained model to an unlabeled target domain without access to the source data. SFDA has attracted growing attention in recent years, where existing approaches focus on self-training that usually includes pseudo-labeling techniques. In this paper, we introduce a novel noise-learning approach tailored to address noise distribution in domain adaptation settings and learn to de-confuse the pseudo-labels. More specifically, we learn a noise transition matrix of the pseudo-labels to capture the label corruption of each class and learn the underlying true label distribution. Estimating the noise transition matrix enables a better true class-posterior estimation, resulting in better prediction accuracy. We demonstrate the effectiveness of our approach when combined with several SFDA methods: SHOT, SHOT++, and AaD. We obtain state-of-the-art results on three domain adaptation datasets: VisDA, DomainNet, and OfficeHome.


# 6
Strong Double Blind
Improving Unsupervised Domain Adaptation: A Pseudo-Candidate Set Approach

Aveen Dayal · Rishabh Lalla · Linga Reddy Cenkeramaddi · C. Krishna Mohan · Abhinav Kumar · Vineeth N Balasubramanian

Unsupervised domain adaptation (UDA) is a critical challenge in machine learning, aiming to transfer knowledge from a labeled source domain to an unlabeled target domain. In this work, we aim to improve target set accuracy in any existing UDA method by introducing an approach that utilizes pseudo-candidate sets for labeling the target data. These pseudo-candidate sets serve as a proxy for the true labels in the absence of direct supervision. To enhance the accuracy of the target domain, we propose Unsupervised Domain Adaptation refinement using Pseudo-Candidate Sets (UDPCS), a method which effectively learns to disambiguate among classes in the pseudo-candidate set. Our approach is characterized by two distinct loss functions: one that acts on the pseudo-candidate set to refine its predictions and another that operates on the labels outside the pseudo-candidate set. We use a threshold-based strategy to further guide the learning process toward accurate label disambiguation. We validate our novel yet simple approach through extensive experiments on three well-known benchmark datasets: Office-Home, VisDA, and DomainNet. Our experimental results demonstrate the efficacy of our method in achieving consistent gains on target accuracies across these datasets.


# 3
Strong Double Blind
Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation

Bowei Xing · Xianghua Ying · Ruibin Wang · Ruohao Guo · Ji Shi · Wenzhen Yue

Source free domain adaptation (SFDA) aims to transfer the model trained on labeled source domain to unlabeled target domain without accessing source data. Recent SFDA methods predominantly rely on self-training, which supervise the model with pseudo labels generated from individual data samples. However, they often ignore the crucial data structure information and sample relationships that are beneficial for adaptive training. In this paper, we propose a novel hierarchical relation distillation framework, establishing multi-level relations across samples in an unsupervised manner, which fully exploits inherent data structure to guide the sample training instead of using the isolated pseudo labels. We first distinguish the source-like samples based on prediction reliability in the training process, followed by an effort on distilling knowledge to those target-specific ones by transferring both local clustering relation and global semantic relation. Specifically, we leverage the affinity with nearest neighborhood samples for local relation and consider the similarity to category-wise Gaussian Mixtures for global relation, offering complementary supervision to facilitate student learning. To validate the effectiveness of our approach, we conduct extensive experiments on four diverse benchmarks, achieving better performance compared to previous methods.


# 14
Strong Double Blind
Source-Free Domain-Invariant Performance Prediction

Ekaterina Khramtsova · Mahsa Baktashmotlagh · Guido Zuccon · Xi Wang · Mathieu Salzmann

Accurately estimating model performance poses a significant challenge, particularly in scenarios where the source and target domains follow different data distributions. Most existing performance prediction methods heavily rely on the source data in their estimation process, limiting their applicability in a more realistic setting where only the trained model is accessible. The few methods that do not require source data exhibit considerably inferior performance. In this work, we propose a source-free approach centred on uncertainty-based estimation, using a generative model for calibration in the absence of source data. We establish connections between our approach for unsupervised calibration and temperature scaling. We then employ a gradient-based strategy to evaluate the correctness of the calibrated predictions. Our experiments on benchmark object recognition datasets reveal that existing source-based methods fall short with limited source sample availability. Furthermore, our approach significantly outperforms the current state-of-the-art source-free and source-based methods, affirming its effectiveness in domain-invariant performance estimation.


# 7
Strong Double Blind
Learning to Complement and to Defer to Multiple Users

Zheng Zhang · Wenjie Ai · Kevin Wells · David M Rosewarne · Thanh-Toan Do · Gustavo Carneiro

With the development of Human-AI Collaboration in Classification (HAI-CC), integrating users and AI predictions becomes challenging due to the complex decision-making process. This process has three options: 1) AI autonomously classifies, 2) learning to complement, where AI collaborates with users, and 3) learning to defer, where AI defers to users. Despite their interconnected nature, these options have been studied in isolation rather than as components of a unified system. In this paper, we address this weakness with the novel HAI-CC methodology, called Learning to Complement and to Defer to Multiple Users (LECODU). LECODU not only combines learning to complement and learning to defer strategies, but it also incorporates an estimation of the optimal number of users to engage in the decision process. The training of LECODU maximises classification accuracy and minimises collaboration costs associated with user involvement. Comprehensive evaluations across real-world and synthesized datasets demonstrate LECODU's superior performance compared to state-of-the-art HAI-CC methods. Remarkably, even when relying on unreliable users with high rates of label noise, LECODU exhibits significant improvement over both human decision-makers alone and AI alone.


# 5
Strong Double Blind
Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation

Zhilin Zhu · Xiaopeng Hong · Zhiheng Ma · Weijun Zhuang · YaoHui Ma · Yong Dai · Yaowei Wang

Continual Test-Time Adaptation (CTTA) involves adapting a pre-trained source model to continually changing unsupervised target domains. In this paper, we systematically analyze the challenges of this task: online environment, unsupervised nature, and the risks of error accumulation and catastrophic forgetting under continual domain shifts. To address these challenges, we reshape the online data buffering and organizing mechanism for CTTA. By introducing an uncertainty-aware importance sampling approach, we dynamically evaluate and buffer the most significant samples with high certainty from the unsupervised, single-pass data stream. By efficiently organizing these samples into a class relation graph, we introduce a novel class relation preservation constraint to maintain the intrinsic class relations, efficiently overcoming catastrophic forgetting. Furthermore, a pseudo-target replay objective is incorporated to assign higher weights to these samples and mitigate error accumulation. Extensive experiments demonstrate the superiority of our method in both segmentation and classification CTTA tasks.


# 9
Strong Double Blind
Personalized Federated Domain-Incremental Learning based on Adaptive Knowledge Matching

Yichen Li · Wenchao Xu · Haozhao Wang · Yining Qi · Jingcai Guo · Ruixuan Li

This paper focuses on Federated Domain Incremental Learning (FDIL) where each client continues to learn incremental tasks where their domain shifts from each other. We propose a novel adaptive knowledge matching-based personalized FDIL approach (pFedDIL) which allows each client to alternatively utilize appropriate incremental task learning strategy on the correlation with the knowledge from previous tasks. More specifically, when a new task arrives, each client first calculates its local correlations with previous tasks. Then, the client can choose to adopt a new initial model or a previous model with similar knowledge to train the new task and simultaneously migrate knowledge from previous tasks based on these correlations. Furthermore, to identify the correlations between the new task and previous tasks for each client, we employ an auxiliary classifier to each target classification model and propose sharing partial parameters between the target classification model and the auxiliary classifier to condense model parameters. We conduct extensive experiments on several datasets of which results demonstrate that pFedDIL outperforms state-of-the-art methods by up to 14.35\% in terms of average accuracy of all tasks.


# 13
Revisiting Supervision for Continual Representation Learning

Daniel Marczak · Sebastian Cygert · Tomasz Trzcinski · Bartlomiej Twardowski

In the field of continual learning, models are designed to learn tasks one after the other. While most research has centered on supervised continual learning, there is a growing interest in unsupervised continual learning, which makes use of the vast amounts of unlabeled data. Recent studies have highlighted the strengths of unsupervised methods, particularly self-supervised learning, in providing robust representations. The improved transferability of those representations built with self-supervised methods is often associated with the role played by the multi-layer perceptron projector. In this work, we depart from this observation and reexamine the role of supervision in continual representation learning. We reckon that additional information, such as human annotations, should not deteriorate the quality of representations. Our findings show that supervised models when enhanced with a multi-layer perceptron head, can outperform self-supervised models in continual representation learning. This highlights the importance of the multi-layer perceptron projector in shaping feature transferability across a sequence of tasks in continual learning.


# 26
Strong Double Blind
Deep Companion Learning: Enhancing Generalization Through Historical Consistency

Ruizhao Zhu · Venkatesh Saligrama

We propose Deep Companion Learning (DCL), a novel training method for Deep Neural Networks (DNNs) that enhances generalization by penalizing inconsistent model predictions compared to its historical performance. To achieve this, we train a deep-companion model (DCM), by using previous versions of the model to provide forecasts on new inputs. This companion model deciphers a meaningful latent semantic structure within the data, thereby providing targeted supervision that encourages the primary model to address the scenarios it finds most challenging. We validate our approach through both theoretical analysis and extensive experimentation, including ablation studies, on a variety of benchmark datasets (CIFAR-100, Tiny-ImageNet, ImageNet-1K) using diverse architectural models (ShuffleNetV2, ResNet, Vision Transformer, etc.), demonstrating state-of-the-art performance.


# 21
Strong Double Blind
Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy

Tao Li · Weisen Jiang · Fanghui Liu · Xiaolin Huang · James Kwok

Pre-training followed by fine-tuning is widely adopted among practitioners. The performance can be improved by ``model soups''~\cite{wortsman2022model} via exploring various hyperparameter configurations. The Learned-Soup, a variant of model soups, significantly improves the performance but suffers from substantial memory and time costs due to the requirements of (i) having to load all fine-tuned models simultaneously, and (ii) a large computational graph encompassing all fine-tuned models. In this paper, we propose Memory Efficient Hyperplane Learned Soup (MEHL-Soup) to tackle this issue by formulating the learned soup as a hyperplane optimization problem and introducing block coordinate gradient descent to learn the mixing coefficients. At each iteration, MEHL-Soup only needs to load a few fine-tuned models and build a computational graph with one combined model. We further extend MEHL-Soup to MEHL-Soup+ in a layer-wise manner. Experiments on various ViT models and data sets show that MEHL-Soup(+) outperforms Learned-Soup(+) in terms of test accuracy, and also has more than 13x reduction in memory. Moreover, MEHL-Soup(+) can be run on a single GPU and achieves over 9x reduction in soup construction time compared with the Learned-Soup.


# 19
Strong Double Blind
Harmonizing knowledge Transfer in Neural Network with Unified Distillation

yaomin huang · faming Fang · Zaoming Yan · Chaomin Shen · guixu zhang

Knowledge distillation (KD), known for its ability to transfer knowledge from a cumbersome network (teacher) to a lightweight one (student) without altering the architecture, has been garnering increasing attention. Two primary categories emerge within KD methods: feature-based, focusing on intermediate layers' features, and logits-based, targeting the final layer's logits. This paper introduces a novel perspective by leveraging diverse knowledge sources within a unified KD framework. Specifically, we aggregate features from intermediate layers into a comprehensive representation, efficiently capturing essential knowledge without redundancy. Subsequently, we predict the distribution parameters from this representation. These steps transform knowledge from the intermediate layers into corresponding distributive forms, which are then conveyed through a unified distillation framework. Numerous experiments were conducted to validate the effectiveness of the proposed method. Remarkably, the distilled student network not only significantly outperformed its original counterpart but also, in many cases, surpassed the teacher network.


# 10
Strong Double Blind
Feature Diversification and Adaptation for Federated Domain Generalization

Seunghan Yang · Seokeon Choi · Hyunsin Park · Sungha Choi · Simyung Chang · Sungrack Yun

Federated learning, a distributed learning paradigm, utilizes multiple clients to build a robust global model. In real-world applications, local clients often operate within their limited domains, leading to a `domain shift' across clients. Privacy concerns limit each client's learning to its own domain data, which increase the risk of overfitting. Moreover, the process of aggregating models trained on own limited domain can be potentially lead to a significant degradation in the global model performance. To deal with these challenges, we introduce the concept of federated feature diversification. Each client diversifies the own limited domain data by leveraging global feature statistics, i.e., the aggregated average statistics over all participating clients, shared through the global model's parameters. This data diversification helps local models to learn client-invariant representations without causing privacy issues. Our resultant global model shows robust performance on unseen test domain data. To enhance performance further, we develop an instance-adaptive inference approach tailored for test domain data. Our proposed instance feature adapter dynamically adjusts feature statistics to align with the test input, thereby reducing the domain gap between the test and training domains. We show that our method achieves state-of-the-art performance on several domain generalization benchmarks within a federated learning setting.


# 8
Strong Double Blind
PFedEdit: Personalized Federated Learning via Automated Model Editing

Haolin Yuan · William Paul · John Aucott · Philippe Burlina · Yinzhi Cao

Federated learning (FL) allows clients to train a deep learning model collaboratively while maintaining their private data locally. One challenging problem facing FL is that the model utility drops significantly once the data distribution gets heterogeneous, or non-i.i.d, among clients. A promising solution is to personalize models for each client, e.g.,keeping some layers locally without aggregation, which is thus called personalized FL. However, previous personalized FL often suffer from suboptimal utility because their choice of layer personalization is based on empirical knowledge and fixed for different datasets and distributions. In this work, we design PFedEdit, the first federated learning frame work that leverages automated model editing to optimize the choice of personalization layers and improve model utility under a variety of data distributions including non-i.i.d. The high-level idea of PFedEdit is to assess the effectiveness of every global model layer in improving model utility on local data distribution once edited, and then to apply edits on the top-k most effective layers. Our evaluation shows that PFedEdit outperforms six state-of-the-art approaches on three benchmark datasets by 6% on the model’s performance on average, with the largest accuracy improvement being 26.6%. PFedEdit is open-source and available at this repository:


# 33
Enhanced Sparsification via Stimulative Training

Shengji Tang · Weihao Lin · Hancheng Ye · Peng Ye · Chong Yu · Baopu Li · Tao Chen

Sparsification-based pruning has been an important category in model compression. Existing methods commonly set sparsity-inducing penalty terms to suppress the importance of dropped weights, which is regarded as suppressed sparsification paradigm. However, this paradigm inactivates the to-be-dropped part of networks causing capacity damage before pruning, thereby leading to performance degradation. To alleviate this issue, we first study and reveal the relative sparsity effect in emerging stimulative training and then propose an enhanced sparsification paradigm framework named STP for structured pruning, which maintains the magnitude of dropped weights and enhances the expressivity of kept weights by self-distillation. Besides, to obtain a relatively optimal architecture of the final network, we propose multi-dimension architecture space and a knowledge distillation guided exploration strategy. To reduce the huge capacity gap of distillation, we propose a subnet mutating expansion technique. Extensive experiments on various benchmarks indicate the effectiveness of STP. Specifically, without pre-training or fine-tuning, our method consistently achieves superior performance at different budgets, especially under extremely aggressive pruning scenarios, e.g., remaining 95.11% Top-1 accuracy (72.43% in 76.15%) while reducing 85% FLOPs for ResNet-50 on ImageNet. Codes will be released.


# 30
Strong Double Blind
Dependency-aware Differentiable Neural Architecture Search

Buang Zhang · Xinle Wu · Hao Miao · Bin Yang · Chenjuan Guo

Neural architecture search (NAS) technology has reduced the burden of manual design by automatically building neural network architectures, among which differential NAS approaches such as DARTS, have gained popularity for the search efficiency. Despite achieving promising performance, the DARTS series methods still suffer two issues: 1) It does not explicitly establish dependencies between edges, potentially leading to suboptimal performance. 2) The high degree of parameter sharing results in inaccurate performance evaluations of subnets. To tackle these issues, we propose to model dependencies explicitly between different edges to construct a high-performance architecture distribution. Specifically, we model the architecture distribution in DARTS as a multivariate normal distribution with learnable mean vector and correlation matrix, representing the base architecture weights of each edge and the dependencies between different edges, respectively. Then, we sample architecture weights from this distribution and alternately train these learnable parameters and network weights by gradient descent. With the learned dependencies, we prune the search space dynamically to alleviate the inaccurate evaluation by only sharing weights among high-performance architectures. Besides, we identify good motifs by analyzing the learned dependencies, which guide human experts to manually design high-performance neural architectures. Extensive experiments and competitive results on multiple NAS Benchmarks demonstrate the effectiveness of our method.


# 35
Strong Double Blind
Layer-Wise Relevance Propagation with Conservation Property for ResNet

Seitaro Otsuki · Tsumugi Iida · Félix Doublet · Tsubasa Hirakawa · Takayoshi Yamashita · Hironobu Fujiyoshi · Komei Sugiura

Transparent formulation of explanation methods is essential for elucidating the predictions of neural network models, which are commonly of a black-box nature. Layer-wise Relevance Propagation (LRP) stands out as a well-established method that transparently traces the flow of a model’s prediction backward through its architecture by backpropagating relevance scores. However, LRP has not fully considered the existence of a skip connection, and its application to the widely used ResNet architecture has not been thoroughly explored. In this study, we extend LRP to the ResNet models by introducing relevance splitting at a point where outputs from a skip connection and a residual block converge. Moreover, our formulation ensures that the conservation property is maintained throughout the process, thereby preserving the integrity of the generated explanations. To evaluate the effectiveness of our approach, we conduct the experiments on ImageNet and CUB dataset. Our method demonstrated superior performance compared to baseline methods in standard evaluation metrics such as Insertion-Deletion score while maintaining its conservation property.


# 23
Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning

Chongyu Fan · Jiancheng Liu · Alfred Hero · Sijia Liu

The trustworthy machine learning (ML) community is increasingly recognizing the crucial need for models capable of selectively 'unlearning' data points after training. This leads to the problem of machine unlearning (MU), aiming to eliminate the influence of chosen data points on model performance, while still maintaining the model's utility post-unlearning. Despite various MU methods for data influence erasure, evaluations have largely focused on random data forgetting, ignoring the vital inquiry into which subset should be chosen to truly gauge the authenticity of unlearning performance. To tackle this issue, we introduce a new evaluative angle for MU from an adversarial viewpoint. We propose identifying the data subset that presents the most significant challenge for influence erasure, i.e., pinpointing the worst-case forget set. Utilizing a bi-level optimization principle, we amplify unlearning challenges at the upper optimization level to emulate worst-case scenarios, while simultaneously engaging in standard training and unlearning at the lower level, achieving a balance between data influence erasure and model utility. Our proposal offers a worst-case evaluation of MU's resilience and effectiveness. Through extensive experiments across different datasets (including CIFAR-10, 100, CelebA, Tiny ImageNet, and ImageNet) and models (including both image classifiers and generative models), we expose critical pros and cons in existing (approximate) unlearning strategies. Our results illuminate the complex challenges of MU in practice, guiding the future development of more accurate and robust unlearning algorithms.


# 20
Strong Double Blind
Training A Secure Model against Data-Free Model Extraction

Zhenyi Wang · Li Shen · junfeng guo · Tiehang Duan · Siyu Luan · Tongliang Liu · Mingchen Gao

The objective of data-free model extraction (DFME) is to acquire a pre-trained black-box model solely through query access, without any knowledge of the training data used for the victim model. Defending against DFME presents a formidable challenge because the distribution of attack query data and the attacker's strategy remain undisclosed to the defender beforehand. However, existing defense methods: (1) are computational and memory inefficient; or (2) can only provide evidence of model theft after the model has already been stolen. To address these limitations, we thus propose a defensive training method, named, Attack-Aware and Uncertainty-Guided (AAUG) defense method against DFME. AAUG is designed to effectively thwart DFME attacks while concurrently enhancing deployment efficiency. The key strategy involves introducing random weight perturbations to the victim model's weights during predictions for various inputs. During defensive training, the weights perturbations are maximized on simulated out-of-distribution (OOD) data to heighten the challenge of model theft, while being minimized on in-distribution (ID) training data to preserve model utility. Additionally, we formulate an attack-aware defensive training objective function, reinforcing the model's resistance to theft attempts. Extensive experiments on defending against both soft-label and hard-label DFME attacks demonstrate the effectiveness of AAUG. In particular, AAUG significantly reduces the accuracy of the clone model and is substantially more efficient than existing defense methods.


# 36
Strong Double Blind
CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks

Hao Fang · Jiawei Kong · Bin Chen · Tao Dai · Hao Wu · Shu-Tao Xia

Transferable targeted adversarial attacks aim to mislead models into outputting adversary-specified predictions in black-box scenarios, raising significant security concerns. Recent studies have introduced \textit{single-target} generative attacks that train a generator for each target class to generate highly transferable perturbations, resulting in substantial computational overhead when handling multiple classes. \textit{Multi-target} attacks address this by training only one class-conditional generator for multiple classes. However, the generator simply uses class labels as conditions, failing to leverage the rich semantic information of the target class. To this end, we design a \textbf{C}LIP-guided \textbf{G}enerative \textbf{N}etwork with \textbf{C}ross-attention modules (CGNC) to enhance multi-target attacks by incorporating textual knowledge of CLIP into the generator. Extensive experiments demonstrate that CGNC yields significant improvements over previous multi-target generative attacks, e.g., a 21.46\% improvement in success rate from ResNet-152 to DenseNet-121. Moreover, we propose a masked fine-tuning mechanism to further strengthen our method in attacking a single class, which surpasses existing single-target methods.


# 38
Strong Double Blind
Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection

Youheng Sun · Shengming Yuan · Xuanhan Wang · Lianli Gao · Jingkuan Song

Targeted adversarial attack, which aims to mislead a model to recognize any image as a target object by imperceptible perturbations, has become a mainstream tool for vulnerability assessment of deep neural networks (DNNs). Since existing targeted attackers only learn to attack known target classes, they cannot generalize well to unknown classes.To tackle this issue, we propose Generalized Adversarial attacKER (GAKer), which is able to construct adversarial examples to any target class. The core idea behind GAKer is to craft a latently infected representation during adversarial example generation. To this end, the extracted latent representations of the target object are first injected into intermediate features of an input image in an adversarial generator. Then, the generator is optimized to ensure visual consistency with the input image while being close to the target object in the feature space. Since the GAKer is class-agnostic yet model-agnostic, it can be regarded as a general tool that not only reveals the vulnerability of more DNNs but also identifies deficiencies of DNNs in a wider range of classes. Extensive experiments have demonstrated the effectiveness of our proposed method in generating adversarial examples for both known and unknown classes. Notably, compared with other generative methods, our method achieves an approximate 14.13% higher attack success rate for unknown classes and an approximate 4.23% higher success rate for known classes.


# 37
Strong Double Blind
Leveraging Imperfect Restoration for Data Availability Attack

YI HUANG · Jeremy Styborski · Mingzhi Lyu · Fan Wang · Wai-Kin Adams Kong

The abundance of online data is at risk of unauthorized usage in training deep learning models. To counter this, various Data Availability Attacks (DAAs) have been devised to make data unlearnable for such models by subtly perturbing the training data. However, existing methods often excel in either Supervised Learning (SL) or Self-Supervised Learning (SSL) scenarios. Among these, a model-free approach that generates a Convolution-based Unlearnable Dataset (CUDA) stands out as the most robust DAAs across both SSL and SL. Nonetheless, CUDA's effectiveness on SSL remains inadequate and it faces a trade-off between image quality and its poisoning effect. In this paper, we conduct a theoretical analysis of CUDA, uncovering the sub-optimal gradients it introduces and elucidating the strategy it employs to induce class-wise bias for data poisoning. Building on this, we propose a novel poisoning method named Imperfect Restoration Poisoning (IRP), aiming to preserve high image quality while achieving strong poisoning effects. Through extensive comparisons of IRP with eight baselines across SL and SSL, coupled with evaluations alongside five representative defense methods, we showcase the superiority of IRP.


# 45
Strong Double Blind
Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs

Shuchao Pang · Ruhao Ma · Bing Li · Yongbin Zhou · Yazhou Yao

Privacy laws like GDPR necessitate effective approaches to safeguard data privacy. Existing works on data privacy protection of DNNs mainly concentrated on the model training phase. However, these approaches become impractical when dealing with the outsourcing of sensitive data. Furthermore, they have encountered significant challenges in balancing the utility-privacy trade-off. How can we generate privacy-preserving surrogate data suitable for usage or sharing without a substantial performance loss? In this paper, we concentrate on a realistic scenario, where sensitive data must be entrusted to a third party for the development of a deep learning service. We introduce a straightforward yet highly effective framework for the practical protection of privacy in visual data via veiled examples. Our underlying concept involves segregating the privacy information present in images into two distinct categories: the privacy information perceptible at the human visual level (i.e., Human-perceptible Info) and the latent features recognized by DNN models during training and prediction (i.e., DNN-perceptible Info). Compared with the original data, the veiled data conserves the latent features while obfuscating the visual privacy information. Just like a learnable veil that is usable for DNNs but invisible for humans, the veiled data can be used for training and prediction of models. More importantly, models trained with the veiled data can effectively recognize the original data. Extensive evaluations of various datasets and models show the effectiveness and security of the Veil Privacy framework.


# 41
Strong Double Blind
Augmented Neural Fine-tuning for Efficient Backdoor Purification

Md Nazmul Karim · Abdullah Al Arafat · Umar Khalid · Zhishan Guo · Nazanin Rahnavard

Recent studies have revealed the vulnerability of deep neural networks (DNNs) to various backdoor attacks, where the behavior of DNNs can be compromised by utilizing certain types of triggers or poisoning mechanisms. State-of-the-art (SOTA) defenses employ too-sophisticated mechanisms that require either a computationally expensive adversarial search module for reverse-engineering the trigger distribution or an over-sensitive hyper-parameter selection module. Moreover, they offer sub-par performance in challenging scenarios, e.g., limited validation data and strong attacks. In this paper, we propose--{\em Neural mask Fine-Tuning (NFT)}-- with an aim to optimally re-organize the neuron activities in a way that the effect of the backdoor is removed. {\color{black}Utilizing a simple data augmentation like MixUp, NFT relaxes the trigger synthesis process and eliminates the requirement of the adversarial search module.} Our study further reveals that direct weight fine-tuning under limited validation data results in poor post-purification clean test accuracy, primarily due to \emph{overfitting issue}. To overcome this, we propose to fine-tune neural masks instead of model weights. In addition, a \emph{mask regularizer} has been devised to further mitigate the model drift during the purification process. The distinct characteristics of NFT render it highly efficient in both runtime and sample usage, as it can remove the backdoor even when a single sample is available from each class. We validate the effectiveness of NFT through extensive experiments covering the tasks of image classification, object detection, video action recognition, 3D point cloud, and natural language processing. We evaluate our method against 14 different attacks (LIRA, WaNet, etc.) on 11 benchmark data sets (ImageNet, UCF101, Pascal VOC, ModelNet, OpenSubtitles2012, etc).