


Poster Session 5

Exhibition Area
Thu 3 Oct 1:30 a.m. PDT — 3:30 a.m. PDT


# 327
Strong Double Blind
Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering

Francesco Di Sario · Riccardo Renzulli · Marco Grangetto · Enzo Tartaglione

Since the introduction of NeRFs, considerable attention has been focused on improving their training and inference times, leading to the development of Fast-NeRF models. Despite demonstrating impressive rendering speed and quality, the rapid convergence of such models poses challenges for further enhancing reconstruction quality. Common strategies to improve rendering quality involve augmenting model parameters or increasing the number of sampled points. However, these computationally intensive approaches encounter limitations in achieving significant quality enhancements. This study introduces a model-agnostic framework inspired by Sparsely-Gated Mixture of Experts to enhance rendering quality without escalating computational complexity. Our approach enables specialization in rendering different scene components by employing a mixture of experts with varying resolutions. We present a novel gate formulation designed to maximize expert capabilities and propose a resolution-based routing technique to effectively induce sparsity and decompose scenes. Our work significantly enhances reconstruction quality while maintaining competitive performance.
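As a rough illustration of the kind of sparsely-gated design the abstract describes, the sketch below routes per-point features to experts of different capacities through a top-k gate. All module names, dimensions, and the use of MLP width as a stand-in for expert "resolution" are our own assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): a sparsely-gated mixture of experts
# over per-point NeRF features, where each expert stands in for a different resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResolutionExpert(nn.Module):
    """Hypothetical expert: an MLP whose hidden width stands in for 'resolution'."""
    def __init__(self, in_dim, hidden_dim, out_dim=4):  # RGB + density
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class SparseMoERenderer(nn.Module):
    def __init__(self, in_dim=32, hidden_dims=(32, 64, 128, 256), top_k=1):
        super().__init__()
        self.experts = nn.ModuleList(ResolutionExpert(in_dim, h) for h in hidden_dims)
        self.gate = nn.Linear(in_dim, len(hidden_dims))
        self.top_k = top_k

    def forward(self, x):                      # x: (N, in_dim) point features
        logits = self.gate(x)                  # (N, E)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # sparse routing weights
        out = torch.zeros(x.shape[0], 4, device=x.device)
        for e, expert in enumerate(self.experts):
            mask = (idx == e).any(dim=-1)      # points routed to expert e
            if mask.any():
                w = weights[mask][idx[mask] == e].unsqueeze(-1)
                out[mask] += w * expert(x[mask])
        return out

points = torch.randn(1024, 32)
rgb_sigma = SparseMoERenderer()(points)        # (1024, 4)
```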


# 152
Strong Double Blind
Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities

Xu Zheng · Yuanhuiyi Lyu · LIN WANG

Image modality is not perfect, as it often fails in certain conditions, e.g., night and fast motion. This significantly limits the robustness and versatility of existing multi-modal (i.e., Image+X) semantic segmentation methods when confronting modality absence or failure, as often occurs in real-world applications. Inspired by the open-world learning capability of multi-modal vision-language models (MVLMs), we explore a new direction in learning the modality-agnostic representation via knowledge distillation (KD) from MVLMs. Intuitively, we propose Any2Seg, a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions. Specifically, we first introduce a novel language-guided semantic correlation distillation (LSCD) module to transfer both inter-modal and intra-modal semantic knowledge in the embedding space from MVLMs, e.g., LanguageBind. This enables us to minimize the modality gap and alleviate semantic ambiguity to combine any modalities in any visual conditions. Then, we introduce a modality-agnostic feature fusion (MFF) module that reweights the multi-modal features based on the inter-modal correlation and selects fine-grained features. This way, our Any2Seg finally yields an optimal modality-agnostic representation. Extensive experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves state-of-the-art performance under the multi-modal setting (+3.54 mIoU) and excels in the challenging modality-incomplete setting (+19.79 mIoU).


# 147
Diffusion Models for Open-Vocabulary Segmentation

Laurynas Karazija · Iro Laina · Andrea Vedaldi · Christian Rupprecht

Open-vocabulary segmentation is the task of segmenting anything that can be named in an image. Recently, large-scale vision-language modelling has led to significant advances in open-vocabulary segmentation, but at the cost of gargantuan and increasing training and annotation efforts. Hence, we ask if it is possible to use existing foundation models to synthesise on-demand efficient segmentation algorithms for specific class sets, making them applicable in an open-vocabulary setting without the need to collect further data, annotations or perform training. To that end, we present OVDiff, a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. OVDiff synthesises support image sets for arbitrary textual categories, creating for each a set of prototypes representative of both the category and its surrounding context (background). It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training. Our approach shows strong performance on a range of benchmarks, obtaining a lead of more than 5% over prior work on PASCAL VOC.
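To make the prototype idea concrete, here is a minimal sketch of segmenting a query feature map by cosine similarity to category and context prototypes averaged from support features. The function names, tensor shapes, and the way masks are obtained are hypothetical; OVDiff's actual pipeline (diffusion-synthesized supports, frozen feature extractors, prototype filtering) is richer than this.

```python
# Minimal sketch (assumptions, not OVDiff itself): prototype-based open-vocabulary
# segmentation. Support features are averaged into foreground/background prototypes
# and a query feature map is labeled by cosine similarity to the nearest prototype.
import torch
import torch.nn.functional as F

def build_prototypes(support_feats, support_masks):
    """support_feats: (S, C, H, W) frozen-encoder features of synthesized support images.
    support_masks: (S, 1, H, W) rough foreground masks (e.g., from attention maps)."""
    fg = (support_feats * support_masks).sum(dim=(0, 2, 3)) / support_masks.sum().clamp(min=1e-6)
    bg = (support_feats * (1 - support_masks)).sum(dim=(0, 2, 3)) / (1 - support_masks).sum().clamp(min=1e-6)
    return torch.stack([fg, bg])               # (2, C): category and context prototypes

def segment(query_feats, prototypes):
    """query_feats: (C, H, W); prototypes: (P, C). Returns (H, W) prototype indices."""
    q = F.normalize(query_feats.flatten(1), dim=0)          # (C, H*W)
    p = F.normalize(prototypes, dim=1)                       # (P, C)
    sim = p @ q                                              # (P, H*W)
    return sim.argmax(dim=0).view(query_feats.shape[1:])     # (H, W)

support_feats = torch.randn(8, 64, 32, 32)
support_masks = (torch.rand(8, 1, 32, 32) > 0.5).float()
protos = build_prototypes(support_feats, support_masks)
labels = segment(torch.randn(64, 32, 32), protos)
```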


# 149
Strong Double Blind
Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Siyu Jiao · hongguang Zhu · Yunchao Wei · Yao Zhao · Jiannan Huang · Humphrey Shi

Pre-trained vision-language models, e.g., CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either freezing CLIP during training to unilaterally maintain its zero-shot capability, or fine-tuning the CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. To address this, we propose the Content-Dependent Transfer to adaptively enhance each text embedding by interacting with the input image, which presents a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy, reviewing the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish the collaborative vision-text optimizing mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU, respectively on A-847, A-150, PC-459, PC-59 and PAS-20. Furthermore, in a panoptic setting on the ADE20K dataset, we achieve the performance of 27.1 PQ, 73.5 SQ, and 32.9 RQ.


# 116
CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model

Aoran Xiao · Weihao Xuan · Heli Qi · Yun Xing · Ruijie Ren · Xiaoqin Zhang · Ling Shao · Shijian Lu

The Segment Anything Model (SAM) has demonstrated remarkable zero-shot capability and flexible geometric prompting in general image segmentation. However, it often struggles in domains that are either sparsely represented or lie outside its training distribution, such as aerial, medical, and non-RGB images. Recent efforts have predominantly focused on adapting SAM to these domains using fully supervised methods, which necessitate large amounts of annotated training data and pose practical challenges in data collection. This paper presents CAT-SAM, a ConditionAl Tuning network that explores few-shot adaptation of SAM toward various challenging downstream domains in a data-efficient manner. The core design is a prompt bridge structure that enables decoder-conditioned joint tuning of the heavyweight image encoder and the lightweight mask decoder. The bridging maps the domain-specific features of the mask decoder to the image encoder, fostering synergistic adaptation of both components with mutual benefits using only few-shot target samples, ultimately leading to superior segmentation in various downstream tasks. We develop two CAT-SAM variants that adopt two tuning strategies for the image encoder: one injecting learnable prompt tokens in the input space and the other inserting lightweight adapter networks. Extensive experiments over 11 downstream tasks show that CAT-SAM achieves superior segmentation consistently even under the very challenging one-shot adaptation setup. Code will be available.


# 111
Strong Double Blind
Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels

Yuan Gao · Zilei Wang · Yixin Zhang · Bohai Tu

Unsupervised Domain Adaptation (UDA) for semantic segmentation has been widely studied to exploit label-rich source data to assist the segmentation of unlabeled samples in the target domain. Despite these efforts, UDA performance remains far below that of fully-supervised models owing to the lack of target annotations. To this end, we propose an efficient superpixel-level active learning method for domain adaptive semantic segmentation that maximizes segmentation performance by automatically querying a small number of superpixels for labeling. To conserve annotation resources, we propose a novel low-uncertainty superpixel fusion module which amalgamates superpixels possessing low-uncertainty features based on feature affinity, thereby ensuring high-quality fusion of superpixels. As for the acquisition strategy, our method takes into account two types of information-rich superpixels: large-size superpixels with substantial information content, and superpixels with the greatest value for domain adaptation learning. Further, we employ cross-domain mixing and pseudo-labeling with consistency regularization to address the domain shift and label noise problems, respectively. Extensive experimentation demonstrates that our proposed superpixel-level method utilizes a limited budget more efficiently than previous pixel-level techniques and surpasses state-of-the-art methods at 40x lower cost.


# 205
Strong Double Blind
ActionVOS: Actions as Prompts for Video Object Segmentation

LIANGYANG OUYANG · Ruicong Liu · Yifei Huang · Ryosuke Furuta · Yoichi Sato

Delving into the realm of egocentric vision, the advancement of referring video object segmentation (RVOS) stands as pivotal in understanding human activities. However, the existing RVOS task primarily relies on static attributes such as object names to segment target objects, posing challenges in distinguishing target objects from background objects and in identifying objects undergoing state changes. To address these problems, this work proposes a novel action-aware RVOS setting called ActionVOS, aiming at segmenting only active objects in egocentric videos using human actions as a key language prompt. This is because human actions precisely describe the behavior of humans, thereby helping to identify the objects truly involved in the interaction and to understand possible state changes. We also build a method tailored to work under this specific setting. Specifically, we develop an action-aware labeling module with an efficient action-guided focal loss. Such designs enable the ActionVOS model to prioritize active objects using existing, readily available annotations. Experimental results on the VISOR dataset reveal that ActionVOS significantly reduces the mis-segmentation of inactive objects, confirming that actions help the ActionVOS model understand objects' involvement. Further evaluations on the VOST and VSCOS datasets show that the novel ActionVOS setting enhances segmentation performance when encountering challenging circumstances involving object state changes.


# 146
Strong Double Blind
WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models

xinjian wu · Ruisong Zhang · Jie Qin · Shijie Ma · Cheng-Lin Liu

Segmenting and recognizing a diverse range of object parts is crucial in various computer vision and robotic applications. While object segmentation has made significant progress, part-level segmentation remains an under-explored issue. Part segmentation entails discerning complex boundaries between parts, and the scarcity of annotated data further complicates the task. To tackle this problem, in this paper, we propose a novel Weakly-supervised Part Segmentation (WPS) setting and an approach called WPS-SAM, built on the large-scale pre-trained vision foundation model, Segment Anything Model (SAM). WPS-SAM is an end-to-end framework designed to extract prompt tokens directly from images and perform pixel-level segmentation of part regions. During its training phase, it only utilizes weakly supervised labels in the form of bounding boxes or points. Extensive experiments demonstrate that, through exploiting the rich knowledge embedded in pre-trained foundation models, WPS-SAM outperforms other segmentation models trained with pixel-level strong annotations. Specifically, WPS-SAM achieves 68.93% mIOU and 79.53% mACC on the PartImageNet dataset, surpassing state-of-the-art fully supervised methods by approximately 4% in terms of mIOU.


# 16
Strong Double Blind
A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability

Linfeng Ma · Han Fang · Tianyi Wei · Zijin Yang · Zehua Ma · Weiming Zhang · Nenghai Yu

Robustness is the most important property of watermarking schemes. In practice, the watermarking mechanism shall be robust to both geometric and non-geometric distortions. In deep learning-based watermarking frameworks, robustness can be ensured by end-to-end training with different noise layers. However, most current CNN-based watermarking frameworks, even trained with targeted distortions, cannot adapt well to geometric distortions due to their architectural design. Since the traditional convolutional layer's position structure is relatively fixed, it lacks the flexibility to capture the influence of geometric distortion, making it difficult to train for the corresponding robustness. To address such limitations, we propose a Swin Transformer and Deformable Convolutional Network (DCN)-based watermark model backbone. The attention mechanism and the deformable convolutional window effectively improve the feature processing flexibility, greatly enhancing the robustness, especially for geometric distortions. Besides, to improve generalizability to a broader range of non-geometric distortions, we also provide a distortion-style-ensembled noise layer, including an image encoder, an image decoder, and distortion-style layers that can effectively simulate the styles of different kinds of distortions. In the final watermark model training stage, we can simply train with our proposed noise layer for overall robustness. Extensive experiments illustrate that, compared to existing state-of-the-art (SOTA) works, our method achieves clearly superior performance under the tested geometric distortions with better visual quality, reaching 100.00% watermark extraction accuracy in almost all cases. Relevant codes will be released after review.


# 106
Strong Double Blind
COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation

Liu He · Daniel Aliaga

The generation of large-scale urban layouts has garnered substantial interest across various disciplines. Prior methods have utilized procedural generation requiring manual rule coding or deep learning needing abundant data. However, prior approaches have not considered the context-sensitive nature of urban layout generation. Our approach addresses this gap by leveraging a canonical graph representation for the entire city, which facilitates scalability and captures the multi-layer semantics inherent in urban layouts. We introduce a novel graph-based masked autoencoder (GMAE) for city-scale urban layout generation. The method encodes attributed buildings, city blocks, communities and cities into a unified graph structure, enabling self-supervised masked training for graph autoencoder. Additionally, we employ scheduled iterative sampling for 2.5D layout generation, prioritizing the generation of important city blocks and buildings. Our approach achieves good realism, semantic consistency, and correctness across the heterogeneous urban styles in 330 US cities. Codes and datasets are released at: https://github.com/Arking1995/COHO.


# 180
Strong Double Blind
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Tien Toan Nguyen · Minh Nhat Nhat Vu · Baoru Huang · An Dinh Vuong · Quan Vuong · Ngan Le · Thieu Vo · Anh Nguyen

6-DoF grasp detection has been a fundamental and challenging problem in robotic vision. While previous works have focused on ensuring grasp stability, they often do not consider human intention conveyed through natural language, hindering effective collaboration between robots and users in complex 3D environments. In this paper, we present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds. We first introduce Grasp-Anything-6D, a large-scale dataset for the language-driven 6-DoF grasp detection task with 1M point cloud scenes and more than 200M language-associated 3D grasp poses. We further introduce a novel diffusion model that incorporates a new negative prompt guidance learning strategy. The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones given the language input. Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language. Intensive experimental results show the effectiveness of our method in both benchmarking experiments and real-world scenarios, surpassing other baselines. In addition, we demonstrate the practicality of our approach in real-world robotic applications.


# 153
Strong Double Blind
Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views

Ningli Xu · Rongjun Qin

Predicting realistic ground views from satellite imagery in urban scenes is a challenging task due to the significant view gaps between satellite and ground-view images. We propose a novel pipeline to tackle this challenge by generating geospecific views that maximally respect the weak geometry and texture from multi-view satellite images. Different from existing approaches that hallucinate images from cues such as partial semantics or geometry from overhead satellite images, our method directly predicts ground-view images at a geolocation by using a comprehensive set of information from the satellite image, resulting in ground-level images with a resolution boost by a factor of ten or more. We leverage a novel building refinement method to reduce geometric distortions in satellite data at ground level, which ensures the creation of accurate conditions for view synthesis using diffusion networks. Moreover, we propose a novel geospecific prior, which prompts distribution learning of diffusion models to respect image samples that are closer to the geolocation of the predicted images. We demonstrate that our pipeline is the first to generate close-to-real and geospecific ground views based merely on satellite images. Codes and data will be shared.


# 208
Strong Double Blind
MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery

Pei Zhou · Yanchao Yang

We aim to discover manipulation concepts embedded in the unannotated demonstrations, which are recognized as key physical states. The discovered concepts can facilitate training manipulation policies and promote generalization. Current methods relying on multimodal foundation models for deriving key states usually lack accuracy and semantic consistency due to limited multimodal robot data. In contrast, we introduce an information-theoretic criterion to characterize the regularities that signify a set of physical states. We also develop a framework that trains a concept discovery network using this criterion, thus bypassing the dependence on human semantics and alleviating costly human labeling. The proposed criterion is based on the observation that key states, which deserve to be conceptualized, often admit more physical constraints than non-key states. This phenomenon can be formalized as maximizing the mutual information between the putative key state and its preceding state, i.e., Maximal Mutual Information (MaxMI). By employing MaxMI, the trained key state localization network can accurately identify states of sufficient physical significance, exhibiting reasonable semantic compatibility with human perception. Furthermore, the proposed framework produces key states that lead to concept-guided manipulation policies with higher success rates and better generalization in various robotic tasks compared to the baselines, verifying the effectiveness of the proposed criterion.
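One way to write down the criterion sketched in the abstract is given below; the notation (states $s_t$, lag $\tau$) is ours and may not match the paper's exact formulation.

```latex
% A plausible formalization (our reading of the abstract, not the paper's notation):
% a state s_t is scored as a key-state candidate by the mutual information it shares
% with its preceding state s_{t-\tau}, and the localization network picks maximizers.
\[
\mathrm{MaxMI}:\quad
t^{\star} \;=\; \arg\max_{t}\; I\bigl(s_{t};\, s_{t-\tau}\bigr)
\;=\; \arg\max_{t}\; \mathbb{E}_{p(s_{t}, s_{t-\tau})}
\left[\log \frac{p(s_{t}, s_{t-\tau})}{p(s_{t})\,p(s_{t-\tau})}\right]
\]
```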


# 32
Faceptor: A Generalist Model for Face Perception

Lixiong Qin · Mei Wang · Xuannan Liu · Yuhang Zhang · Wei Deng · Xiaoshuai Song · Weiran Xu · Weihong Deng

With the comprehensive research conducted on various face analysis tasks, there is a growing interest among researchers to develop a unified approach to face perception. Existing methods mainly discuss unified representation and training, which lack task extensibility and application efficiency. To tackle this issue, we focus on the unified model structure, exploring a face generalist model. As an intuitive design, Naive Faceptor enables tasks with the same output shape and granularity to share the structural design of the standardized output head, achieving improved task extensibility. Furthermore, Faceptor is proposed to adopt a well-designed single-encoder dual-decoder architecture, allowing task-specific queries to represent new-coming semantics. This design enhances the unification of model structure while improving application efficiency in terms of storage overhead. Additionally, we introduce Layer-Attention into Faceptor, enabling the model to adaptively select features from optimal layers to perform the desired tasks. Through joint training on 13 face perception datasets, Faceptor achieves exceptional performance in facial landmark localization, face parsing, age estimation, expression recognition, binary attribute classification, and face recognition, achieving or surpassing specialized methods in most tasks. Our training framework can also be applied to auxiliary supervised learning, significantly improving performance in data-sparse tasks such as age estimation and expression recognition. The code and model will be made publicly available.


# 218
Strong Double Blind
Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking

Jikai Zheng · Mingjiang Liang · Shaoli Huang · Jifeng Ning

Recent advancements in transformer-based light-weight object tracking have set new standards across various benchmarks due to their efficiency and effectiveness. Despite these achievements, most current trackers rely heavily on pre-existing object detection architectures without optimizing the backbone network to leverage the unique demands of object tracking. Addressing this gap, we introduce the Feature Extraction and Relation Modeling Tracker (FERMT) - a novel approach that significantly enhances tracking speed and accuracy. At the heart of FERMT is a strategic decomposition of the conventional attention mechanism into four distinct sub-modules within a one-stream tracker. This design stems from our insight that the initial layers of a tracking network should prioritize feature extraction, whereas the deeper layers should focus on relation modeling between objects. Consequently, we propose an innovative, light-weight backbone specifically tailored for object tracking. Our approach is validated through meticulous ablation studies, confirming the effectiveness of our architectural decisions. Furthermore, FERMT incorporates a Dual Attention Unit for feature pre-processing, which facilitates global feature interaction across channels and enriches feature representation with attention cues. Benchmarking on GOT-10k, FERMT achieves a groundbreaking Average Overlap (AO) score of 69.6%, outperforming the leading real-time trackers by 5.6% in accuracy while boasting a 54% improvement in CPU tracking speed. This work not only sets a new standard for state-of-the-art (SOTA) performance in light-weight tracking but also bridges the efficiency gap between fast and high-performance trackers. The code and model will be available soon.


# 145
Strong Double Blind
Learning Multimodal Latent Generative Models with Energy-Based Prior

Shiyu Yuan · Jiali Cui · Hanao Li · Tian Han

Multimodal models have gained increasing popularity recently. Many works have been proposed to learn representations for different modalities. The representation can capture shared information from these domains, leading to more coherent joint and cross-modal generation. However, these works mainly consider a standard Gaussian or Laplacian as their prior distribution, and it can be challenging for such uni-modal, non-informative distributions to capture all the information from multiple data types. Meanwhile, energy-based models (EBMs) have shown their effectiveness in multiple tasks due to their expressiveness and flexibility, but their capacity for multimodal generative models has yet to be explored. In this paper, we propose a novel framework to train multimodal latent generative models together with energy-based models. The proposed method leads to a more expressive and informative prior which can better capture the information within multiple modalities. Our experiments show that our model is effective, increasing generation coherence and latent classification performance on different multimodal datasets.
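For readers unfamiliar with energy-based priors, the following sketch shows the generic recipe of a latent EBM prior tilted against a standard Gaussian and sampled with short-run Langevin dynamics. The network sizes, step counts, and the exact form of the prior are assumptions for illustration, not the authors' training procedure.

```python
# Minimal sketch (our assumption of the general recipe, not the paper's code):
# an energy-based latent prior p(z) ∝ exp(-E_θ(z)) N(z; 0, I), sampled with
# short-run Langevin dynamics before decoding into each modality.
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(16, 128), nn.SiLU(), nn.Linear(128, 1))

def langevin_sample(n, dim=16, steps=60, step_size=0.1):
    z = torch.randn(n, dim)
    for _ in range(steps):
        z = z.detach().requires_grad_(True)
        # Negative log-density of the tilted prior (up to a constant).
        neg_logp = energy(z).sum() + 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(neg_logp, z)[0]
        z = z - 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()

z_prior = langevin_sample(32)   # latents that would feed the modality decoders
```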


# 256
Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization

Jiayun Wang · Yubei Chen · Stella Yu

Self-supervised learning (SSL) has proven effective in learning high-quality representations for various downstream tasks, with a primary focus on semantic tasks. However, its application in geometric tasks remains underexplored, partially due to the absence of a standardized evaluation method for geometric representations. To address this gap, we introduce a novel pose-estimation benchmark for assessing SSL geometric representations, which demands training without semantic or pose labels and achieving proficiency in both semantic and geometric downstream tasks. On this benchmark, we study enhancing SSL geometric representations without sacrificing semantic classification accuracy. We find that leveraging mid-layer representations improves pose-estimation performance by 10-20%. Further, we introduce an unsupervised trajectory-regularization loss, which improves performance by an additional 4% and improves generalization ability on out-of-distribution data. We hope the proposed benchmark and methods offer new insights and improvements in self-supervised geometric representation learning.


# 120
Strong Double Blind
SINDER: Repairing the Singular Defects of DINOv2

Haoqi Wang · Tong Zhang · Mathieu Salzmann

Vision Transformer models trained on large-scale datasets, although effective, often exhibit artifacts in the patch tokens they extract. While such defects can be alleviated by re-training the entire model with additional classification tokens, the underlying reasons for the presence of these artifacts remain unclear. In this paper, we conduct a thorough investigation of this phenomenon, combining theoretical analysis with empirical observations. Our findings reveal that these artifacts originate from the pre-trained network itself, specifically stemming from the leading left singular vector of the network's weights. Furthermore, to mitigate these defects, we propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset, thereby avoiding the need for complete re-training. We validate our method on various downstream tasks, including unsupervised segmentation, classification, and supervised segmentation, demonstrating its effectiveness in improving model performance. Our code and checkpoints will be released.
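Purely as an illustration of the "leading left singular vector" diagnosis, the snippet below extracts that vector from a layer's weight matrix and measures how much of the feature energy aligns with it. The penalty shown is a hypothetical stand-in; it is not the smooth regularization proposed in the paper.

```python
# Illustration only (our assumption, not SINDER's actual regularizer): inspect the
# leading left singular vector of a transformer linear layer and quantify how much
# feature energy concentrates along that direction.
import torch

def leading_left_singular_vector(weight):
    # weight: (out_dim, in_dim); U[:, 0] is the leading left singular vector.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, 0], S[0]

def singular_alignment_penalty(features, weight):
    """features: (N, out_dim) activations produced by the layer with this weight."""
    u, _ = leading_left_singular_vector(weight)
    proj = features @ u                          # (N,) component along the defect direction
    return (proj ** 2).mean() / (features ** 2).mean().clamp(min=1e-8)

W = torch.randn(384, 384)
feats = torch.randn(1024, 384)
penalty = singular_alignment_penalty(feats, W)
```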


# 148
Strong Double Blind
Emergent Visual-Semantic Hierarchies in Image-Text Representations

Morris Alper · Hadar Averbuch-Elor

While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing text and images in a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts which may describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image-text representations, constructed automatically via large language models. Our results show that foundation VLMs exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed for this purpose. Furthermore, we show that foundation models may be better aligned to hierarchical reasoning via a text-only fine-tuning phase, while retaining pretraining knowledge. We will release our data, code, and trained models.


# 181
Strong Double Blind
PiTe: Pixel-Temporal Alignment for Large Video-Language Model

Yang Liu · Pengxiang Ding · Siteng Huang · Min Zhang · Han Zhao · Donglin Wang

Fueled by the wave of Large Language Models (LLMs), Large Visual-Language Models (LVLMs) have emerged as a pivotal advancement, bridging the gap between image and text. However, video remains challenging for LVLMs due to the complexity of the relationship between language and the spatial-temporal data structure. Recent Large Video-Language Models (LVidLMs) align features of static visual data such as images into the latent space of language features through general multi-modal tasks, in order to leverage the abilities of LLMs sufficiently. In this paper, we explore a fine-grained alignment approach via object trajectories across both the spatial and temporal dimensions simultaneously. We thus propose a novel LVidLM based on trajectory-guided Pixel-Temporal Alignment, dubbed PiTe, that exhibits promising applicable model properties. To achieve fine-grained video-language alignment, we curate a multi-modal pre-training dataset, PiTe-143k, which provides pixel-level moving trajectories for all individual objects that appear and are mentioned in both the video and the caption, produced by our automatic annotation pipeline. Meanwhile, PiTe demonstrates astounding capabilities on a myriad of video-related multi-modal tasks, beating the state-of-the-art methods by a large margin.


# 161
Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

Yi Wang · Conrad M Albrecht · Nassim Ait Ali Braham · Chenying Liu · Zhitong Xiong · Xiao Xiang Zhu

The increasing availability of multi-sensor data sparks interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities. Meanwhile, a simple residual deformable attention is introduced to help the model focus on modality-informative features. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), and demonstrate its consistent and significant improvement for both multimodal and modality-missing settings. With thorough experiments and comprehensive analysis, we hope this work can provide insights and raise more interest in researching the hidden relationships of multimodal representations.
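A minimal sketch of the decoupling idea, under the assumption that it follows a redundancy-reduction (Barlow-Twins-style) recipe: each modality's embedding is split into common and unique dimensions, the common halves are aligned across modalities and the unique halves are decorrelated. The dimension split, loss weights, and exact loss terms are illustrative guesses, not the released DeCUR code.

```python
# Minimal sketch (a Barlow-Twins-style reading of the abstract, not the released code):
# split each modality's embedding into 'common' and 'unique' halves, align the common
# halves across modalities and decorrelate the unique halves.
import torch

def cross_correlation(a, b):
    a = (a - a.mean(0)) / (a.std(0) + 1e-6)
    b = (b - b.mean(0)) / (b.std(0) + 1e-6)
    return (a.T @ b) / a.shape[0]               # (D, D) cross-correlation matrix

def decur_style_loss(z1, z2, d_common, lam=5e-3):
    c1, u1 = z1[:, :d_common], z1[:, d_common:]
    c2, u2 = z2[:, :d_common], z2[:, d_common:]
    cc = cross_correlation(c1, c2)              # push toward identity
    cu = cross_correlation(u1, u2)              # push toward zero
    eye = torch.eye(d_common, device=z1.device)
    common = ((cc - eye) ** 2).diagonal().sum() + lam * ((cc * (1 - eye)) ** 2).sum()
    unique = (cu ** 2).sum()
    return common + lam * unique

z_radar, z_optical = torch.randn(256, 128), torch.randn(256, 128)
loss = decur_style_loss(z_radar, z_optical, d_common=64)
```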


# 150
Denoising Vision Transformers

Jiawei Yang · Katie Luo · Jiefeng Li · Congyue Deng · Leonidas Guibas · Dilip Krishnan · Kilian Weinberger · Yonglong Tian · Yue Wang

We delve into a crucial yet often overlooked challenge inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as segmentation, depth prediction, and object discovery. We trace this fundamental issue down to the positional embeddings at the input stage. To address it, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight Transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our DVT does not require re-training the existing pre-trained ViTs and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and models will be released.
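The second stage reduces to a standard supervised regression problem; a minimal sketch under assumed token shapes is given below, with the stage-one neural-field outputs standing in as the supervision target. Module sizes and names are hypothetical.

```python
# Minimal sketch (assumed shapes, not the DVT release): a lightweight transformer block
# that maps raw ViT patch tokens to denoised tokens, supervised by the per-image
# clean-feature estimates obtained in stage one.
import torch
import torch.nn as nn

class FeatureDenoiser(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.out = nn.Linear(dim, dim)

    def forward(self, raw_tokens):               # (B, N, dim) raw ViT outputs
        return self.out(self.block(raw_tokens))

denoiser = FeatureDenoiser()
raw = torch.randn(2, 196, 768)                   # frozen-ViT patch tokens
clean_target = torch.randn(2, 196, 768)          # stage-one neural-field estimates
loss = torch.nn.functional.mse_loss(denoiser(raw), clean_target)
```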


# 344
Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Dogucan Yaman · Fevziye Irem Eyiokur Yaman · Leonard Bärmann · HAZIM KEMAL EKENEL · Alexander Waibel

Talking face generation aims to create a realistic video with accurate lip synchronization and high visual quality, using given audio and a reference video, while preserving identity and visual characteristics. In this paper, we start by identifying several issues with existing synchronization learning methods. These include unstable training as well as lip-synchronization and visual-quality issues caused by the lip-sync loss and SyncNet. We further tackle the lip-leaking problem stemming from the identity reference and propose a silent-lip generator, aiming to prevent lip leaking by changing the lips of the identity reference. We then introduce a stabilized synchronization loss and AVSyncNet to alleviate the problems caused by the lip-sync loss and SyncNet. Finally, we present an adaptive triplet loss to enhance visual quality and apply a post-processing technique to obtain high-quality videos. According to the experiments, our model outperforms state-of-the-art methods in both visual quality and lip synchronization. Comprehensive ablation studies further validate our individual contributions as well as their complementary effects.


# 342
ScanTalk: 3D Talking Heads from Unregistered Scans

Federico Nocentini · Thomas Besnier · Claudio Ferrari · Sylvain Arguillere · Stefano Berretti · Mohamed Daoudi

Speech-driven 3D talking heads generation has emerged as a significant area of interest among researchers, presenting numerous challenges. Existing methods are constrained to animating faces with fixed topologies, wherein point-wise correspondence is established, and the number and order of points remain consistent across all identities the model can animate. In this work, we present ScanTalk, a novel framework capable of animating 3D faces in arbitrary topologies including scanned data. Our approach relies on the DiffusionNet architecture to overcome the fixed topology constraint, offering promising avenues for more flexible and realistic 3D animations. By leveraging the power of DiffusionNet, ScanTalk not only adapts to diverse facial structures but also maintains fidelity when dealing with scanned data, thereby enhancing the authenticity and versatility of generated 3D talking heads. Through comprehensive comparisons with state-of-the-art methods, we validate the efficacy of our approach, demonstrating its capacity to generate realistic talking heads comparable to existing techniques. While our primary objective is to develop a generic method free from topological constraints, all state-of-the-art methodologies are bound by such limitations. Code for reproducing our results and the pre-trained model will be made available.


# 339
Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

Yu Deng · Duomin Wang · Baoyuan Wang

In this paper, we propose a novel learning approach for feed-forward one-shot 4D head avatar synthesis. Different from existing methods that often learn from reconstructing monocular videos guided by 3DMM, we employ pseudo multi-view videos to learn a 4D head synthesizer in a data-driven manner, avoiding reliance on inaccurate 3DMM reconstruction that could be detrimental to the synthesis performance. The key idea is to first learn a 3D head synthesizer using synthetic multi-view images to convert monocular real videos into multi-view ones, and then utilize the pseudo multi-view videos to learn a 4D head synthesizer via cross-view self-reenactment. By leveraging a simple vision transformer backbone with motion-aware cross-attentions, our method exhibits superior performance compared to previous methods in terms of reconstruction fidelity, geometry consistency, and motion control accuracy. We hope our method offers novel insights into integrating 3D priors with 2D supervisions for improved 4D head avatar creation.


# 341
Fast Registration of Photorealistic Avatars for VR Facial Animation

Chaitanya Patel · Shaojie Bai · Te-Li Wang · Jason Saragih · Shih-En Wei

Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, and hence the acquisition of labels for headset-mounted camera (HMC) images, captured while wearing a VR headset, needs to be efficient and accurate. This is challenging due to oblique camera views and differences in image modality. In this work, we first show that the domain gap between the avatar and HMC images is one of the primary sources of difficulty, where a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is re-introduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that takes in-domain inputs, and a generic avatar-guided image-to-image domain transfer module conditioned on current estimates. These two modules reinforce each other: domain transfer becomes easier when close-to-groundtruth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization and produces online registration of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over these baselines.


# 337
Strong Double Blind
MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

Yushuo Chen · Zerong Zheng · Zhe Li · Chao Xu · Yebin Liu

We present a novel pipeline for learning triangular human avatars from multi-view videos. Recent methods for avatar learning are typically based on neural radiance fields (NeRF), which are not compatible with the traditional graphics pipeline and pose great challenges for operations like editing or synthesizing under different environments. To overcome these limitations, our method represents the avatar with an explicit triangular mesh extracted from an implicit SDF field, complemented by an implicit material field conditioned on given poses. Leveraging this triangular avatar representation, we incorporate physics-based rendering to accurately decompose geometry and material. To enhance both the geometric and appearance details, we further employ a 2D UNet as the network backbone and introduce pseudo normal ground-truth as additional supervision. Experiments show that our method can learn triangular avatars with high-quality geometry reconstruction and material decomposition, inherently supporting editing, manipulation or relighting operations.


# 340
Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation

Taekyung Ki · Dongchan Min · Gyeongsu Chae

In this paper, we present Export3D, a one-shot 3D-aware portrait animation method that is able to control the facial expression and camera view of a given portrait image. To achieve this, we introduce a tri-plane generator that directly generates a tri-plane of 3D prior by transferring the expression parameter of 3DMM into the source image. The tri-plane is then decoded into images of different views through differentiable volume rendering. Existing portrait animation methods heavily rely on image warping to transfer the expression in the motion space, which makes the disentanglement of appearance and expression challenging. In contrast, we propose a contrastive pre-training framework for appearance-free expression parameters, eliminating undesirable appearance swap when transferring a cross-identity expression. Extensive experiments show that our pre-training framework can learn the appearance-free expression representation hidden in 3DMM, and that our model can generate 3D-aware, expression-controllable portrait images without appearance swap in a cross-identity manner.


# 332
Learning to Robustly Reconstruct Dynamic Scenes from Low-light Spike Streams

Liwen Hu · gang ding · Mianzhi Liu · Lei Ma · Tiejun Huang

Spike cameras with high temporal resolution can fire continuous binary spike streams to record per-pixel light intensity. By using reconstruction methods, the scene details in high-speed scenes can be restored from spike streams. However, existing methods struggle to perform well in low-light environments due to insufficient information in spike streams. To this end, we propose a bidirectional recurrent-based reconstruction framework to better handle such extreme conditions. In more detail, a light-robust representation (LR-Rep) is designed to aggregate temporal information in spike streams. Moreover, a fusion module is used to extract temporal features. Besides, we synthesize a reconstruction dataset for high-speed low-light scenes where light sources are carefully designed to be consistent with reality. Experiments show the superiority of our method. Importantly, our method also generalizes well to real spike streams. Our project is: https://github.com/Acnext/Learning-to-Robustly-Reconstruct-Dynamic-Scenes-from-Low-light-Spike-Streams/.


# 335
Strong Double Blind
Wavelength-Embedding-guided Filter-Array Transformer for Spectral Demosaicing

haijin zeng · Hiep Luong · Wilfried Philips

Spectral imaging offers the capability to unveil hidden details within the world around us. However, to fully harness this potential, it is imperative to develop effective spectral demosaicing techniques. Despite the success of learning-based spectral demosaicing methods, three challenges hinder their practical use. Firstly, existing convolutional neural networks and attention-based models struggle to capture spectral similarities and long-range dependencies. Secondly, their performance is unstable when optical characteristics, like multispectral filter array (MSFA) arrangement and wavelength distribution, change. Lastly, they lack a structured approach to incorporating imaging system physics, such as the MSFA pattern. Addressing these challenges, our paper introduces the Wavelength Embedding guided Filter Array Attention Transformer (WeFAT) for effective spectral demosaicing. Specifically, akin to timestep embedding in denoising diffusion models, we propose a Wavelength Embedding guided Multi-head Self-Attention (We-MSA) mechanism to imbue our model with wavelength memory, facilitating adaptation to diverse cameras. This approach treats each spectral feature as a token, directly integrating wavelength information into the attention calculation. Additionally, we develop an MSFA-attention Mechanism (MaM) steering We-MSA to focus on spatial regions yielding high-quality spectral data. Experimental results affirm that WeFAT exhibits strong performance consistency across diverse cameras characterized by varying spectral distributions and MSFA patterns, trained solely on the ARAD dataset. It also outperforms current state-of-the-art methods on both simulated and real datasets.
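Following the timestep-embedding analogy mentioned in the abstract, a plausible (assumed) form of the wavelength embedding is a sinusoidal encoding of each band's center wavelength added to its token before self-attention, as sketched below; this is our reading, not the WeFAT implementation.

```python
# Minimal sketch (assumed, following the timestep-embedding analogy): a sinusoidal
# wavelength embedding added to each spectral-band token before self-attention.
import math
import torch
import torch.nn as nn

def wavelength_embedding(wavelengths_nm, dim=64):
    """wavelengths_nm: (B,) band centers in nanometres -> (B, dim) embeddings."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = wavelengths_nm[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

bands = torch.tensor([450.0, 550.0, 650.0, 750.0])        # example MSFA band centers
tokens = torch.randn(4, 128, 64)                           # (bands, spatial tokens, dim)
tokens = tokens + wavelength_embedding(bands)[:, None, :]  # imbue tokens with wavelength
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, _ = attn(tokens, tokens, tokens)
```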


# 11
Strong Double Blind
Learned HDR Image Compression for Perceptually Optimal Storage and Display

Peibei Cao · HAOYU CHEN · Jingzhe Ma · Yu-Chieh Yuan · Zhiyong Xie · Xin Xie · Haiqing Bai · Kede Ma

High dynamic range (HDR) imaging records natural scenes with luminance distributions closer to the actual scene, at the cost of increased storage requirements and display demands. Consequently, HDR image compression for perceptually optimal storage and display is crucial, yet it remains inadequately addressed. In this work, we take steps towards this goal. Specifically, we learn to compress HDR images into two bitstreams for storage, one of which is used to generate low dynamic range (LDR) images for display purposes conditioned on the maximum luminance of the scene, while the other serves as side information to aid HDR image reconstruction from the generated LDR image. To measure the perceptual quality of the displayable LDR image, we employ the normalized Laplacian pyramid distance (NLPD), a perceptual quality metric that supports the use of the input HDR image as reference. To measure the perceptual quality of the reconstructed HDR image, we employ a newly proposed HDR quality metric based on a simple inverse display model that enables high-fidelity dynamic range expansion at all luminance levels. Comprehensive qualitative and quantitative comparisons on various HDR scenes demonstrate the perceptual optimality of our learned HDR image compression system for both displayable LDR images and reconstructed HDR images at all bit rates.


# 316
Strong Double Blind
Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging

In Cho · Hyunbo Shim · Seon Joo Kim

This paper aims to facilitate more practical NLOS imaging by reducing the number of samplings and scan areas. To this end, we introduce a phasor-based enhancement network that is capable of predicting clean and full measurements from noisy partial observations. We leverage a denoising autoencoder scheme to acquire rich and noise-robust representations in the measurement space. Through this pipeline, our enhancement network is trained to accurately reconstruct complete measurements from their corrupted and partial counterparts. However, we observe that the naive application of denoising often yields degraded and over-smoothed results, caused by unnecessary and spurious frequency signals present in measurements. To address this issue, we introduce a phasor-based pipeline designed to limit the spectrum of our network to the frequency range of interest, where the majority of informative signals are detected. The phasor wavefronts at the aperture, which are band-limited signals, are employed as inputs and outputs of the network, guiding our network to learn from the frequency range of interest and discard unnecessary information. The experimental results in more practical acquisition scenarios demonstrate that we can look around the corners with 16× or 64× fewer samplings and 4× smaller apertures.


# 328
Leveraging Thermal Modality to Enhance Reconstruction in Low-Light Conditions

Jiacong Xu · Mingqian Liao · Ram Prabhakar Kathirvel · Vishal Patel

Neural Radiance Fields (NeRF) accomplishes photo-realistic novel view synthesis by learning the implicit volumetric representation of a scene from multi-view images, which faithfully convey the colorimetric information. However, sensor noise will contaminate low-value pixel signals, and the lossy camera image signal processor will further remove near-zero intensities in extremely dark situations, deteriorating the synthesis performance. Existing approaches reconstruct low-light scenes from raw images but struggle to recover texture and boundary details in dark regions. Additionally, they are unsuitable for high-speed models relying on explicit representations. To address these issues, we present Thermal-NeRF, which takes thermal and visible raw images as inputs to accomplish visible and thermal view synthesis simultaneously, considering that thermal cameras are robust to illumination variation and that raw images preserve any possible clues in the dark. We also establish the first multi-view thermal and visible dataset (MVTV) to support research on multimodal NeRF. Thermal-NeRF achieves the best trade-off between detail preservation and noise smoothing and provides better synthesis performance than previous work. Finally, we demonstrate that both modalities are beneficial to each other in 3D reconstruction.


# 326
The Sky's the Limit: Relightable Outdoor Scenes via a Sky-pixel Constrained Illumination Prior and Outside-In Visibility

James Gardner · Evgenii Kashin · Bernhard Egger · William Smith

Inverse rendering of outdoor scenes from unconstrained image collections is a challenging task, particularly due to illumination/albedo ambiguities and occlusion of the illumination environment (shadowing) caused by geometry. However, there are many cues in an image that can aid in the disentanglement of geometry, albedo and shadows. Whilst the sky is frequently masked out in state-of-the-art methods, we exploit the fact that any sky pixel provides a direct observation of distant lighting in the corresponding direction and, via a neural illumination prior, a statistical cue to derive the remaining illumination environment. The incorporation of our illumination prior is enabled by a novel `outside-in' method for computing differentiable sky visibility based on a neural directional distance function. This is highly efficient and can be trained in parallel with the neural scene representation, allowing gradients from the appearance loss to flow from shadows to influence the estimation of illumination and geometry. Our method estimates high-quality albedo, geometry, illumination and sky visibility, achieving state-of-the-art results on the NeRF-OSR relighting benchmark.


# 323
Strong Double Blind
A Probability-guided Sampler for Neural Implicit Surface Rendering

Gonçalo José Dias Pais · Valter André Piedade · Moitreya Chatterjee · Marcus Greiff · Pedro Miraldo

Several variants of Neural Radiance Fields (NeRFs) have significantly improved the accuracy of synthesized images and surface reconstruction of 3D scenes/objects. A key characteristic shared by all of these methods is that, due to scalability issues, none can train the neural network on every possible input, specifically every pixel and every potential 3D point along the projection rays. While vanilla NeRFs uniformly sample both the image pixels and 3D points along the projection rays, some variants focus only on guiding the sampling of the 3D points along the projection rays. In this paper, we leverage the implicit surface representation of the foreground scene and model a probability density function in a 3D image projection space to achieve a more targeted sampling of the rays toward regions of interest, resulting in improved rendering. Additionally, a new surface reconstruction loss is proposed for improved performance. This new loss fully explores the proposed 3D image projection space model and incorporates near-to-surface and empty space components. By integrating our novel sampling strategy and novel loss into any current state-of-the-art neural implicit surface renderer, we achieve more accurate and detailed 3D reconstructions and improved image rendering, especially for the regions of interest in any given scene. The code will be made available.
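As a rough illustration of probability-guided ray sampling, the sketch below converts coarse SDF values along each ray into a near-surface probability and draws fine samples by inverse-CDF sampling. The specific density (a Laplacian of |SDF|) and all shapes are assumptions, not the paper's model of the 3D image projection space.

```python
# Minimal sketch (our assumption of the idea, not the paper's exact density): turn an
# SDF evaluated at coarse samples into a near-surface probability, then draw fine
# samples along each ray by inverse-CDF sampling.
import torch

def probability_guided_samples(sdf_coarse, t_coarse, n_fine=64, beta=0.05):
    """sdf_coarse: (R, Nc) signed distances at coarse depths t_coarse: (R, Nc)."""
    weights = torch.exp(-sdf_coarse.abs() / beta) + 1e-5     # peaks near the surface
    pdf = weights / weights.sum(dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    u = torch.rand(sdf_coarse.shape[0], n_fine, device=sdf_coarse.device)
    idx = torch.searchsorted(cdf, u).clamp(max=t_coarse.shape[-1] - 1)
    return torch.gather(t_coarse, -1, idx)                   # (R, n_fine) fine depths

t = torch.linspace(0.5, 4.0, 96).expand(1024, 96).contiguous()
sdf = torch.randn(1024, 96) * 0.3
t_fine = probability_guided_samples(sdf, t)
```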


# 336
REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices

Chaojie Ji · Yufeng Li · Yiyi Liao

This work tackles the challenging task of achieving real-time novel view synthesis on various scenes, including highly reflective objects and unbounded outdoor scenes. Existing real-time rendering methods, especially those based on meshes, often have subpar performance in modeling surfaces with rich view-dependent appearances. Our key idea lies in leveraging meshes for rendering acceleration while incorporating a novel approach to parameterize view-dependent information. We decompose the color into diffuse and specular components, and model the specular color in the reflected direction based on a neural environment map. Our experiments demonstrate that our method achieves comparable reconstruction quality for highly reflective surfaces compared to state-of-the-art offline methods, while also efficiently enabling real-time rendering on edge devices such as smartphones.
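The diffuse/specular split described above can be illustrated in a few lines: the specular term is looked up from a learned environment map at the mirror-reflected view direction. The tiny MLP used here as the environment map and the shading formula are illustrative assumptions, not REFRAME's parameterization.

```python
# Minimal sketch (assumed interfaces, not REFRAME's code): final color as a diffuse
# term plus a specular term looked up from a neural environment map at the
# reflected view direction.
import torch
import torch.nn as nn
import torch.nn.functional as F

env_map = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))  # direction -> RGB

def shade(diffuse, normal, view_dir):
    """diffuse: (N, 3); normal, view_dir: (N, 3) unit vectors, view_dir pointing at the surface."""
    refl = view_dir - 2.0 * (view_dir * normal).sum(-1, keepdim=True) * normal
    specular = env_map(refl)
    return (diffuse + specular).clamp(0.0, 1.0)

n = F.normalize(torch.randn(8, 3), dim=-1)
v = F.normalize(torch.randn(8, 3), dim=-1)
rgb = shade(torch.rand(8, 3), n, v)
```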


# 331
Strong Double Blind
Dynamic Neural Radiance Field From Defocused Monocular Video

Xianrui Luo · Huiqiang Sun · Juewen Peng · Zhiguo Cao

Dynamic Neural Radiance Field (NeRF) from monocular videos has recently been explored for space-time novel view synthesis and achieved excellent results. However, defocus blur caused by depth variation often occurs in video capture, compromising the quality of dynamic reconstruction because the lack of sharp details interferes with modeling temporal consistency between input views. To tackle this issue, we propose the first dynamic NeRF method designed to restore sharp novel views from defocused monocular videos. We introduce layered Depth-of-Field (DoF) volume rendering to model the defocus blur and reconstruct a sharp NeRF supervised by defocused views. The blur model is inspired by the connection between DoF rendering and volume rendering: the opacity in volume rendering aligns with the inter-layer visibility in layered DoF rendering. To execute the blurring, we modify the layered blur kernel into a ray-based kernel and employ an optimized sparse kernel to gather the input rays efficiently and render the optimized rays with our layered DoF volume rendering. We synthesize a dataset with defocused dynamic scenes for our task, and extensive experiments on our dataset show that our method outperforms existing approaches in synthesizing all-in-focus novel views from defocus blur while maintaining spatial-temporal consistency in the scene.


# 330
Strong Double Blind
VersatileGaussian: Real-time Neural Rendering for Versatile Tasks using Gaussian Splatting

Renjie Li · Zhiwen Fan · Bohua Wang · Peihao Wang · Zhangyang Wang · Xi Wu

The acquisition of multi-task (MT) labels in 3D scenes is crucial for a wide range of real-world applications. Traditional methods generally employ an analysis-by-synthesis approach, generating 2D label maps on novel synthesized views, or utilize Neural Radiance Fields (NeRF), which concurrently represent label maps. Yet, these approaches often struggle to balance inference efficiency with MT label quality. Specifically, they face limitations such as (a) constrained rendering speeds due to NeRF pipelines, and (b) the implicit representation of MT fields that can result in continuity artifacts during rendering. Recently, 3D Gaussian Splatting has shown promise in achieving real-time rendering speeds without compromising rendering quality. In our research, we address the challenge of enabling 3D Gaussian Splatting to represent versatile MT labels. Simply attaching MT attributes to explicit Gaussians compromises rendering quality due to the lack of cross-task information flow during optimization. We introduce architectural and rasterizer designs to overcome this issue effectively. Our VersatileGaussian model innovatively associates Gaussians with shared MT features and incorporates a feature map rasterizer. The cornerstone of this versatile rasterization is the Task Correlation Attention module, which fosters cross-task correlations through a soft weighting mechanism that disseminates task-specific knowledge. Experiments on the ScanNet and Replica datasets show that VersatileGaussian not only sets a new benchmark in MT accuracy but also maintains real-time rendering speeds (35 FPS). Importantly, this model design facilitates mutual benefits across tasks, leading to improved quality in novel view synthesis.


# 329
Strong Double Blind
DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes

Jing-Wen Yang · Jia-Mu Sun · Yong-Liang Yang · Jie Yang · Ying Shan · Yanpei Cao · LIN GAO

Neural Radiance Fields (NeRF) have achieved remarkable progress on dynamic scenes with deformable objects. Nonetheless, most previous works required multi-view inputs or long training times (several hours), making them hard to apply in real-world scenarios. Recently, a series of works have been dedicated to addressing the blurry artifacts present in synthesized novel views given a monocular input of dynamic scenes. However, they may fail to predict stable and accurate deformation while keeping high-frequency details when rendering at various resolutions. To this end, we introduce a novel framework, DMiT (Deformable Mipmapped Tri-Plane), that adopts mipmaps to render dynamic scenes at various resolutions from novel views. With the help of hierarchical mipmapped tri-planes, we incorporate an MLP to effectively predict a mapping between the observation space and the canonical space, enabling not only high-fidelity dynamic scene rendering but also high-performance training and inference. Moreover, a training scheme for joint geometry and deformation refinement is designed for canonical regularization to reconstruct high-quality geometries. Extensive experiments on both synthetic and real dynamic scenes demonstrate the efficacy and efficiency of our method.


# 334
Strong Double Blind
NeRF-XL: NeRF at Any Scale with Multi-GPU

Ruilong Li · Sanja Fidler · Angjoo Kanazawa · Francis Williams

In this paper, we first revisit the existing approach of decomposing large-scale scenes into multiple independently trained Neural Radiance Fields (NeRFs), and identify several fundamental issues that prevent performance from improving with additional computational resources (GPUs), contradicting the fundamental objective of leveraging multi-GPU setups to enhance large-scale NeRF performance. Subsequently, we introduce NeRF-XL, a principled algorithm designed to efficiently harness multi-GPU setups for performance improvement, thereby enabling NeRF at any scale. At its core, our method allocates non-overlapping NeRFs across disjoint spatial regions and optimizes them jointly across GPUs. We reduce the GPU communication overhead by rewriting the volume rendering equation and relevant loss terms, enhancing training and rendering efficiency. Without any heuristics, our approach gracefully reveals scaling laws for NeRFs in the multi-GPU setting across various types of data and scales, including the first NeRF reconstruction of the largest open-source dataset to date, MatrixCity, with 258K images covering a 25 km^2 city area.
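
As a sketch of why joint training across GPUs can be made cheap (our notation, not the paper's derivation): when a ray is partitioned into $K$ consecutive, disjoint segments, one per GPU, the volume rendering integral factorizes so that each GPU only needs its own segment's color and transmittance plus the accumulated transmittance of the preceding segments:

$$
C(\mathbf{r}) \;=\; \sum_{k=1}^{K} \Big(\prod_{j=1}^{k-1} T_j\Big)\, C_k,
\qquad
T_k = \exp\!\Big(-\!\int_{t_{k-1}}^{t_k}\! \sigma(\mathbf{r}(s))\, ds\Big),
\qquad
C_k = \int_{t_{k-1}}^{t_k}\! \exp\!\Big(-\!\int_{t_{k-1}}^{t}\! \sigma(\mathbf{r}(s))\, ds\Big)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t))\, dt .
$$

Under this factorization each GPU exchanges only a few per-ray scalars ($T_k$ and $C_k$) rather than feature volumes, which is consistent with the reduced communication overhead described above.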


# 297
Strong Double Blind
G2fR: Frequency Regularization in Grid-based Feature Encoding Neural Radiance Fields

Shuxiang Xie · Shuyi Zhou · Ken Sakurada · Ryoichi Ishikawa · Masaki Onishi · Takeshi Oishi

Neural Radiance Field (NeRF) methodologies have garnered considerable interest, particularly with the introduction of grid-based feature encoding (GFE) approaches such as Instant-NGP and TensoRF. Conventional NeRF employs positional encoding (PE) and represents a scene with a Multi-Layer Perceptron (MLP). Frequency regularization has been identified as an effective strategy to overcome primary challenges in PE-based NeRFs, including dependency on known camera poses and the requirement for extensive image datasets. While several studies have endeavored to extend frequency regularization to GFE approaches, there is still a lack of basic theoretical foundations for these methods. Therefore, we first clarify the underlying mechanisms of frequency regularization. Subsequently, we conduct a comprehensive investigation into the expressive capability of GFE-based NeRFs and attempt to connect frequency regularization with GFE methods. Moreover, we propose a generalized strategy, G2fR: Generalized Grid-based Frequency Regularization, to address issues of camera pose optimization and few-shot reconstruction with GFE methods. We validate the efficacy of our methods through an extensive series of experiments employing various representations across diverse scenarios.


# 314
Strong Double Blind
InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction

Xulong Wang · Siyan Dong · Youyi Zheng · Yanchao Yang

3D surface reconstruction from multi-view images is essential for scene understanding and interaction. However, complex indoor scenes pose challenges such as ambiguity due to limited observations. Recent implicit surface representations, such as Neural Radiance Fields (NeRFs) and signed distance functions (SDFs), employ various geometric priors to compensate for the lack of observed information. Nevertheless, their performance heavily depends on the quality of the pre-trained geometry estimation models. To ease such dependence, we propose regularizing the geometric modeling by explicitly encouraging the mutual information between surface normals of two highly correlated points. In this way, the geometry learning process is modulated by the second-order correlations from noisy (first-order) geometric priors, thus eliminating the bias due to poor generalization. Additionally, we introduce a simple yet effective scheme that utilizes semantic and geometric features to identify correlated points, enhancing their mutual information accordingly. The proposed technique can serve as a plugin for SDF-based neural surface representations. Our experiments demonstrate the effectiveness of the proposed technique in improving the surface reconstruction quality of major state-of-the-art methods, and we will make our code publicly available to support future research.


# 322
MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections

Jiayue Liu · Tang Xiao · Freeman Cheng · Zihao Yang · Zhihao Li · Jianzhuang Liu · Yi Huang · Jiaqi Lin · Shiyong Liu · Xiaofei Wu · Xu Songcen · Chun Yuan

3D Gaussian Splatting showcases notable advancements in photo-realistic and real-time novel view synthesis. However, it faces challenges in modeling mirror reflections, which exhibit substantial appearance variations from different viewpoints. To tackle this problem, we present MirrorGaussian, the first method for mirror scene reconstruction with real-time rendering based on 3D Gaussian Splatting. The key insight is grounded in the mirror symmetry between the real-world space and the virtual mirror space. We introduce an intuitive dual-rendering strategy that enables differentiable rasterization of both the real-world 3D Gaussians and their mirrored counterparts obtained by reflecting the former about the mirror plane. All 3D Gaussians are jointly optimized with the mirror plane in an end-to-end framework. MirrorGaussian achieves high-quality and real-time rendering in scenes with mirrors, empowering scene editing such as adding new mirrors and objects. Comprehensive experiments on multiple datasets demonstrate that our approach significantly outperforms existing methods, achieving state-of-the-art results.
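
A minimal geometric sketch of the mirroring step, assuming the mirror plane is given by a unit normal n and offset d (so the plane is {x : n·x + d = 0}); the function and variable names are ours, not the paper's API:

```python
import numpy as np

def reflect_gaussians(means, covs, n, d):
    """Reflect 3D Gaussians about the plane {x : n.x + d = 0}.

    means: (N, 3) Gaussian centers, covs: (N, 3, 3) covariances,
    n: unit plane normal (3,), d: plane offset (scalar).
    """
    n = n / np.linalg.norm(n)
    H = np.eye(3) - 2.0 * np.outer(n, n)      # Householder reflection matrix
    dist = means @ n + d                      # signed distance of each center to the plane
    means_ref = means - 2.0 * dist[:, None] * n   # p' = p - 2 (n.p + d) n
    covs_ref = H @ covs @ H.T                 # Sigma' = H Sigma H^T
    return means_ref, covs_ref

# toy usage
means = np.array([[0.5, 0.2, 1.0]])
covs = np.tile(np.diag([0.01, 0.02, 0.03]), (1, 1, 1))
m_ref, c_ref = reflect_gaussians(means, covs, n=np.array([0.0, 0.0, 1.0]), d=0.0)
```

In a full 3DGS pipeline the view-dependent appearance (e.g., spherical harmonics) would also need to be mirrored consistently; the sketch only covers the geometric part.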


# 312
Strong Double Blind
Disentangled Generation and Aggregation for Robust Radiance Fields

Shihe Shen · Huachen Gao · Wangze Xu · Rui Peng · Luyang Tang · Kaiqiang Xiong · Jianbo Jiao · Ronggang Wang

The utilization of triplane-based radiance fields has gained attention in recent years due to its ability to effectively disentangle 3D scenes with a high-quality representation and low computation cost. A key requirement of this method is the precise input of camera poses. However, due to the local-update property of the triplane, joint estimation in the style of previous joint pose-NeRF optimization works easily falls into local minima. To this end, we propose the Disentangled Triplane Generation module to introduce global feature context and smoothness into triplane learning, which mitigates errors caused by local updating. Then, we propose Disentangled Plane Aggregation to mitigate the entanglement caused by the common triplane feature aggregation during camera pose updating. In addition, we introduce a two-stage warm-start training strategy to reduce the implicit constraints caused by the triplane generator. Quantitative and qualitative results demonstrate that our proposed method achieves state-of-the-art performance in novel view synthesis with noisy or unknown camera poses, as well as efficient convergence of optimization. Code will be available soon.


# 319
CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians

Avinash Paliwal · Wei Ye · Jinhui Xiong · Dmytro Kotovenko · Rakesh Ranjan · Vikas Chandra · Nima Khademi Kalantari

The field of 3D reconstruction from images has rapidly evolved in the past few years, first with the introduction of Neural Radiance Fields (NeRF) and more recently with 3D Gaussian Splatting (3DGS). The latter provides a significant edge over NeRF in terms of fast training and real-time inference while improving the reconstruction quality. Although the current 3DGS approach works well for dense input images, the unstructured, point-cloud-like representation quickly overfits to the more challenging setup of sparse training images (e.g., 3 images), creating a representation that appears as a jumble of needles from novel views. We propose to solve this issue with regularized optimization and depth-based initialization. Specifically, we optimize the Gaussian blobs to smoothly and independently deform different object surfaces, compensating for the inaccuracies of the initialization by utilizing an implicit convolutional decoder and a total variation loss. To support our regularized optimization, we initialize a 3D Gaussian representation from each input view through a novel technique that utilizes monocular depth. We demonstrate significant improvements in recovering scene geometry and texture compared to state-of-the-art sparse-view NeRF-based approaches on a variety of scenes.
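
As one concrete illustration of the regularizers mentioned above, here is a generic total variation penalty over a per-view 2D map (such as a decoded depth or deformation map); it is the standard TV loss, not necessarily the paper's exact formulation:

```python
import torch

def total_variation_loss(x: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation of a (B, C, H, W) map.

    Penalizes differences between neighboring pixels, encouraging
    smooth, coherent deformations/depths within each surface.
    """
    dh = (x[..., 1:, :] - x[..., :-1, :]).abs().mean()
    dw = (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    return dh + dw

# toy usage: smoothness penalty on a predicted per-view depth map
depth = torch.rand(1, 1, 64, 64, requires_grad=True)
loss = total_variation_loss(depth)
loss.backward()
```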


# 318
SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians

Hiba Dahmani · Moussab Bennehar · Nathan Piasco · Luis G Roldao Jimenez · Dzmitry Tsishkou

Implicit neural representation methods have shown impressive advancements in learning 3D scenes from unstructured in-the-wild photo collections but are still limited by the large computational cost of volumetric rendering. More recently, 3D Gaussian Splatting emerged as a much faster alternative with superior rendering quality and training efficiency, especially for small-scale and object-centric scenarios. Nevertheless, this technique suffers from poor performance on unstructured in-the-wild data. To tackle this, we extend 3D Gaussian Splatting to handle unstructured image collections by modeling appearance to capture photometric variations in the rendered images. Additionally, we introduce a new mechanism to train transient Gaussians to handle the presence of scene occluders in an unsupervised manner. Experiments on diverse photo-collection scenes and multi-pass acquisitions of outdoor landmarks show the effectiveness of our method over prior works, achieving state-of-the-art results with improved efficiency.


# 311
Strong Double Blind
Surface Reconstruction for 3D Gaussian Splatting via Local Structural Hints

Qianyi Wu · Jianmin Zheng · Jianfei Cai

This paper presents a novel approach for surface mesh reconstruction from 3D Gaussian Splatting (3DGS), a technique renowned for its efficiency in novel view synthesis but challenged in surface reconstruction. The key obstacle is the lack of geometry hints to regulate the optimization of millions of unorganized Gaussian blobs so that they align to the true surface. This paper introduces local structural hints during training to address the challenge. We first leverage the prior knowledge from monocular normal and depth estimations to refine the covariance and mean of Gaussian primitives, enhancing their organization and providing crucial normal information for surface extraction. However, due to the highly discrete nature of Gaussian primitives, such geometry guidance remains insufficient for alignment with the true surface. We then propose to construct a signed distance field by a moving least squares (MLS) function over the Gaussians in each local region. More importantly, we further propose to jointly learn a neural implicit network to mimic and regularize the MLS function. The joint optimization guides Gaussian Splatting toward accurate surface alignment. Extensive experimental results demonstrate the effectiveness of our method in achieving superior mesh quality compared with state-of-the-art surface reconstruction methods for 3DGS. Our code will be released.


# 317
Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting

Zheng Zhang · WENBO HU · Yixing Lao · Tong He · Hengshuang ZHAO

3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis results while advancing real-time rendering performance. However, it relies heavily on the quality of the initial point cloud, resulting in blurring and needle-like artifacts in areas with insufficient initializing points. This is mainly attributed to the point cloud growth condition in 3DGS, which only considers the average gradient magnitude of points from observable views and thereby fails to grow enough points in these areas. To this end, we propose a novel method, named Pixel-GS, that takes into account the number of pixels covered by the Gaussian in each view during the computation of the growth condition. We regard the covered pixel counts as weights to dynamically average the gradients from different views. This mitigates the issue of large Gaussians participating in computations across many viewpoints, a significant number of which cover only a few pixels in boundary regions, yielding smaller gradients and consequently lowering the average gradient magnitude. As a result, points within areas with insufficient initializing points can be grown more effectively, leading to a more accurate and detailed reconstruction. Besides, we also propose a simple yet effective strategy to scale the gradient field according to the distance to the camera, to suppress the growth of floaters near the camera. Extensive experiments demonstrate both qualitatively and quantitatively that our method achieves state-of-the-art rendering quality while maintaining real-time rendering speed on the challenging Mip-NeRF 360 and Tanks & Temples datasets. The source code will be publicly available.
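
A minimal sketch of the pixel-weighted averaging idea, contrasted with a plain per-view average of gradient magnitudes; variable names and shapes are illustrative assumptions:

```python
import torch

def average_gradient_plain(grad_norms, visible):
    """Vanilla 3DGS-style criterion: average gradient magnitude over
    the views in which each Gaussian is visible.

    grad_norms: (V, N) per-view, per-Gaussian positional gradient norms
    visible:    (V, N) boolean visibility mask
    """
    num = (grad_norms * visible).sum(dim=0)
    return num / visible.sum(dim=0).clamp(min=1)

def average_gradient_pixel_weighted(grad_norms, pixels_covered):
    """Pixel-aware criterion: weight each view's gradient by the number
    of pixels the Gaussian covers in that view, so views where it spans
    only a few boundary pixels no longer drag the average down.

    pixels_covered: (V, N) pixel counts (0 where not visible)
    """
    num = (grad_norms * pixels_covered).sum(dim=0)
    return num / pixels_covered.sum(dim=0).clamp(min=1)

# Gaussians whose (weighted) average exceeds a threshold would then be
# densified (split or cloned), mirroring the growth condition above.
```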


# 320
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

Kai Zhang · Sai Bi · Hao Tan · Yuanbo Xiangli · Nanxuan Zhao · Kalyan Sunkavalli · Zexiang Xu

We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in ~0.23 seconds on a single A100 GPU. Our model features a very simple transformer-based architecture: we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. In contrast to previous LRMs that can only reconstruct objects, by predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large variations in scale and complexity. We show that our model can work on both object and scene captures by training it on Objaverse and RealEstate10K, respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks.


# 321
SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting

Richard Shaw · Michal Nazarczuk · Song Jifei · Arthur Moreau · Sibi Catley-Chandar · Helisa Dhamo · Eduardo Pérez Pellitero

Novel view synthesis has shown rapid progress recently, with methods capable of producing increasingly photorealistic results. 3D Gaussian Splatting has emerged as a promising method, producing high-quality renderings of scenes and enabling interactive viewing at real-time frame rates. However, it is limited to static scenes. In this work, we extend 3D Gaussian Splatting to reconstruct dynamic scenes. We model a scene's dynamics using dynamic MLPs, learning deformations from temporally-local canonical representations to per-frame 3D Gaussians. To disentangle static and dynamic regions, tuneable parameters weigh each Gaussian's respective MLP parameters, improving the dynamics modelling of imbalanced scenes. We introduce a sliding window training strategy that partitions the sequence into smaller manageable windows to handle arbitrary length scenes while maintaining high rendering quality. We propose an adaptive sampling strategy to determine appropriate window size hyperparameters based on the scene's motion, balancing training overhead with visual quality. Training a separate dynamic 3D Gaussian model for each sliding window allows the canonical representation to change, enabling the reconstruction of scenes with significant geometric changes. Temporal consistency is enforced using a fine-tuning step with self-supervising consistency loss on randomly sampled novel views. As a result, our method produces high-quality renderings of general dynamic scenes with competitive quantitative performance, which can be viewed in real-time in our dynamic interactive viewer.


# 301
Strong Double Blind
An Adaptive Screen-Space Meshing Approach for Normal Integration

Moritz Heep · Eduard Zell

Reconstructing surfaces from normals is a key component of photometric stereo. This work introduces an adaptive surface triangulation in the image domain and then performs the normal integration on the resulting triangular mesh. Our key insight is that surface curvature can be computed from normals. Based on curvature, we identify flat areas and aggregate pixels into triangles. The approximation quality is controlled by a single user parameter, facilitating a seamless generation of low- to high-resolution meshes. Compared to pixel grids, our triangle meshes adapt locally to surface details and allow for a sparser representation. Our new mesh-based formulation of the normal integration problem is strictly derived from discrete differential geometry and leads to well-conditioned linear systems. Results on real and synthetic data show that 10 to 100 times fewer vertices than pixels can be achieved. Experiments suggest that this sparsity translates into a sublinear runtime in the number of pixels. For 64 MP normal maps, our meshing-first approach generates and integrates meshes in minutes, while pixel-based approaches require hours for the integration alone.
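
A hedged sketch of the first step: estimating a per-pixel curvature proxy from a normal map with finite differences and flagging "flat" pixels that can be merged into larger triangles. The curvature measure and threshold here are illustrative, not the paper's exact formulation:

```python
import numpy as np

def curvature_from_normals(normals: np.ndarray) -> np.ndarray:
    """Curvature proxy from an (H, W, 3) unit-normal map.

    Uses the magnitude of the screen-space gradient of the normal field;
    where normals change quickly, curvature is high.
    """
    dn_dy, dn_dx = np.gradient(normals, axis=(0, 1))
    return np.sqrt((dn_dx ** 2).sum(-1) + (dn_dy ** 2).sum(-1))

def flat_mask(normals: np.ndarray, tau: float = 0.05) -> np.ndarray:
    """Pixels below the curvature threshold can be merged into coarser
    triangles; detailed regions keep a fine triangulation."""
    return curvature_from_normals(normals) < tau
```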


# 325
Fast View Synthesis of Casual Videos with Soup-of-Planes

Yao-Chih Lee · Zhoutong Zhang · Kevin Blackburn-Matzen · Simon Niklaus · Jianming Zhang · Jia-Bin Huang · Feng Liu

Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. We opt to represent the dynamic content as per-frame point clouds for efficiency. While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. We develop a method to quickly estimate such a hybrid video representation and render novel views in real time. Our experiments show that our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100x faster in training and enabling real-time rendering.


# 307
Strong Double Blind
4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation

Feng Cheng · Mi Luo · Huiyu Wang · Alex Dimakis · Lorenzo Torresani · Gedas Bertasius · Kristen Grauman

We present 4Diff, a 3D-aware diffusion model addressing the exo-to-ego viewpoint translation problem. This task involves generating first-person (egocentric) view images from third-person (exocentric) images. Leveraging the diffusion model's ability to generate photorealistic images, we propose a transformer-based diffusion model that incorporates geometry priors via the proposed mechanisms: (i) egocentric prior rendering and (ii) 3D-aware rotary cross-attention. The former integrates egocentric layout cues through point cloud rasterization, while the latter incorporates exocentric semantic features by guiding attention between diffusion model feature maps and exocentric semantic features, considering their geometric relationships. Our experiments on the challenging and diverse Ego-Exo4D multiview dataset demonstrate superior performance compared to state-of-the-art approaches. Notably, our approach exhibits robust generalization to novel environments not encountered during training. The code and pretrained models will be made public.


# 303
GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Xiao Fu · Wei Yin · Mu Hu · Kaixuan Wang · Yuexin Ma · Ping Tan · Shaojie Shen · Dahua Lin · Xiaoxiao Long

We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes, e.g., depth and normals, from single images. While significant research has already been conducted in this area, the progress has been substantially limited by the low diversity and poor quality of publicly available datasets. As a result, the prior works either are constrained to limited scenarios or suffer from the inability to capture geometric details. In this paper, we demonstrate that generative models, as opposed to traditional discriminative models (e.g., CNNs and Transformers), can effectively address the inherently ill-posed problem. We further show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage. Specifically, we extend the original stable diffusion model to jointly predict depth and normal, allowing mutual information exchange and high consistency between the two representations. More importantly, we propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions. This strategy enables our model to recognize different scene layouts, capturing 3D geometry with remarkable fidelity. GeoWizard sets new benchmarks for zero-shot depth and normal prediction, significantly enhancing many downstream applications such as 3D reconstruction, 2D content creation, and novel viewpoint synthesis.


# 93
Viewpoint textual inversion: discovering scene representations and 3D view control in 2D diffusion models

James Burgess · Kuan-Chieh Wang · Serena Yeung-Levy

Text-to-image diffusion models generate impressive and realistic images, but do they learn to represent the 3D world from only 2D supervision? We demonstrate that yes, certain 3D scene representations are encoded in the text embedding space of models like Stable Diffusion. Our approach, Viewpoint Neural Textual Inversion (ViewNeTI), is to discover \textit{3D view tokens}; these tokens control the 3D viewpoint --- the rendering pose in a scene --- of generated images. Specifically, we train a small neural mapper to take continuous camera viewpoint parameters and predict a view token (a word embedding); this token conditions diffusion generation via cross-attention to produce images with the desired camera viewpoint. Using ViewNeTI as an evaluation tool, we report two findings: first, the text latent space has a continuous view-control manifold for particular 3D scenes; second, we find evidence for a generalized view-control manifold for all scenes. We conclude that since the view token controls the 3D `rendering' viewpoint, there is likely a scene representation embedded in frozen 2D diffusion models. Finally, we exploit the 3D scene representations for 3D vision tasks, namely view-controlled text-to-image generation and novel view synthesis from a single image, where our approach sets the state of the art in LPIPS.
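
A minimal sketch of the kind of small neural mapper described above: an MLP from continuous camera parameters to a token embedding that can be injected into the text-conditioning stream of a frozen diffusion model. The dimensions and names are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class ViewTokenMapper(nn.Module):
    """Map camera viewpoint parameters to a pseudo-word embedding."""

    def __init__(self, cam_dim: int = 12, token_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cam_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, cam_params: torch.Tensor) -> torch.Tensor:
        # cam_params: (B, cam_dim), e.g. a flattened 3x4 camera-to-world matrix
        return self.net(cam_params)

# The predicted token would replace/augment a placeholder word embedding in the
# frozen text encoder, conditioning generation on the desired viewpoint.
mapper = ViewTokenMapper()
view_token = mapper(torch.randn(1, 12))   # shape (1, 768)
```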


# 302
ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

Yongwei Chen · Tengfei Wang · Tong Wu · Xingang Pan · Kui Jia · Ziwei Liu

Generating high-quality 3D assets from a given image is highly desirable in various applications such as AR/VR. Recent advances in single-image 3D generation explore feed-forward models that learn to infer the 3D model of an object without optimization. Though promising results have been achieved in single object generation, these methods often struggle to model complex 3D assets that inherently contain multiple objects. In this work, we present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. 1) We first perform an in-depth analysis of this "multi-object gap" from both model and data perspectives. 2) Next, with reconstructed 3D models of different objects, we seek to adjust their sizes, rotation angles, and locations to create a 3D asset that matches the given image. 3) To automate this process, we apply spatially-aware score distillation sampling (SSDS) from pretrained diffusion models to guide the positioning of objects. Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling, and thus achieves more accurate results. Extensive experiments validate ComboVerse achieves clear improvements over existing methods in generating compositional 3D assets.


# 324
LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

Yushi Lan · Fangzhou Hong · Shuai Yang · Shangchen Zhou · Xuyi Meng · Bo Dai · Xingang Pan · Chen Change Loy

The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and perceptually equivalent latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.


# 296
External Knowledge Enhanced 3D Scene Generation from Sketch

Zijie Wu · Mingtao Feng · Yaonan Wang · He Xie · Weisheng Dong · Bo Miao · Ajmal Mian

Generating realistic 3D scenes is challenging due to the complexity of room layouts and object geometries. We propose a sketch-based, knowledge-enhanced diffusion architecture (SEK) for generating customized, diverse, and plausible 3D scenes. SEK conditions the denoising process with a hand-drawn sketch of the target scene and cues from an object relationship knowledge base. We first construct an external knowledge base containing object relationships and then leverage knowledge-enhanced graph reasoning to assist our model in understanding hand-drawn sketches. A scene is represented as a combination of 3D objects and their relationships, and then incrementally diffused to reach a Gaussian distribution. We propose a 3D denoising scene transformer that learns to reverse the diffusion process, conditioned on a hand-drawn sketch along with knowledge cues, to regressively generate the scene including the 3D object instances as well as their layout. Experiments on the 3D-FRONT dataset show that our model improves FID and CKL by 17.41\% and 37.18\% in 3D scene generation and FID and KID by 19.12\% and 20.06\% in 3D scene completion compared to the nearest competitor, DiffuScene.


# 309
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

Guangyao Zhai · Evin Pınar Örnek · Dave Zhenyu Chen · Ruotong Liao · Yan Di · Nassir Navab · Federico Tombari · Benjamin Busam

We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enables collaborative information exchange, enhancing controllable and consistent generation aware of global constraints. This is achieved through an information echo scheme in both shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models will be released.


# 315
Strong Double Blind
3DEgo: 3D Editing on the Go!

Umar Khalid · Hasan Iqbal · Azib Farooq · Jing Hua · Chen Chen

We introduce 3DEgo to address a novel problem of directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts. Conventional methods construct a text-conditioned 3D scene through a three-stage process, involving pose estimation using Structure-from-Motion (SfM) libraries like COLMAP, initializing the 3D model with unedited images, and iteratively updating the dataset with edited images to achieve a 3D scene with text fidelity. Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow by overcoming the reliance on COLMAP and eliminating the cost of model initialization. We apply a diffusion model to edit video frames prior to 3D scene creation by incorporating our designed \textit{noise blender module} for enhancing multi-view editing consistency, a step that does not require additional training or fine-tuning of T2I diffusion models. 3DEgo utilizes 3D Gaussian Splatting to create 3D scenes from the multi-view consistent edited frames, capitalizing on the inherent temporal continuity and explicit point cloud data. 3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources, as validated by extensive evaluations on six datasets, including our own prepared GS25 dataset.


# 305
Strong Double Blind
Learning Pseudo 3D Guidance for View-consistent Texturing with 2D Diffusion

Kehan Li · Yanbo Fan · Yang Wu · Zhongqian Sun · Wei Yang · Xiangyang Ji · Li Yuan · Jie Chen

Text-driven 3D texturing requires the generation of high-fidelity texture that conforms to a given geometry and description. Recently, the high-quality text-to-image generation ability of 2D diffusion models has significantly promoted this task by converting it into a texture optimization process guided by multi-view synthesized images, where the generation of high-quality and multi-view consistent images becomes the key issue. State-of-the-art methods achieve consistency between different views by treating image generation on a novel view as image inpainting conditioned on the texture generated from previous views. However, due to the accumulated semantic divergence of local inpainting and the occlusion between object parts on sparse views, these inpainting-based methods often fail to deal with long-range texture consistency. To address these issues, we present P3G, a texturing approach based on learned Pseudo 3D Guidance. The key idea of P3G is to first learn a coarse but consistent texture to serve as global semantic guidance for encouraging consistency between images generated from different views. To this end, we incorporate pre-trained text-to-image diffusion models and multi-view optimization to propagate accurate semantics globally for learning the guidance, and design an efficient framework for high-quality and multi-view consistent image generation that integrates the learned semantic guidance. Quantitative and qualitative evaluation on various 3D shapes demonstrates the superiority of our P3G in both consistency and overall visual quality.


# 82
Strong Double Blind
JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

ChenHan Jiang · Yihan Zeng · Tianyang Hu · Xu Songcen · Wei Zhang · Hang Xu · Dit-Yan Yeung

Score Distillation Sampling (SDS) by well-trained 2D diffusion models has shown great promise in text-to-3D generation. However, this paradigm distills view-agnostic 2D image distributions into the rendering distribution of 3D representation for each view independently, overlooking the coherence across views and yielding 3D inconsistency in generations. In this work, we propose \textbf{J}oint \textbf{S}core \textbf{D}istillation (JSD), a new paradigm that ensures coherent 3D generations. Specifically, we model the joint image distribution, which introduces an energy function to capture the coherence among denoised images from the diffusion model. We then derive the joint score distillation on multiple rendered views of the 3D representation, as opposed to a single view in SDS. In addition, we instantiate three universal view-aware models as energy functions, demonstrating compatibility with JSD. Empirically, JSD significantly mitigates the 3D inconsistency problem in SDS, while maintaining text congruence. Moreover, we introduce the Geometry Fading scheme and Classifier-Free Guidance (CFG) Switching strategy to enhance generative details. Our framework, JointDreamer, establishes a new benchmark in text-to-3D generation, achieving outstanding results with an 88.5\% CLIP R-Precision and 27.7\% CLIP Score. These metrics demonstrate exceptional text congruence, as well as remarkable geometric consistency and texture fidelity.


# 100
Diverse Text-to-3D Synthesis with Augmented Text Embedding

Uy Tran · Minh N. Hoang Luu · Phong Nguyen · Khoi Nguyen · Binh-Son Hua

Text-to-3D synthesis has recently emerged as a new approach to sampling 3D models by adopting pretrained text-to-image models as guiding visual priors. An intriguing but underexplored problem with existing text-to-3D methods is that 3D models obtained from the sampling-by-optimization procedure tend to have mode collapses, and hence poor diversity in their results. In this paper, we provide an analysis and identify potential causes of such a limited diversity, which motivates us to devise a new method that considers the joint generation of different 3D models from the same text prompt. We propose to use augmented text prompts via textual inversion of reference images to diversify the joint generation. We show that our method leads to improved diversity in text-to-3D synthesis qualitatively and quantitatively.


# 308
Strong Double Blind
SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers

Mingrui Zhao · Yizhi Wang · Fenggen Yu · Changqing Zou · Ali Mahdavi-Amiri

Shape abstraction is an important task for simplifying complex geometric structures while retaining essential features. Sweep surfaces, commonly found in human-made objects, aid in this process by effectively capturing and representing object geometry, thereby facilitating abstraction. In this paper, we introduce SweepNet, a novel approach to shape abstraction through sweep surfaces. We propose an effective parameterization for sweep surfaces, utilizing superellipses for profile representation and B-spline curves for the axis. This compact representation, requiring as few as 14 floating-point numbers, facilitates intuitive and interactive editing while preserving shape details effectively. Additionally, by introducing a differentiable neural sweeper and an encoder-decoder architecture, we demonstrate the ability to predict swept volume representations without supervision. We show the superiority of our model through several quantitative and qualitative experiments throughout the paper.
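
To make the profile parameterization concrete, below is a standard superellipse curve of the kind mentioned above; how the 14 floats are actually split between the profile and the B-spline axis is the paper's design, so treat this as an illustrative sketch only:

```python
import numpy as np

def superellipse(a: float, b: float, n: float, num: int = 128) -> np.ndarray:
    """Sample a superellipse |x/a|^n + |y/b|^n = 1 as a 2D profile curve.

    n = 2 gives an ellipse, large n approaches a rectangle,
    small n gives increasingly 'pinched' shapes.
    """
    t = np.linspace(0.0, 2.0 * np.pi, num, endpoint=False)
    x = a * np.sign(np.cos(t)) * np.abs(np.cos(t)) ** (2.0 / n)
    y = b * np.sign(np.sin(t)) * np.abs(np.sin(t)) ** (2.0 / n)
    return np.stack([x, y], axis=-1)

# Sweeping such a profile along a B-spline axis (with optional per-point
# scaling) yields the kind of compact sweep-surface primitive SweepNet fits.
profile = superellipse(a=1.0, b=0.5, n=4.0)
```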


# 217
Strong Double Blind
CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches

Sifan Wu · Amir Hosein Khasahmadi · Mor Katz · Pradeep Kumar Jayaraman · Yewen Pu · Karl D.D. Willis · Bang Liu

Parametric Computer-Aided Design (CAD) is central to contemporary mechanical design. We harness the capabilities of pre-trained foundation models, renowned for their successes in natural language processing and computer vision, to develop generative models specifically for CAD. These models are adept at understanding complex geometries and design reasoning, a crucial advancement in CAD technology. In this paper, we propose CadVLM, an end-to-end vision language model for CAD generation. Our approach involves adapting pre-trained foundation models to manipulate engineering sketches effectively, integrating both sketch primitive sequences and sketch images. Extensive experiments demonstrate superior performance on multiple CAD sketch generation tasks such as CAD autocompletion, CAD autoconstraint, and image conditional generation. To our knowledge, this is the first instance of a multimodal Large Language Model (LLM) being successfully applied to parametric CAD generation, representing a pioneering step in the field of computer-aided mechanical design. The code is available at https://anonymous.4open.science/r/CadVLM.


# 240
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment

Mengting Chen · Xi Chen · Zhonghua Zhai · Chen Ju · Xuewen Hong · Jinsong Lan · Shuai Xiao

This paper introduces a novel framework for virtual try-on, termed Wear-Any-Way. Different from previous methods, Wear-Any-Way is “customizable”. Besides generating high-fidelity results, our method allows users to precisely control the wearing style. To achieve this goal, we first construct a strong pipeline supporting single/multiple garment try-on and model-to-model try-on in complicated scenarios. To make it manipulable, we propose sparse correspondence alignment and involve point-based control to guide the generation. Wear-Any-Way achieves state-of-the-art performance in the standard setting and provides a novel interaction form for customizing the wearing style. For instance, it allows users to drag the sleeve to make it rolled up, drag the coat to make it open, and use clicks to control the style of tuck. Wear-Any-Way enables more liberated and flexible expression of attire, which holds profound implications for the fashion industry.


# 310
Strong Double Blind
DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose

Yoshiyasu Yusuke · Leyuan Sun

This paper presents DiffSurf, a transformer-based denoising diffusion model for generating and reconstructing 3D surfaces. Specifically, we design a diffusion transformer architecture that predicts noise from noisy 3D surface vertices and normals. With this architecture, DiffSurf is able to generate 3D surface meshes in various poses and shapes, such as human bodies, hands, animals and man-made objects. Further, DiffSurf is versatile in that it can address various 3D downstream tasks including morphing, body shape variation and 3D human mesh fitting to 2D keypoints. Experimental results on 3D human model benchmarks demonstrate that DiffSurf can generate shapes with greater diversity and higher quality than previous generative models. Furthermore, when applied to the task of single-image 3D human mesh recovery, DiffSurf achieves accuracy comparable to prior techniques at a near real-time rate.


# 338
Strong Double Blind
Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling

Jaehyeok Kim · Dongyoon Wee · Dan Xu

This paper introduces Motion-oriented Compositional Neural Radiance Fields (MoCo-NeRF), a framework designed to perform free-viewpoint rendering of monocular human videos via a novel non-rigid motion modeling approach. In the context of dynamic clothed humans, complex cloth dynamics generate non-rigid motions that are intrinsically distinct from skeletal articulations and critically important for rendering quality. The conventional approach models non-rigid motions as spatial (3D) deviations in addition to skeletal transformations. However, this is either time-consuming or struggles to achieve optimal quality due to its high learning complexity without direct supervision. To target this problem, we propose a novel approach that models non-rigid motions as radiance residual fields, benefiting from more direct color supervision in the rendering and utilizing the rigid radiance fields as a prior to reduce the complexity of the learning process. Our approach utilizes a single multiresolution hash encoding (MHE) to concurrently learn the canonical T-pose representation from rigid skeletal motions and the radiance residual field for non-rigid motions. Additionally, to further improve both training efficiency and usability, we extend MoCo-NeRF to support simultaneous training of multiple subjects within a single framework, thanks to our effective design for modeling non-rigid motions. This scalability is achieved through the integration of a global MHE and learnable identity codes in addition to multiple local MHEs. We present extensive results on ZJU-MoCap and MonoCap, clearly demonstrating state-of-the-art performance in both single- and multi-subject settings. The code and model will be made publicly available.


# 299
Strong Double Blind
LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation

Archana Swaminathan · Anubhav Anubhav · Kamal Gupta · Shishira R Maiya · Vatsal Agarwal · Abhinav Shrivastava

Neural Radiance Fields (NeRFs) have revolutionized the reconstruction of static scenes and objects in 3D, offering unprecedented quality. However, extending NeRFs to model dynamic objects or object articulations remains a challenging problem. Previous works have tackled this issue by focusing on part-level reconstruction and motion estimation for objects, but they often rely on heuristics regarding the number of moving parts or object categories, which can limit their practical use. In this work, we introduce LEIA, a novel approach for representing dynamic 3D objects. Our method involves observing the object at distinct time steps or "states" and conditioning a hypernetwork on the current state, using this to parameterize our NeRF. This approach allows us to learn a view-invariant latent representation for each state. We further demonstrate that by interpolating between these states, we can generate novel articulation configurations in 3D space that were previously unseen. Our experimental results highlight the effectiveness of our method in articulating objects in a manner that is independent of the viewing angle and joint configuration. Notably, our approach outperforms previous methods that rely on motion information for articulation registration.


# 291
Strong Double Blind
Learned Neural Physics Simulation for Articulated 3D Human Pose Reconstruction

Mykhaylo Andriluka · Baruch Tabanpour · Daniel Freeman · Cristian Sminchisescu

We propose a novel neural network approach, LARP (Learned Articulated Rigid body Physics), to model the dynamics of articulated human motion with contact. Our goal is to develop a faster and more convenient methodological alternative to traditional physics simulators for use in computer vision tasks such as human motion reconstruction from video. To that end we introduce a training procedure and model components that support the construction of a recurrent neural architecture to accurately simulate articulated rigid body dynamics. Our neural architecture supports features typically found in traditional physics simulators, such as modeling of joint motors, variable dimensions of body parts, and contact between body parts and objects, and is an order of magnitude faster than traditional systems when multiple simulations are run in parallel. To demonstrate the value of LARP we use it as a drop-in replacement for a state-of-the-art classical non-differentiable simulator in an existing video-based reconstruction framework and show comparable or better 3D human pose reconstruction accuracy.


# 199
Strong Double Blind
Data Collection-free Masked Video Modeling

Yuchi Ishikawa · Masayoshi Kondo · Yoshimitsu Aoki

Pre-training video transformers generally requires a large amount of data, presenting significant challenges in terms of data collection costs and concerns related to privacy, licensing, and inherent biases. Synthesizing data is one of the promising ways to solve these issues, yet pre-training solely on synthetic data has its own challenges. In this paper, we introduce an effective self-supervised learning framework for videos that leverages readily available and less costly static images. Specifically, we define the Pseudo Motion Generator (PMG) module that recursively applies image transformations to generate pseudo-motion videos from images. These pseudo-motion videos are then leveraged in masked video modeling. Our approach is applicable to synthetic images as well, thus entirely freeing video pre-training from data collection costs and other concerns in real data. Through experiments in action recognition tasks, we demonstrate that this framework allows effective learning of spatio-temporal features through pseudo-motion videos, significantly improving over existing methods which also use static images and partially outperforming those using both real and synthetic videos. These results uncover fragments of what video transformers learn through masked video modeling.
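
A minimal sketch of the pseudo-motion idea, assuming the transformations are small random affine warps applied recursively to a single frame; the actual PMG transformation set and schedule are the paper's design:

```python
import torch
import torchvision.transforms.functional as TF

def pseudo_motion_clip(image: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """Generate a (T, C, H, W) pseudo-motion clip from one (C, H, W) image by
    recursively applying small random affine transforms, so that consecutive
    frames differ by a smooth, motion-like warp."""
    frames, frame = [image], image
    for _ in range(num_frames - 1):
        frame = TF.affine(
            frame,
            angle=float(torch.empty(1).uniform_(-2.0, 2.0)),
            translate=[int(torch.randint(-4, 5, (1,))), int(torch.randint(-4, 5, (1,)))],
            scale=float(torch.empty(1).uniform_(0.98, 1.02)),
            shear=[0.0],
        )
        frames.append(frame)
    return torch.stack(frames)  # masked video modeling then treats this as a video

clip = pseudo_motion_clip(torch.rand(3, 224, 224))
```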


# 298
Strong Double Blind
Vista3D: Unravel the 3D Darkside of a Single Image

Qiuhong Shen · Xingyi Yang · Michael Bi Mi · Xinchao Wang

We embark on the age-old quest: unveiling the hidden dimensions of objects from mere glimpses of their visible parts. To address this, we present \textbf{Vista3D}, a framework that realizes swift and consistent 3D generation within a mere 5 minutes. At the heart of Vista3D lies a two-phase approach: the coarse phase and the fine phase. In the coarse phase, we rapidly generate initial geometry with Gaussian Splatting from a single image. In the fine phase, we extract a Signed Distance Function (SDF) directly from the learned Gaussian Splatting, optimizing it with a differentiable isosurface representation. Furthermore, it elevates the quality of generation by using a disentangled representation with two independent implicit functions to capture both visible and obscured aspects of objects. Additionally, it harmonizes gradients from the 2D diffusion prior with 3D-aware diffusion priors by angular diffusion prior composition. Through extensive evaluation, we demonstrate that Vista3D effectively sustains a balance between the consistency and diversity of the generated 3D objects. We will make all code and results publicly available.


# 288
Diff-Reg: Diffusion Model in Doubly Stochastic Matrix Space for Registration Problem

Qianliang Wu · Haobo Jiang · Lei Luo · Jun Li · Yaqing Ding · Jin Xie · Jian Yang

Establishing reliable correspondences is essential for registration tasks such as 3D and 2D-3D registration. Existing methods commonly leverage geometric or semantic point features to generate potential correspondences. However, these features may face challenges such as large deformation, scale inconsistency, and ambiguous matching problems (e.g., symmetry). Additionally, many previous methods, which rely on single-pass prediction, may struggle with local minima in complex scenarios. To mitigate these challenges, we introduce a diffusion matching model for robust correspondence construction. Our approach treats correspondence estimation as a denoising diffusion process within the doubly stochastic matrix space, which gradually denoises (refines) a doubly stochastic matching matrix toward the ground-truth one for high-quality correspondence estimation. It involves a forward diffusion process that gradually introduces Gaussian noise into the ground-truth matching matrix and a reverse denoising process that iteratively refines the noisy matching matrix. In particular, the feature extraction from the backbone occurs only once during the inference phase, and our lightweight denoising module utilizes the same features at each reverse sampling step. Evaluation of our method on both 3D and 2D-3D registration tasks confirms its effectiveness. The code will be made available online.
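
For background, a doubly stochastic matrix has non-negative entries whose rows and columns each sum to one, and a common way to push a score matrix toward that set is Sinkhorn normalization, sketched below; the abstract does not state that this exact projection is used, so treat it as context rather than the paper's method:

```python
import torch

def sinkhorn(scores: torch.Tensor, num_iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    """Approximately project a non-negative score matrix onto the set of
    doubly stochastic matrices by alternating row/column normalization.

    scores: (N, N) matching scores (e.g., exp of feature similarities).
    """
    P = scores.clamp(min=eps)
    for _ in range(num_iters):
        P = P / (P.sum(dim=1, keepdim=True) + eps)   # rows sum to ~1
        P = P / (P.sum(dim=0, keepdim=True) + eps)   # columns sum to ~1
    return P

# toy usage: soft correspondences between two small point sets
sim = torch.rand(5, 5)
P = sinkhorn(sim.exp())
```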


# 293
NICP: Neural ICP for 3D Human Registration at Scale

Riccardo Marin · Enric Corona · Gerard Pons-Moll

Aligning a template to 3D human point clouds is a long-standing problem crucial for tasks like animation, reconstruction, and enabling supervised learning pipelines. Recent data-driven methods leverage predicted surface correspondences; however, they are not robust to varied poses, identities, or noise. In contrast, industrial solutions often rely on expensive manual annotations or multi-view capturing systems. Recently, neural fields have shown promising results. Still, their purely data-driven and extrinsic nature does not incorporate any guidance toward the target surface, often resulting in a trivial misalignment of the template registration. Currently, no method can be considered the standard for 3D Human registration, limiting the scalability of downstream applications. In this work, we propose NSR, a pipeline that, for the first time, generalizes and scales across thousands of shapes and more than ten different data sources. Our essential contribution is NICP, an ICP-style self-supervised task tailored to neural fields. NICP takes a few seconds, is self-supervised, and works out of the box on pre-trained neural fields. We combine it with a localized Neural Field trained on a large MoCap dataset. NSR achieves the state of the art over public benchmarks, and the release of its code and checkpoints will provide the community with a powerful tool useful for many downstream tasks like dataset alignments, cleaning, or asset animation.


# 279
Strong Double Blind
TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds

Dupont Elona · Kseniya Cherenkova · Dimitrios Mallis · Gleb A Gusev · Anis Kacem · Djamila Aouada

3D reverse engineering, in which a CAD model is inferred given a 3D scan of a physical object, is a research direction that offers many promising practical applications. This paper proposes TransCAD, an end-to-end transformer-based architecture that predicts the CAD sequence from a point cloud. TransCAD leverages the structure of CAD sequences by using a hierarchical learning strategy. A loop refiner is also introduced to regress sketch primitive parameters. Rigorous experimentation on the DeepCAD and Fusion360 datasets shows that TransCAD achieves state-of-the-art results. The result analysis is supported by a proposed metric for CAD sequences, the mean Average Precision of CAD Sequence, which addresses the limitations of existing metrics.


# 275
Strong Double Blind
EINet: Point Cloud Completion via Extrapolation and Interpolation

Pingping Cai · Canyu Zhang · LINGJIA SHI · Lili Wang · Nasrin Imanpour · Song Wang

Scanned point clouds are often sparse and incomplete due to the limited field of view of sensing devices, significantly impeding the performance of downstream applications. Therefore, the task of point cloud completion is introduced to obtain a dense and complete point cloud from the incomplete input. The fundamental challenges in tackling this task involve accurately inferring the missing shapes and upsampling them to higher densities. In this paper, we propose a novel approach to address this task, which formulates the completion task as a dual problem: a feature-wise extrapolation problem, where the shape features of the partial point cloud are extrapolated to outlier regions for the recovery of missing portions, and a feature-wise interpolation problem to achieve point cloud upsampling. Based on these, we propose the EINet, a new point cloud completion paradigm with a novel Extrapolation module that can predict the missing shapes for the partial point cloud and a newly designed Interpolation module to upsample the point cloud. Extensive evaluation results demonstrate that EINet achieves compelling performance compared to previous state-of-the-art methods.


# 276
DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction

YANLONG LI · Chamara Madarasingha · Kanchana Thilakarathna

Point cloud streaming is becoming increasingly popular, evolving into the norm for interactive service delivery and the future Metaverse. However, the substantial volume of data associated with point clouds presents numerous challenges, particularly in terms of high bandwidth consumption and large storage capacity. Despite various solutions proposed thus far, with a focus on point cloud compression, upsampling, and completion, these reconstruction-related methods continue to fall short in delivering high-fidelity point cloud output. As a solution, we propose DiffPMAE, an effective point cloud reconstruction architecture. Inspired by self-supervised learning concepts, we combine Masked Auto-Encoding and Diffusion Model mechanisms to remotely reconstruct point cloud data. By the nature of this reconstruction process, DiffPMAE can be extended to many related downstream tasks including point cloud compression, upsampling, and completion. Leveraging the ShapeNet-55 and ModelNet datasets with over 60,000 objects, we validate that the performance of DiffPMAE exceeds many state-of-the-art methods in terms of auto-encoding and the downstream tasks considered.


# 274
Strong Double Blind
Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning

Ray Zhang · Zheming Zhou · Min Sun · Omid Ghasemalizadeh · Cheng-Hao Kuo · Ryan M. Eustice · Maani Ghaffari Jadidi · Arnie Sen

This paper introduces a robust unsupervised SE(3) point cloud registration method that operates without requiring point correspondences. The method frames point clouds as functions in a reproducing kernel Hilbert space (RKHS), leveraging SE(3)-equivariant features for direct feature-space registration. A novel RKHS distance metric is proposed, offering reliable performance amidst noise, outliers, and asymmetrical data. An unsupervised training approach is introduced to effectively handle limited ground-truth data, facilitating adaptation to real datasets. The proposed method outperforms traditional supervised methods in terms of registration accuracy on both synthetic (ModelNet) and real-world (ETH-3D) noisy, outlier-rich datasets, marking the first instance of successful real RGB-D odometry data registration using an equivariant method. The code will be made available upon publication.
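
For background, representing a point cloud $X=\{x_i\}$ as the RKHS function $f_X(\cdot)=\sum_i k(\cdot, x_i)$ gives a closed-form squared distance between two clouds via the reproducing property; the paper's metric builds on this style of representation (with SE(3)-equivariant features), so the expression below is the generic form rather than their exact formulation:

$$
\lVert f_X - f_Y \rVert_{\mathcal{H}}^{2}
\;=\; \sum_{i,j} k(x_i, x_j) \;+\; \sum_{i,j} k(y_i, y_j) \;-\; 2\sum_{i,j} k(x_i, y_j).
$$

Registration then amounts to finding the rigid transform of $Y$ that minimizes this distance, without ever forming explicit point correspondences.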


# 269
Strong Double Blind
CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection

Jinhao Deng · Wei Ye · Hai Wu · Qiming Xia · Xun Huang · Xin Li · Jin Fang · Wei Li · Chenglu Wen · Cheng Wang

Point cloud data, representing the precise 3D layout of a scene, quickly drives the research of 3D object detection. However, a challenge arises due to the rapid iteration of 3D sensors, which leads to significantly different distributions in point clouds. This, in turn, results in subpar performance of 3D cross-sensor object detection. This paper introduces a Cross Mechanism Dataset, named CMD, to support research tackling this challenge. CMD is the first domain adaptation dataset comprehensively encompassing diverse mechanical sensors and various scenes for 3D object detection. In terms of sensors, CMD includes 32-beam LiDAR, 128-beam LiDAR, solid-state LiDAR, 4D millimeter-wave radar, and cameras, all of which are well-synchronized and calibrated. Regarding the scenes, CMD consists of 50 sequences collected from different scenarios, ranging from campuses to highways. Furthermore, we validated the effectiveness of various domain adaptation methods in mitigating sensor-based domain differences. We also proposed a DIG method to reduce domain disparities from the perspectives of Density, Intensity, and Geometry, which effectively bridges the domain gap between different sensors. The experimental results on the CMD dataset show that our proposed DIG method outperforms the state-of-the-art techniques, demonstrating the effectiveness of our baseline method. The dataset and the corresponding code are available at https://github.com/im-djh/CMD.


# 248
Strong Double Blind
Formula-Supervised Visual-Geometric Pre-training

Ryosuke Yamada · Kensho Hara · Hirokatsu Kataoka · Koshi Makihara · Nakamasa Inoue · Rio Yokota · Yutaka Satoh

Throughout the history of computer vision, while research has explored the integration of images (visual) and point clouds (geometric), many advancements in image and 3D object recognition have tended to process these modalities separately. We aim to bridge this divide by integrating images and point clouds on a unified transformer model. This approach integrates the modality-specific properties of images and point clouds and achieves fundamental downstream tasks in image and 3D object recognition on a unified transformer model by learning visual-geometric representations. In this work, we introduce Formula-Supervised Visual-Geometric Pre-training (FSVGP), a novel synthetic pre-training method that automatically generates aligned synthetic images and point clouds from mathematical formulas. Through cross-modality supervision, we enable supervised pre-training between visual and geometric modalities. FSVGP also reduces reliance on real data collection, cross-modality alignment, and human annotation. Our experimental results show that FSVGP pre-trains more effectively than VisualAtom and PC-FractalDB across six tasks: image and 3D object classification, detection, and segmentation. These achievements demonstrate FSVGP's superior generalization in image and 3D object recognition and underscore the potential of synthetic pre-training in visual-geometric representation learning.


# 267
Strong Double Blind
Canonical Shape Projection is All You Need for 3D Few-shot Class Incremental Learning

Ali Cheraghian · Zeeshan Hayder · Sameeea Ramasinghe · Shafin Rahman · Javad Jafaryahya · Lars Petersson · Mehrtash Harandi

In recent years, robust pre-trained foundation models have been successfully used in many downstream tasks. Here, we would like to use such powerful models to address the problem of few-shot class incremental learning (FSCIL) tasks on 3D point cloud objects. Our approach is to reprogram the well-known CLIP-based foundation model (trained on 2D images and text pairs) for this purpose. The CLIP model works by ingesting 2D images, so to leverage it in our context, we project the 3D object point cloud onto 2D image space to create proper depth maps. For this, prior works consider a fixed and non-trainable set of camera poses. In contrast, we propose to train the network to find a projection that best describes the object and is appropriate for extracting 2D image features from the CLIP vision encoder. Directly using the generated depth map is not suitable for the CLIP model, so we apply the model reprogramming paradigm to the depth map to augment the foreground and background to adapt it. This removes the need for modification or fine-tuning of the foundation model. In the setting we have investigated, we have limited access to data from novel classes, resulting in a problem with overfitting. Here, we address this problem via the use of a prompt engineering approach using multiple GPT-generated text descriptions. Our method, C3PR, successfully outperforms existing FSCIL methods on ModelNet, ShapeNet, ScanObjectNN, and CO3D datasets.


# 285
Strong Double Blind
Raising the Ceiling: Conflict-Free Local Feature Matching with Dynamic View Switching

Xiaoyong Lu · Songlin Du

Current feature matching methods prioritize improving modeling capabilities to better align outputs with ground-truth matches, which are the theoretical upper bound on matching results, metaphorically depicted as the “ceiling”. However, these enhancements fail to address the underlying issues that directly hinder ground-truth matches, including the scarcity of matchable points in small-scale images, matching conflicts in dense methods, and the reliance on keypoint repeatability in sparse methods. We propose a novel feature matching method named RCM, which Raises the Ceiling of Matching from three aspects. 1) RCM introduces a dynamic view switching mechanism to address the scarcity of matchable points in source images by strategically switching image pairs. 2) RCM proposes a conflict-free coarse matching module, addressing matching conflicts in the target image through a many-to-one matching strategy. 3) By integrating the semi-sparse paradigm and the coarse-to-fine architecture, RCM preserves the benefits of both high efficiency and global search, mitigating the reliance on keypoint repeatability. As a result, RCM enables more matchable points in the source image to be matched in an exhaustive and conflict-free manner in the target image, leading to a substantial 260% increase in ground-truth matches. Comprehensive experiments show that RCM exhibits remarkable performance and efficiency in comparison to state-of-the-art methods.


# 292
DGD: Dynamic 3D Gaussians Distillation

Isaac Labe · Noam Issachar · Itai Lang · Sagie Benaim

We tackle the task of learning dynamic 3D semantic radiance fields given a single monocular video as input. Our learned semantic radiance field captures per-point semantics as well as color and geometric properties for a dynamic 3D scene, enabling the generation of novel views and their corresponding semantics. This enables the segmentation and tracking of a diverse set of 3D semantic entities, specified using a simple and intuitive interface that includes a user click or a text prompt. To this end, we present DGD, a unified 3D representation for both the appearance and semantics of a dynamic 3D scene, building upon the recently proposed dynamic 3D Gaussians representation. Our representation is optimized over time with both color and semantic information. Key to our method is the joint optimization of the appearance and semantic attributes, which jointly affect the geometric properties of the scene. We evaluate our approach in its ability to enable dense semantic 3D object tracking and demonstrate high-quality results that are fast to render, for a diverse set of scenes. Our code will be made available upon acceptance.


# 295
Strong Double Blind
SHIC: Shape-Image Correspondences with no Keypoint Supervision

Aleksandar Shtedritski · Christian Rupprecht · Andrea Vedaldi

Canonical surface mapping is a generalization of keypoint detection with the goal of assigning each pixel of an object to a corresponding point in a 3D template. Popularised by DensePose for the analysis of humans, authors have since attempted to apply the concept to more categories, but with limited success due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision which achieves better results than supervised methods for most categories. Our idea is to leverage foundation computer vision models such as DINO and Stable Diffusion that are open-ended and thus possess excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from the foundation models. The reduction works by matching images of the object to non-photorealistic renders of the template, which emulates the process of collecting manual annotations for this task. These correspondences are then used to supervise high-quality canonical maps for any object of interest. We also show that image generators can further improve the realism of the template views, which provide an additional source of supervision for the model.
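
A minimal sketch (not the paper's implementation) of the core reduction: image patches of the object are matched to patches of rendered template views by cosine similarity of foundation-model features, and each match inherits the template patch's canonical surface coordinate. The feature dimension, the `tpl_uv` coordinates, and the random tensors below are placeholders.

```python
import torch

def match_image_to_template(img_feats, tpl_feats, tpl_uv):
    """Assign each image patch to a template surface point via feature matching.

    img_feats: (N, D) patch features of the object image (e.g. from DINO).
    tpl_feats: (M, D) patch features of non-photorealistic template renders.
    tpl_uv:    (M, 2) canonical surface coordinates of each template patch.
    Returns (N, 2) canonical coordinates and (N,) matching confidences.
    """
    img = torch.nn.functional.normalize(img_feats, dim=-1)
    tpl = torch.nn.functional.normalize(tpl_feats, dim=-1)
    sim = img @ tpl.T                      # (N, M) cosine similarities
    conf, idx = sim.max(dim=-1)            # best template patch per image patch
    return tpl_uv[idx], conf

# Toy usage with random features standing in for foundation-model descriptors.
img_feats = torch.randn(64, 384)
tpl_feats = torch.randn(256, 384)
tpl_uv = torch.rand(256, 2)
uv, conf = match_image_to_template(img_feats, tpl_feats, tpl_uv)
print(uv.shape, conf.shape)
```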


# 289
Strong Double Blind
LineFit: A Geometric Approach for Fitting Line Segments in Images

Marion BOYER · David Youssefi · Florent Lafarge

We present LineFit, an algorithm that fits line segments from a predicted image gradient map. While existing detectors aim at capturing line segments on line-like structures, our algorithm also seeks to approximate curved shapes. This particularity is interesting for addressing vectorization problems with edge-based representations, after connecting the detected line segments. Our algorithm measures and optimizes the quality of a line segment configuration globally as a point-to-line fitting problem. The quality of configurations is measured through the local fitting error, the completeness over the image gradient map and the capacity to preserve geometric regularities. A key ingredient of our work is an efficient and scalable exploration mechanism that refines an initial configuration by ordered sequences of geometric operations. We show the potential of our algorithm when combined with recent deep image gradient predictions and its competitiveness against existing detectors on different datasets, especially when scenes contain curved objects. We also demonstrate the benefit of our algorithm for polygonalizing objects.


# 282
Strong Double Blind
Global Structure-from-Motion Revisited

Linfei Pan · Daniel Barath · Marc Pollefeys · Johannes L Schönberger

Reconstructing 3D structure and camera motion from images has been a long-standing focus of computer vision research and is commonly referred to as Structure-from-Motion (SfM). Solutions to this problem are categorized into two main approaches: incremental and global. While the most popular systems follow the incremental paradigm due to superior accuracy and robustness, global approaches are drastically more scalable and efficient. We revisit the problem of global SfM and propose GLOMAP as a new general-purpose system. In terms of accuracy and robustness, we achieve results comparable to COLMAP, the most widely used incremental SfM, while being orders of magnitude faster. GLOMAP significantly outperforms state-of-the-art global SfM (Theia, OpenMVG). The code will be made available as open-source.


# 151
Strong Double Blind
Robust Fitting on a Gate Quantum Computer

Frances Yang · Michele Sasdelli · Tat-Jun Chin

Gate quantum computers generate significant interest due to their potential to solve certain difficult problems such as prime factorization in polynomial time. Computer vision researchers have long been attracted to the power of quantum computers. Robust fitting, which is fundamentally important to many computer vision pipelines, has recently been shown to be amenable to gate quantum computing. The previously proposed solution computes the Boolean influence as a measure of outlyingness using the Bernstein-Vazirani quantum circuit. However, that method assumed a quantum implementation of an $\ell_\infty$ feasibility test, which had not been demonstrated. In this paper, we take a big stride towards quantum robust fitting: we propose a quantum circuit to solve the $\ell_\infty$ feasibility test in the 1D case, which allows us to demonstrate, for the first time, quantum robust fitting on a real gate quantum computer, the IonQ Aria. We also show how 1D Boolean influences can be accumulated to compute Boolean influences for higher-dimensional non-linear models, which we experimentally validate on real benchmark datasets.
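
For intuition, here is a purely classical reference for the two ingredients the abstract names, under assumed notation: the 1D $\ell_\infty$ feasibility test for a linear model $b_i \approx a_i x$ (an interval-intersection check), and a Monte-Carlo estimate of the Boolean influence of each point, i.e., how often toggling that point flips feasibility over random subsets. The quantum circuit itself is not reproduced here.

```python
import numpy as np

def linf_feasible_1d(a, b, eps):
    """Does some scalar x satisfy |a_i * x - b_i| <= eps for every i?
    Each constraint defines an interval for x; feasibility is a non-empty intersection."""
    lo, hi = -np.inf, np.inf
    for ai, bi in zip(a, b):
        if ai == 0.0:
            if abs(bi) > eps:
                return False
            continue
        l, h = (bi - eps) / ai, (bi + eps) / ai
        if ai < 0:
            l, h = h, l
        lo, hi = max(lo, l), min(hi, h)
    return lo <= hi

def boolean_influence(a, b, eps, i, n_samples=500, seed=0):
    """Monte-Carlo estimate of the Boolean influence of point i: the fraction of
    random subsets whose feasibility flips when point i is toggled in/out."""
    rng = np.random.default_rng(seed)
    n, flips = len(a), 0
    for _ in range(n_samples):
        mask = rng.integers(0, 2, size=n).astype(bool)
        with_i, without_i = mask.copy(), mask.copy()
        with_i[i], without_i[i] = True, False
        flips += int(linf_feasible_1d(a[with_i], b[with_i], eps)
                     != linf_feasible_1d(a[without_i], b[without_i], eps))
    return flips / n_samples

# Toy usage: one gross outlier should have a much higher influence than the inliers.
a = np.ones(8)
b = np.array([1.0, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 5.0])  # last point is an outlier
print([round(boolean_influence(a, b, eps=0.1, i=i), 3) for i in range(8)])
```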


# 290
The Nerfect Match: Exploring NeRF Features for Visual Localization

Qunjie Zhou · Maxim Maximov · Or Litany · Laura Leal-Taixe

In this work, we propose the use of Neural Radiance Fields (NeRF) as a scene representation for visual localization. Recently, NeRF has been employed to enhance pose regression and scene coordinate regression models by augmenting the training database, providing auxiliary supervision through rendered images, or serving as an iterative refinement module. We extend its recognized advantages -- its ability to provide a compact scene representation with realistic appearances and accurate geometry -- by exploring the potential of NeRF's internal features in establishing precise 2D-3D matches for localization. To this end, we conduct a comprehensive examination of NeRF's implicit knowledge, acquired through view synthesis, for matching under various conditions. This includes exploring different matching network architectures, extracting encoder features at multiple layers, and varying training configurations. Significantly, we introduce NeRFMatch, an advanced 2D-3D matching function that capitalizes on the internal knowledge of NeRF learned via view synthesis. Our evaluation of NeRFMatch on standard localization benchmarks, within a structure-based pipeline, sets a new state-of-the-art for localization performance on Cambridge Landmarks.


# 294
Strong Double Blind
A Cephalometric Landmark Regression Method based on Dual-encoder for High-resolution X-ray Image

Chao Dai · Wang Yang · Chaolin Huang · zhou jiakai · Qilin Xu · Minpeng Xu

Accurate detection of cephalometric landmarks is crucial for orthodontic diagnosis and treatment planning. Current methods rely on a cascade of multiple models to achieve higher accuracy, which greatly complicates both training and deployment. In this paper, we introduce a novel regression paradigm capable of simultaneously detecting all cephalometric landmarks in high-resolution X-ray images. Our approach utilizes only the transformer's encoder module to design a dual-encoder architecture, enabling precise detection of cephalometric landmark positions from coarse to fine. Specifically, the entire model architecture comprises three main components: a feature extractor module, a reference encoder module, and a finetune encoder module. These components are respectively responsible for feature extraction and fusion for X-ray images, coarse localization of cephalometric landmarks, and fine-tuning of cephalometric landmark positions. Notably, our framework is fully end-to-end differentiable and innately learns to exploit the interdependencies among cephalometric landmarks. Experiments demonstrate that our method significantly surpasses the current state-of-the-art methods in Mean Radial Error (MRE) and the 2mm Success Detection Rate (SDR) metrics, while also reducing computational resource consumption. Our code will be available soon.


# 280
FoundPose: Unseen Object Pose Estimation with Foundation Features

Evin Pınar Örnek · Yann Labbé · Bugra Tekin · Lingni Ma · Cem Keskin · Christian Forster · Tomas Hodan

We propose FoundPose, a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models, without requiring any object- or task-specific training. In contrast, existing methods typically pre-train on large-scale, task-specific datasets in order to generalize to new objects and to bridge the image-to-model domain gap. We demonstrate that such generalization capabilities can be observed in a recent vision foundation model trained in a self-supervised manner. Specifically, our method estimates the object pose from image-to-model 2D-3D correspondences, which are established by matching patch descriptors from the recent DINOv2 model between the image and pre-rendered object templates. We find that reliable correspondences can be established by kNN matching of patch descriptors from an intermediate DINOv2 layer. Such descriptors carry stronger positional information than descriptors from the last layer, and we show their importance when semantic information is ambiguous due to object symmetries or a lack of texture. To avoid establishing correspondences against all object templates, we develop an efficient template retrieval approach that integrates the patch descriptors into a bag-of-words representation and can promptly propose a handful of similar-looking templates. Additionally, we apply featuremetric alignment to compensate for discrepancies in the 2D-3D correspondences caused by coarse patch sampling. The resulting method noticeably outperforms existing RGB methods for refinement-free pose estimation on the standard BOP benchmark with seven diverse datasets, and can be seamlessly combined with an existing render-and-compare refinement method to achieve RGB-only state-of-the-art results. Source code will be available upon acceptance.
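
A rough sketch, not the paper's pipeline, of the correspondence step described above: brute-force nearest-neighbour matching of patch descriptors against a template's descriptors yields 2D-3D correspondences that are fed to PnP-RANSAC. The random descriptors, pixel coordinates, and intrinsics `K` are stand-ins; template retrieval and featuremetric alignment are omitted.

```python
import numpy as np
import cv2

def knn_2d3d_correspondences(query_desc, query_px, tpl_desc, tpl_xyz):
    """Match image patch descriptors to template patch descriptors (brute-force,
    k=1, cosine similarity) and return 2D-3D correspondences for pose estimation.

    query_desc: (N, D) descriptors of image patches, query_px: (N, 2) pixel centres.
    tpl_desc:   (M, D) descriptors of template patches, tpl_xyz: (M, 3) model points.
    """
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    t = tpl_desc / np.linalg.norm(tpl_desc, axis=1, keepdims=True)
    nn = (q @ t.T).argmax(axis=1)          # nearest template patch per image patch
    return query_px.astype(np.float64), tpl_xyz[nn].astype(np.float64)

# Toy usage: random descriptors/geometry, then PnP-RANSAC on the matches.
rng = np.random.default_rng(0)
pts2d, pts3d = knn_2d3d_correspondences(
    rng.normal(size=(100, 384)), rng.uniform(0, 640, size=(100, 2)),
    rng.normal(size=(500, 384)), rng.uniform(-0.1, 0.1, size=(500, 3)))
K = np.array([[600.0, 0, 320.0], [0, 600.0, 240.0], [0, 0, 1.0]])
ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
print(ok)
```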


# 219
Strong Double Blind
PoseSOR: Human Pose Can Guide Our Attention

Huankang Guan · Rynson W.H. Lau

Salient Object Ranking (SOR) aims to study how humans shift their attention among various objects within a scene. Previous works attempt to excavate explicit visual saliency cues, e.g., spatial frequency and semantic context, to tackle this challenge. However, these visual saliency cues may fall short in handling real-world scenarios, which often involve various human activities and interactions. We observe that human observers' attention can be reflexively guided by the poses and gestures of the people in the scene, which indicate their activities. For example, observers tend to shift their attention to follow others' head orientation or running/walking direction to anticipate what will happen. Inspired by this observation, we propose to explore the human skeletal pose to deeply understand high-level interactions between human participants and their surroundings for robust salient object ranking. Specifically, we propose PoseSOR, a human pose-aware model for the SOR task, with two novel modules: 1) a Pose-Aware Interaction (PAI) module to integrate human pose knowledge into salient object queries for learning high-level interactions, and 2) a Pose-Driven Ranking (PDR) module to apply pose knowledge as directional cues to help predict where human attention will shift. To our knowledge, our approach is the first to explore human pose for salient object ranking. Extensive experiments demonstrate the effectiveness of our method for SOR, particularly on complex scenes, and our model sets the new state-of-the-art on the SOR benchmarks. The code will be made publicly available.


# 258
A Graph-Based Approach for Category-Agnostic Pose Estimation

Or Hirschorn · Shai Avidan

Traditional 2D pose estimation models are limited by their category-specific design, making them suitable only for predefined object categories. This restriction becomes particularly challenging when dealing with novel objects due to the lack of relevant training data. To address this limitation, category-agnostic pose estimation (CAPE) was introduced. CAPE aims to enable keypoint localization for arbitrary object categories using a few-shot single model, requiring minimal support images with annotated keypoints. We present a significant departure from conventional CAPE techniques, which treat keypoints as isolated entities, by treating the input pose data as a graph. We leverage the inherent geometrical relations between keypoints through a graph-based network to break symmetry, preserve structure, and better handle occlusions. We validate our approach on the MP-100 benchmark, a comprehensive dataset comprising over 20,000 images spanning over 100 categories. Our solution boosts performance by 0.98% under a 1-shot setting, achieving a new state-of-the-art for CAPE. Additionally, we enhance the dataset with skeleton annotations. Our code and data are publicly available.
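
As a hedged illustration of "treating the input pose data as a graph" (not the authors' architecture), the layer below propagates keypoint features along a skeleton adjacency matrix so each keypoint aggregates information from its neighbours.

```python
import torch
import torch.nn as nn

class KeypointGraphLayer(nn.Module):
    """One graph-convolution step over keypoint features: treating keypoints as a
    graph (rather than isolated points) lets each one see its skeleton neighbours."""
    def __init__(self, dim, adjacency):
        super().__init__()
        # Row-normalised adjacency with self-loops: A_hat = D^-1 (A + I)
        a = adjacency + torch.eye(adjacency.shape[0])
        self.register_buffer("a_hat", a / a.sum(dim=1, keepdim=True))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                  # x: (batch, num_keypoints, dim)
        return torch.relu(self.proj(self.a_hat @ x))

# Toy usage: a 5-keypoint chain skeleton.
adj = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    adj[i, j] = adj[j, i] = 1.0
layer = KeypointGraphLayer(dim=64, adjacency=adj)
print(layer(torch.randn(2, 5, 64)).shape)   # -> torch.Size([2, 5, 64])
```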


# 260
Strong Double Blind
3DSA:Multi-View 3D Human Pose Estimation With 3D Space Attention Mechanisms

Po Han Chen · Chia-Chi Tsai

In this study, we introduce the 3D space attention module (3DSA) as a novel approach to address a drawback of multi-view 3D human pose estimation methods, which fail to recognize an object's significance from diverse viewpoints. Specifically, we utilize a 3D space subdivision algorithm to divide the feature volume into multiple regions. Predicted 3D space attention scores are assigned to the different regions to construct the feature volume with space attention. The purpose of the 3D space attention module is to distinguish the significance of individual regions within the feature volume by applying weighted attention adjustments derived from the corresponding viewpoints. We conduct experiments on existing voxel-based methods, VoxelPose and Faster VoxelPose. By incorporating the space attention module, both achieve state-of-the-art performance on the Panoptic 3D human pose estimation benchmark.
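
A toy sketch of the region-weighting idea, under assumptions (equal-size sub-cubes, a softmax over predicted scores); it is not the paper's module, but it shows how per-region attention scores can rescale a voxel feature volume.

```python
import torch

def apply_space_attention(feature_volume, region_scores, regions_per_axis=2):
    """Reweight a voxel feature volume with per-region attention scores.

    feature_volume: (C, X, Y, Z) fused multi-view feature volume.
    region_scores:  (regions_per_axis**3,) predicted score per sub-region.
    The volume is split into equal sub-cubes and each is scaled by its score.
    """
    c, x, y, z = feature_volume.shape
    r = regions_per_axis
    out = feature_volume.clone()
    scores = torch.softmax(region_scores, dim=0)
    idx = 0
    for i in range(r):
        for j in range(r):
            for k in range(r):
                xs = slice(i * x // r, (i + 1) * x // r)
                ys = slice(j * y // r, (j + 1) * y // r)
                zs = slice(k * z // r, (k + 1) * z // r)
                out[:, xs, ys, zs] = feature_volume[:, xs, ys, zs] * scores[idx]
                idx += 1
    return out

# Toy usage on a small volume.
weighted = apply_space_attention(torch.randn(32, 8, 8, 8), torch.randn(8))
print(weighted.shape)
```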


# 257
Strong Double Blind
HPE-Li: WiFi-enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation

Gian Toan D. · Tien Dac Lai · Thien Van Luong · Kok-Seng Wong · Van-Dinh Nguyen

WiFi-based human pose estimation (HPE) has emerged as a promising alternative to conventional vision-based techniques, yet it faces high computational costs that hinder its widespread adoption. This paper introduces HPE-Li, a novel approach that harnesses multi-modal sensors (e.g., camera and WiFi) to generate accurate 3D skeletons for HPE. We then develop an efficient deep neural network to process raw WiFi signals. Our model incorporates a distinctive multi-branch convolutional neural network (CNN) empowered by a selective kernel attention (SKA) mechanism. Unlike standard CNNs with fixed receptive fields, the SKA mechanism is capable of dynamically adjusting kernel sizes according to input data characteristics, enhancing adaptability without increasing complexity. Extensive experiments conducted on the MM-Fi and WiPose datasets underscore the superiority of our method over state-of-the-art approaches, while ensuring minimal computational overhead, rendering it highly suitable for large-scale scenarios.
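
The snippet below is a generic selective-kernel attention block in the spirit of SKNet, offered as an assumption of what an SKA mechanism could look like for 1-D WiFi features; the branch kernel sizes, reduction ratio, and CSI shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class SelectiveKernel1D(nn.Module):
    """Two parallel 1-D conv branches with different kernel sizes, fused by
    softmax attention over the branches. Input: (batch, channels, time)."""
    def __init__(self, channels, kernels=(3, 7), reduction=4):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernels])
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels * len(kernels)))
        self.num_branches = len(kernels)

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, T)
        pooled = feats.sum(dim=1).mean(dim=-1)                     # (B, C) descriptor
        attn = self.fc(pooled).view(x.shape[0], self.num_branches, x.shape[1], 1)
        attn = torch.softmax(attn, dim=1)                          # weights over branches
        return (feats * attn).sum(dim=1)                           # (B, C, T)

# Toy usage: 90 CSI subcarrier channels over 100 time steps.
print(SelectiveKernel1D(90)(torch.randn(4, 90, 100)).shape)
```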


# 259
Strong Double Blind
HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation

WENCAN CHENG · Eun-Ji Kim · Jong Hwan Ko

The extraction of keypoint positions from input hand frames, known as 3D hand pose estimation, is crucial for various human-computer interaction applications. However, current approaches often struggle with the dynamic nature of self-occlusion of hands and intra-occlusion with interacting objects. To address this challenge, this paper proposes the Denoising Adaptive Graph Transformer, HandDAGT, for hand pose estimation. The proposed HandDAGT leverages a transformer structure to thoroughly explore effective geometric features from input patches. Additionally, it incorporates a novel attention mechanism to adaptively weigh the contribution of kinematic correspondence and local geometric features for the estimation of specific keypoints. This attribute enables the model to adaptively employ kinematic and local information based on the occlusion situation, enhancing its robustness and accuracy. Furthermore, we introduce a novel denoising training strategy aimed at improving the model's robust performance in the face of occlusion challenges. Experimental results show that the proposed model significantly outperforms the existing methods on four challenging hand pose benchmark datasets. Codes and pre-trained models are publicly available at https://anonymous.4open.science/r/HandDAGT-C313.


# 261
WHAC: World-grounded Humans and Cameras

Wanqi Yin · Zhongang Cai · Chen Wei · Fanzhou Wang · Ruisi Wang · Haiyi Mei · Weiye Xiao · Zhitao Yang · Qingping Sun · Atsushi yamashita · Ziwei Liu · Lei Yang

Estimating human and camera trajectories with accurate scale in the world coordinate system from a monocular video is a highly desirable yet challenging and ill-posed problem. In this study, we aim to recover expressive parametric human models (i.e., SMPL-X) and corresponding camera poses jointly, by leveraging the synergy between three critical players: the world, the human, and the camera. Our approach is founded on two key observations. Firstly, camera-frame SMPL-X estimation methods readily recover absolute human depth. Secondly, human motions inherently provide absolute spatial cues. By integrating these insights, we introduce a novel framework, referred to as WHAC, to facilitate world-grounded expressive human pose and shape estimation (EHPS) alongside camera pose estimation, without relying on traditional optimization techniques. Additionally, we present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras, and features diverse interactive human motions as well as realistic camera trajectories. Extensive experiments on both standard and newly established benchmarks highlight the superiority and efficacy of our framework. We will make the code and dataset publicly available at https://wqyin.github.io/projects/WHAC/.


# 255
Strong Double Blind
EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset

Amy Zhao · Chengcheng Tang · Lezi Wang · Yijing Li · Mihika Dave · Lingling Tao · Christopher D. Twigg · Robert Y. Wang

Accurate tracking of a user’s body pose while wearing a virtual reality (VR), augmented reality (AR) or mixed reality (MR) headset is a prerequisite for authentic self-expression, natural social presence, and intuitive user interfaces. Existing body tracking approaches on VR/AR devices are either under-constrained, e.g., attempting to infer full body pose from only headset and controller pose, or require impractical hardware setups that place cameras far from a user’s face to improve body visibility. In this paper, we present the first controllerless egocentric body tracking solution that runs on an actual VR device using the same cameras that are used for SLAM tracking. We propose a novel egocentric tracking architecture that models the temporal history of body motion using multi-view latent features. Furthermore, we release the first large-scale real-image dataset for egocentric body tracking, EgoBody3M, with a realistic VR headset configuration and diverse subjects and motions. Benchmarks on the dataset show that our approach outperforms other state-of-the-art methods in both accuracy and smoothness of the resulting motion. We perform ablation studies on our model choices and demonstrate the method running in real time on a VR headset. Our dataset, with more than 30 hours of recordings and 3 million frames, will be made publicly available.


# 253
Strong Double Blind
3D Human Pose Estimation via Non-Causal Retentive Networks

Kaili Zheng · Feixiang Lu · Yihao Lv · Liangjun Zhang · Chenyi Guo · Ji Wu

Temporal dependencies are essential in 3D human pose estimation to mitigate depth ambiguity. Previous methods typically use a fixed-length sliding window to capture these dependencies. However, they treat past and future frames equally, ignoring the fact that relying on too many future frames increases the inference latency. In this paper, we present a 3D human pose estimation model based on Retentive Networks (RetNet) that incorporates temporal information by utilizing a large number of past frames and a few future frames. The Non-Causal RetNet (NC-RetNet) is designed to allow the originally causal RetNet to be aware of future information. Additionally, we propose a knowledge transfer strategy, i.e., training the model with a larger chunk size and using a smaller chunk size during inference, to reduce latency while maintaining comparable accuracy. Extensive experiments have been conducted on the Human3.6M and MPI-INF-3DHP datasets, and the results demonstrate that our method achieves state-of-the-art performance.


# 209
Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation

Yuanchen Ju · Kaizhe Hu · Guowei Zhang · Gu Zhang · Mingrun Jiang · Huazhe Xu

Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in the understanding of semantic correspondence among objects, which helps to naturally transfer the interaction experience of familiar objects to novel ones. Although robots lack such a reservoir of interaction experience, the vast availability of human videos on the Internet may serve as a resource, from which we extract an affordance memory of contact points. Inspired by the natural way humans think, we propose Robo-ABC: when confronted with unfamiliar objects that require generalization, the robot can acquire affordances by retrieving objects that share visual and semantic similarities from the memory, and then mapping the contact points of the retrieved objects to the new object. While such correspondence may present formidable challenges at first glance, recent research finds that it naturally arises from pre-trained diffusion models, enabling affordance mapping even across disparate categories. Through the Robo-ABC framework, robots can generalize to manipulate out-of-category objects in a zero-shot manner without any manual annotation, additional training, part segmentation, pre-coded knowledge, or viewpoint restrictions. Quantitatively, Robo-ABC significantly enhances the accuracy of visual affordance inference by a large margin of 28.7% compared to state-of-the-art (SOTA) end-to-end affordance models. We also conduct real-world experiments on cross-category object-grasping tasks and achieve a success rate of 85.7%, proving Robo-ABC's capacity for real-world tasks.


# 273
Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs

Georgy Perevozchikov · Nancy Mehta · Mahmoud Afifi · Radu Timofte

Modern smartphone camera quality heavily relies on the image signal processor (ISP) to enhance captured raw images, utilizing carefully designed modules to produce final output images encoded in a standard color space (e.g., sRGB). Neural-based end-to-end learnable ISPs offer promising advancements, potentially replacing traditional ISPs with their ability to adapt without requiring extensive tuning for each new camera model, as is often the case for nearly every module in traditional ISPs. However, the key challenge with recent learning-based ISPs is the need to collect large paired datasets for each distinct camera model, due to the influence of intrinsic camera characteristics on the formation of input raw images. This paper tackles this challenge by introducing a novel method for unpaired learning of raw-to-raw translation across diverse cameras. Specifically, we propose Rawformer, an unsupervised Transformer-based encoder-decoder method for raw-to-raw translation. It accurately maps raw images captured by a certain camera to the target camera, facilitating the generalization of learnable ISPs to new unseen cameras. Our method demonstrates superior performance on real camera datasets, achieving higher accuracy compared to previous state-of-the-art techniques and preserving a more robust correlation between the original and translated raw images. Code and models will be publicly available upon acceptance.


# 265
R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding

Qirui Wu · Sonia Raychaudhuri · Daniel Ritchie · Manolis Savva · Angel X Chang

We introduce the Reality-linked 3D Scenes (R3DS) dataset of synthetic 3D scenes mirroring the real-world scene arrangements from Matterport3D panoramas. Compared to prior work, R3DS has more complete and densely populated scenes with objects linked to real-world observations in panoramas. R3DS also provides an object support hierarchy, and matching object sets (e.g., same chairs around a dining table) for each scene. Overall, R3DS contains 19K objects represented by 3,784 distinct CAD models from over 100 object categories. We demonstrate the effectiveness of R3DS on the Panoramic Scene Understanding task. We find that: 1) training on R3DS enables better generalization; 2) support relation prediction trained with R3DS improves performance compared to heuristically calculated support; and 3) R3DS offers a challenging benchmark for future work on panoramic scene understanding.


# 262
Strong Double Blind
Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation

Jinfeng Liu · Lingtong Kong · Bo Li · Zerong Wang · Hong Gu · Jinwei Chen

Self-supervised monocular depth estimation has gathered notable interest since it can liberate training from dependency on depth annotations. In the monocular video training case, recent methods only conduct view synthesis between existing camera views, leading to insufficient guidance. To tackle this, we synthesize additional virtual camera views by flow-based video frame interpolation (VFI), termed temporal augmentation. For multi-frame inference, to sidestep the problem of dynamic objects encountered by explicit geometry-based methods like ManyDepth, we return to the feature fusion paradigm and design a VFI-assisted multi-frame fusion module to align and aggregate multi-frame features, using motion and occlusion information obtained by the flow-based VFI model. Finally, we construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth. In this framework, spatial data augmentation through image affine transformation is incorporated for data diversity, along with a triplet depth consistency loss for regularization. The single- and multi-frame models can share weights, making our framework compact and memory-efficient. Extensive experiments on several datasets demonstrate that our method can bring significant improvements to current advanced architectures without increasing inference complexity.


# 251
FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

Rajeev Yasarla · Manish Kumar Singh · Hong Cai · Yunxiao Shi · Jisoo Jeong · Yinhao Zhu · Shizhong Han · Risheek Garrepalli · Fatih Porikli

In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by learning to predict the future during training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to predict multi-frame features one time step ahead iteratively. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries to work with the depth decoder, as well as a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has latency similar to monocular models.
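
A minimal, hypothetical stand-in for the future-prediction idea (not the actual F-Net): a small convolutional head takes the features of a few consecutive frames and is trained with an L1 loss to predict the next frame's features.

```python
import torch
import torch.nn as nn

class ToyFutureNet(nn.Module):
    """Predict the feature map of the next frame from `history` previous frames."""
    def __init__(self, channels, history=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels * history, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, feats):              # feats: (B, T, C, H, W)
        b, t, c, h, w = feats.shape
        return self.net(feats.reshape(b, t * c, h, w))

# Training signal: predict frame t+1 features from frames t-2..t.
fnet = ToyFutureNet(channels=64, history=3)
feats = torch.randn(2, 4, 64, 32, 32)      # four consecutive frame feature maps
pred_next = fnet(feats[:, :3])
loss = torch.nn.functional.l1_loss(pred_next, feats[:, 3])
loss.backward()
print(loss.item())
```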


# 286
Möbius Transform for Mitigating Perspective Distortions in Representation Learning

Prakash Chandra Chhipa · Meenakshi Subhash Chippa · Kanjar De · Rajkumar Saini · Marcus Liwicki · Shah Mubarak

Perspective distortion (PD) causes unprecedented changes in shape, size, orientation, angles, and other spatial relationships of visual concepts in images. Precisely estimating camera intrinsic and extrinsic parameters is challenging, which prevents synthesizing perspective distortion directly. The lack of dedicated training data poses a critical barrier to developing robust computer vision methods. Additionally, distortion correction methods turn other computer vision tasks into multi-step pipelines and lack performance. In this work, we propose mitigating perspective distortion (MPD) by employing fine-grained parameter control over a specific family of Möbius transforms to model real-world distortion, without estimating camera intrinsic and extrinsic parameters and without the need for actual distorted data. We also present a dedicated perspectively distorted benchmark dataset, ImageNet-PD, to benchmark the robustness of deep learning models against this kind of distortion. The proposed method outperforms prior approaches on the existing ImageNet-E and ImageNet-X benchmarks. It also significantly improves performance on ImageNet-PD while performing consistently on the standard data distribution. Further, our method shows improved performance on three PD-affected real-world applications: crowd counting, fisheye image recognition, and person re-identification. We will release the source code, dataset, and models to foster further research.
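
For intuition, a Möbius transform can be applied to an image by treating centred pixel coordinates as complex numbers; the sketch below does this with `cv2.remap`. The specific coefficients `a, b, c, d` are arbitrary assumptions for illustration, not the paper's fine-grained parameterization.

```python
import numpy as np
import cv2

def mobius_warp(image, a=1 + 0j, b=0 + 0j, c=0.0004 + 0.0002j, d=1 + 0j):
    """Warp an image with a Möbius transform w = (a*z + b) / (c*z + d), where each
    pixel is the complex number z = x + i*y centred at the image midpoint.
    A small |c| gives a mild, perspective-like bending of straight lines."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    z = (xs - w / 2) + 1j * (ys - h / 2)
    wz = (a * z + b) / (c * z + d)          # sampling location for each output pixel
    map_x = (wz.real + w / 2).astype(np.float32)
    map_y = (wz.imag + h / 2).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REFLECT)

# Toy usage on a synthetic checkerboard.
img = ((np.indices((256, 256)).sum(axis=0) // 32) % 2 * 255).astype(np.uint8)
warped = mobius_warp(img)
print(warped.shape)
```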


# 271
Strong Double Blind
UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation

Jinho Park · Se Young Chun · Mingoo Seok

Data-driven visual-inertial odometry (VIO) has received attention for its performance, since VIO is a crucial component of autonomous robots. However, deployment on resource-constrained devices is non-trivial because large network parameters must be accommodated in device memory. Furthermore, these networks risk failure after deployment due to environmental distribution shifts at test time. In light of this, we propose UL-VIO -- an ultra-lightweight (<1M parameters) VIO network capable of test-time adaptation (TTA) based on visual-inertial consistency. Specifically, we compress the network while preserving the low-level encoder part, including all BatchNorm parameters, for resource-efficient test-time adaptation. It achieves a 36X smaller network size than the state of the art with a minute increase in error -- 1% on the KITTI dataset. For test-time adaptation, we propose using the inertia-referred network outputs as pseudo labels and updating only the BatchNorm parameters for lightweight yet effective adaptation. To the best of our knowledge, this is the first work to perform noise-robust TTA on VIO. Experimental results on the KITTI, EuRoC, and Marulan datasets demonstrate the effectiveness of our resource-efficient adaptation method under diverse TTA scenarios with dynamic domain shifts.
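
A hedged sketch of BatchNorm-only test-time adaptation with inertia-referred pseudo labels, as the abstract describes at a high level; the stand-in model, loss choice, and learning rate are assumptions, not the authors' setup.

```python
import torch
import torch.nn as nn

def make_bn_only_optimizer(model, lr=1e-4):
    """Freeze everything except BatchNorm weight/bias and return their optimizer."""
    bn_params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            bn_params += [m.weight, m.bias]
    for p in model.parameters():
        p.requires_grad_(False)
    for p in bn_params:
        p.requires_grad_(True)
    return torch.optim.Adam(bn_params, lr=lr)

def batchnorm_tta_step(model, optimizer, visual_input, inertial_pose):
    """One TTA step: treat the inertia-referred pose as a pseudo label and update
    only the BatchNorm affine parameters of the otherwise frozen network."""
    loss = torch.nn.functional.mse_loss(model(visual_input), inertial_pose)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a stand-in visual encoder predicting a 6-DoF pose.
model = nn.Sequential(nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Linear(64, 6))
opt = make_bn_only_optimizer(model)
print(batchnorm_tta_step(model, opt, torch.randn(8, 128), torch.randn(8, 6)))
```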


# 287
Strong Double Blind
DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences

Peidong Li · Wancheng Shen · Qihao Huang · Dixiao Cui

Camera-based Bird's-Eye-View (BEV) perception often struggles to choose between 3D-to-2D and 2D-to-3D view transformation (VT). The 3D-to-2D VT typically employs a resource-intensive Transformer to establish robust correspondences between 3D and 2D features, while the 2D-to-3D VT utilizes the Lift-Splat-Shoot (LSS) pipeline for real-time applications, potentially missing distant information. To address these limitations, we propose DualBEV, a unified framework that utilizes a shared feature transformation incorporating three probabilistic measurements for both strategies. By considering dual-view correspondences in one stage, DualBEV effectively bridges the gap between these strategies, harnessing their individual strengths. Our method achieves state-of-the-art performance without a Transformer, delivering efficiency comparable to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set. Code will be released.


# 243
HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras

Zhongyu Xia · ZhiWei Lin · Xinhao Wang · Yongtao Wang · Yun Xing · Shengxiang Qi · Nan Dong · Ming-Hsuan Yang

Three-dimensional perception from multi-view cameras is a crucial component in autonomous driving systems, which involves multiple tasks like 3D object detection and bird's-eye-view (BEV) semantic segmentation. To improve perception precision, large image encoders, high-resolution images, and long-term temporal inputs have been adopted in recent 3D perception models, bringing remarkable performance gains. However, these techniques are often incompatible in training and inference scenarios due to computational resource constraints. Besides, modern autonomous driving systems prefer to adopt an end-to-end framework for multi-task 3D perception, which can simplify the overall system architecture and reduce the implementation complexity. However, conflict between tasks often arises when optimizing multiple tasks jointly within an end-to-end 3D perception model. To alleviate these issues, we present an end-to-end framework named HENet for multi-task 3D perception in this paper. Specifically, we propose a hybrid image encoding network, using a large image encoder for short-term frames and a small image encoder for long-term temporal frames. Then, we introduce a temporal feature integration module based on the attention mechanism to fuse the features of different frames extracted by the two aforementioned hybrid image encoders. Finally, according to the characteristics of each perception task, we utilize BEV features of different grid sizes, independent BEV encoders, and task decoders for different tasks. Experimental results show that HENet achieves state-of-the-art end-to-end multi-task 3D perception results on the nuScenes benchmark, including 3D object detection and BEV semantic segmentation. The source code and models will be released to the public.


# 236
SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras

Yingqi Tang · Zhaotie Meng · Guoliang Chen · Erkang Cheng

The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird's Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once for token selection or query initialization. In this paper, we present a single model termed SimPB, which Simultaneously detects 2D objects in the Perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results, in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. We will make the code publicly available.


# 245
Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance

Kuan-Chih Huang · Yi-Hsuan Tsai · Ming-Hsuan Yang

Weakly supervised 3D object detection aims to learn a 3D detector with lower annotation cost, e.g., 2D labels. Unlike prior work which still relies on few accurate 3D annotations, we propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels. Specifically, we employ visual data from three perspectives to establish connections between 2D and 3D domains. First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions. Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations. Finally, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data. We conduct extensive experiments on the KITTI dataset to validate the effectiveness of the proposed three constraints. Without using any 3D labels, our method achieves favorable performance against state-of-the-art approaches and is competitive with the method that uses 500-frame 3D annotations. Code and models will be made publicly available.
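
The output-level constraint can be pictured as follows (an illustrative sketch, not the paper's loss): project the eight corners of an estimated 3D box into the image, take their bounding rectangle, and measure its overlap with the 2D label box. The intrinsics and box sizes below are made up.

```python
import numpy as np

def project_box3d_to_2d(corners_3d, K):
    """Project the 8 corners of a 3D box (camera coordinates) and take the
    axis-aligned bounding rectangle of the projections."""
    uvw = (K @ corners_3d.T).T                    # (8, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]
    return np.array([uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()])

def box_iou_2d(a, b):
    """IoU between two [x1, y1, x2, y2] boxes, usable as a consistency measure."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

# Toy usage: a 2m x 1.5m x 4m box roughly 10m in front of the camera vs. a 2D label.
x, y, z = np.meshgrid([-1, 1], [-0.75, 0.75], [8, 12], indexing="ij")
corners = np.stack([x, y, z], axis=-1).reshape(-1, 3)
K = np.array([[720.0, 0, 640.0], [0, 720.0, 360.0], [0, 0, 1.0]])
proj = project_box3d_to_2d(corners, K)
print(proj, box_iou_2d(proj, np.array([550.0, 300.0, 740.0, 430.0])))
```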


# 246
Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection

Deepti Hegde · Suhas Lohit · Kuan-Chuan Peng · Michael J. Jones · Vishal Patel

Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information about the geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flipping, rotation, and scene flow. For spatial augmentations, we find that, depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields the best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We show that our pre-training method for 3D object detection outperforms existing equivariant and invariant approaches in many settings.
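
As one concrete (assumed) instance of an equivariance-by-classification objective: apply one of a few discrete rotations to the input scan and train a head on pooled point features to classify which rotation was applied. The stand-in encoder and bin count are illustrative only.

```python
import math
import torch
import torch.nn as nn

def rotate_z(points, angle):
    """Rotate an (N, 3) point cloud about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    r = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ r.T

class RotationClassHead(nn.Module):
    """Pool per-point features and predict which of `num_bins` rotations was applied."""
    def __init__(self, feat_dim, num_bins=4):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_bins)

    def forward(self, point_feats):        # (B, N, feat_dim)
        return self.head(point_feats.mean(dim=1))

# Toy pre-training step with a stand-in per-point encoder.
encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))
head = RotationClassHead(128, num_bins=4)
points = torch.randn(2, 1024, 3)
labels = torch.randint(0, 4, (2,))                  # which rotation bin was applied
angles = labels.float() * (math.pi / 2)
rotated = torch.stack([rotate_z(points[i], angles[i].item()) for i in range(2)])
loss = nn.functional.cross_entropy(head(encoder(rotated)), labels)
loss.backward()
```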


# 250
Strong Double Blind
LiDAR-based All-weather 3D Object Detection via Prompting and Distilling 4D Radar

Yujeong Chae · HYEONSEONG KIM · Changgyoon Oh · Minseok Kim · Kuk-Jin Yoon

LiDAR-based 3D object detection models show remarkable performance; however, their effectiveness diminishes in adverse weather. On the other hand, 4D radar exhibits strengths in adverse weather but faces limitations in standalone use. While fusing LiDAR and 4D radar seems to be the most intuitive approach, it comes with limitations, including increased computational load due to radar pre-processing, situational constraints that require information from both domains to be present, and the potential loss of sensor advantages through joint optimization. In this paper, we propose a novel LiDAR-only-based 3D object detection framework that works robustly in all-weather (normal and adverse) conditions. Specifically, we first propose 4D radar-based 3D prompt learning to inject auxiliary radar information into a LiDAR-based pre-trained 3D detection model while preserving the precise geometry capabilities of LiDAR. Subsequently, using the preceding model as a teacher, we distill weather-insensitive features and responses into a LiDAR-only student model through four levels of inter-/intra-modal knowledge distillation. Extensive experiments demonstrate that our prompt learning effectively integrates the strengths of LiDAR and 4D radar, and our LiDAR-only student model even surpasses the detection performance of the teacher and state-of-the-art models under various weather conditions. Code will be released.


# 249
Strong Double Blind
SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather

Edoardo Palladin · Roland Dietze · Praveen Narayanan · Mario Bijelic · Felix Heide

Multimodal sensor fusion is an essential capability for autonomous robots, enabling object detection and decision-making in the presence of failing or uncertain inputs. While recent fusion methods excel in normal environmental conditions, they fail in adverse weather, e.g., heavy fog, snow, or obstructions due to soiling. To address these challenges, we introduce a novel multi-sensor fusion approach tailored to adverse weather conditions. In addition to the RGB and LiDAR sensors employed in recent autonomous driving literature, our sensor fusion stack is capable of learning from NIR gated camera and radar modalities to tackle low light and adverse weather conditions. We propose to fuse multimodal sensor data through attentive, depth-based blending schemes, with learned refinement in the Bird's Eye View (BEV) domain to combine image and range features. Our detections are predicted by a transformer decoder that weights modalities based on distance and visibility. We validate that our method improves the reliability of multimodal sensor fusion in autonomous vehicles under challenging weather conditions, bridging the gap between ideal conditions and real-world edge cases and improving average precision by 17.6 AP points over the second-best method in the pedestrian class under long-range dense fog conditions.


# 247
Strong Double Blind
Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception

Dingkang Yang · Ke Li · Dongling Xiao · Zedian Shao · Peng Sun · Liang Song

Collaborative perception has received widespread attention recently since it enhances the perception ability of autonomous vehicles via inter-agent information sharing. However, the performance of existing systems is hindered by the unavoidable collaboration noises, which induce feature-level spatial misalignment over the collaborator-shared information. In this paper, we propose a model-agnostic and lightweight plugin to mitigate the feature-level misalignment issue, called dynamic feature alignment (NEAT). The merits of the NEAT plugin are threefold. First, we introduce an importance-guided query proposal to predict potential foreground regions with space-channel semantics and exclude environmental redundancies. On this basis, a deformable feature alignment is presented to explicitly align the collaborator-shared features through query-aware spatial associations, aggregating multi-grained visual clues with corrective mismatch properties. Ultimately, we perform a region cross-attention reinforcement to facilitate aligned representation diffusion and achieve global feature semantic enhancement. NEAT can be readily inserted into existing collaborative perception procedures and significantly improves the robustness of vanilla baselines against pose errors and transmission delay. Extensive experiments on four collaborative 3D object detection datasets under noisy settings confirm that NEAT provides consistent gains for most methods with distinct structures.


# 241
SkyScenes: A Synthetic Dataset for Aerial Scene Understanding

Sahil Santosh Khose · Anisha Pal · Aayushi Agarwal · . Deepanshi · Judy Hoffman · Prithvijit Chattopadhyay

Real-world aerial scene understanding is limited by a lack of datasets that contain densely annotated images curated under a diverse set of conditions. Due to inherent challenges in obtaining such images in controlled real-world settings, we present SkyScenes, a synthetic dataset of densely annotated aerial images captured from Unmanned Aerial Vehicle (UAV) perspectives. We carefully curate SkyScenes images from CARLA to comprehensively capture diversity across layouts (urban and rural maps), weather conditions, times of day, pitch angles and altitudes with corresponding semantic, instance and depth annotations. Through our experiments using SkyScenes, we show that (1) models trained on SkyScenes generalize well to different real-world scenarios, (2) augmenting training on real images with SkyScenes data can improve real-world performance, (3) controlled variations in SkyScenes can offer insights into how models respond to changes in viewpoint conditions, and (4) incorporating additional sensor modalities (depth) can improve aerial scene understanding.


# 277
DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model

Li Xiaofan · Zhang Yifu · Xiaoqing Ye

With the surge in autonomous driving technologies, the reliance on comprehensive and high-definition bird's-eye-view (BEV) representations has become paramount. This burgeoning need underscores the demand for extensive multi-view video datasets, meticulously annotated to facilitate advanced research and development. Nonetheless, the acquisition of such datasets is impeded by the prohibitive costs associated with data collection and annotation. There are two challenges when synthesizing multi-view videos given a 3D layout: 1) Generating multi-view videos involves handling both the view and temporal dimensions; how can videos be generated while ensuring cross-view and cross-frame consistency? 2) How can the precision of layout control and the quality of the generated instances be ensured? Addressing this critical bottleneck, we introduce a novel spatial-temporal consistent diffusion framework, DrivingDiffusion, engineered to synthesize realistic multi-view videos governed by 3D spatial layouts. DrivingDiffusion adeptly navigates the dual challenges of maintaining cross-view and cross-frame coherence, along with meeting the exacting standards of layout fidelity and visual quality. The framework operates through a tripartite methodology: initiating with the generation of multi-view single-frame images, followed by the synthesis of single-view videos across multiple cameras, and culminating with a post-processing phase. We corroborate the efficacy of DrivingDiffusion through rigorous quantitative and qualitative evaluations, demonstrating its potential to significantly enhance autonomous driving tasks without incurring additional costs.


# 237
UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction

Lan Feng · Mohammadhossein Bahari · Kaouther Messaoud · Eloi Zablocki · MATTHIEU CORD · Alexandre ALahi

Vehicle trajectory prediction has increasingly relied on data-driven solutions, but their ability to scale to different data domains and the impact of larger dataset sizes on their generalization remain under-explored. While these questions can be studied by employing multiple datasets, it is challenging due to several discrepancies, e.g., in data formats, map resolution, and semantic annotation types. To address these challenges, we introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria, presenting new opportunities for the vehicle trajectory prediction field. In particular, using UniTraj, we conduct extensive experiments and find that model performance significantly drops when transferred to other datasets. However, enlarging data size and diversity can substantially improve performance, leading to a new state-of-the-art result for the nuScenes dataset. We provide insights into dataset characteristics to explain these findings. We will release the framework to support further research.


# 278
Strong Double Blind
VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

Yibo Liu · Zheyuan Yang · Guile Wu · Yuan REN · Kejian Lin · Liu Bingbing · Yang Liu · JINJUN SHAN

Generating 3D vehicle assets from in-the-wild observations is crucial to autonomous driving. Existing image-to-3D methods cannot well address this problem because they learn generation merely from image RGB information without a deeper understanding of in-the-wild vehicles (such as car models, manufacturers, etc). This leads to their poor zero-shot prediction capability to handle real-world observations with occlusion or tricky viewing angles. To solve this problem, in this work, we propose VQA-Diff, a novel framework that leverages in-the-wild vehicle images to create photorealistic 3D vehicle assets for autonomous driving. VQA-Diff exploits the real-world knowledge inherited from the Large Language Model in the Visual Question Answering (VQA) model for robust zero-shot prediction and the rich image prior knowledge in Diffusion Models for structure and appearance generation. In particular, we utilize a multi-expert Diffusion Models strategy to generate the structure information and employ a subject-driven structure-controlled generation mechanism to model appearance information. As a result, without the necessity to learn from a large-scale image-to-3D vehicle dataset collected from the real world, VQA-Diff still has a robust zero-shot image-to-novel-view generation ability. We conduct experiments on various datasets, including Pascal 3D+, Waymo, and Objaverse, to demonstrate that VQA-Diff outperforms existing state-of-the-art methods both qualitatively and quantitatively.


# 239
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

Guoqing Wang · Zhongdao Wang · Pin Tang · Jilai Zheng · Xiangxuan Ren · Bailan Feng · Chao Ma

Existing solutions for 3D semantic occupancy prediction typically treat the task as a one-shot 3D voxel-wise segmentation problem. These discriminative methods focus on learning the mapping between the inputs and the occupancy map in a single step, lacking both the ability to gradually refine the occupancy map and the imaginative capacity to plausibly complete local regions of the scene. In this paper, we introduce OccGen, a simple yet powerful generative perception model for the task of 3D semantic occupancy prediction. OccGen adopts a ``noise-to-occupancy'' generative paradigm, progressively inferring and refining the occupancy map by predicting and eliminating noise originating from a random Gaussian distribution. OccGen consists of two main components: a conditional encoder that is capable of processing multi-modal inputs, and a progressive refinement decoder that applies diffusion denoising using the multi-modal features as conditions. A key insight of this generative pipeline is that the diffusion denoising process is naturally able to model the coarse-to-fine refinement of the dense 3D occupancy map, therefore producing more detailed predictions. Extensive experiments on several occupancy benchmarks demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods. For instance, OccGen relatively enhances the mIoU by 9.5%, 6.3%, and 13.3% on the nuScenes-Occupancy dataset under the multi-modal, LiDAR-only, and camera-only settings, respectively. Moreover, as a generative perception model, OccGen exhibits desirable properties that discriminative models cannot achieve, such as providing uncertainty estimates alongside its multiple-step predictions. The code will be released.
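
A toy version of the "noise-to-occupancy" loop, purely for intuition and not the paper's decoder: starting from Gaussian noise over class logits, a conditional (and here untrained) network predicts and removes noise over a few refinement steps.

```python
import torch
import torch.nn as nn

class DenoisingStep(nn.Module):
    """One refinement step: predict the noise in the current occupancy logits,
    conditioned on fused multi-modal features concatenated along channels."""
    def __init__(self, classes, cond_channels):
        super().__init__()
        self.net = nn.Conv3d(classes + cond_channels, classes, 3, padding=1)

    def forward(self, noisy_occ, cond):
        return self.net(torch.cat([noisy_occ, cond], dim=1))

def progressive_refinement(cond, classes=17, steps=4):
    """Start from Gaussian noise and iteratively refine toward occupancy logits.
    `cond` is a stand-in for the multi-modal condition volume (B, C, X, Y, Z)."""
    step = DenoisingStep(classes, cond.shape[1])   # untrained, for shape illustration
    occ = torch.randn(cond.shape[0], classes, *cond.shape[2:])
    for _ in range(steps):
        predicted_noise = step(occ, cond)
        occ = occ - predicted_noise / steps        # simple fixed-schedule update
    return occ.argmax(dim=1)                       # (B, X, Y, Z) semantic labels

print(progressive_refinement(torch.randn(1, 64, 16, 16, 8)).shape)
```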


# 272
Stream Query Denoising for Vectorized HD-Map Construction

Shuo Wang · Fan Jia · Weixin Mao · Yingfei Liu · Yucheng Zhao · Zehui Chen · Tiancai Wang · Chi Zhang · Xiangyu Zhang · Feng Zhao

This paper introduces the Stream Query Denoising (SQD) strategy, a novel and general approach for high-definition map (HD-map) construction. SQD is designed to improve the modeling capability of map elements by learning temporal consistency. Specifically, SQD denoises queries generated from the noised ground truth of the previous frame, aiming to reconstruct the ground truth of the current frame during training. Our method can be applied to both static and temporal methods, showing the great effectiveness of the SQD strategy. Extensive experiments on nuScenes and Argoverse2 show that our framework achieves superior performance compared to other existing methods across all settings.
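
A minimal sketch of how stream denoising queries might be built from the previous frame's ground truth (the noise model and magnitudes are assumptions, not the paper's recipe): perturb each map element's polyline points, add a per-element drift, and train the decoder to reconstruct the current-frame ground truth from these queries.

```python
import torch

def make_denoising_queries(prev_gt_points, noise_std=0.5, drift_std=1.0):
    """Build denoising queries from the previous frame's ground-truth map elements.

    prev_gt_points: (num_elements, num_points, 2) BEV polyline vertices in metres.
    Returns noised polylines to be fed to the decoder as queries.
    """
    point_noise = torch.randn_like(prev_gt_points) * noise_std
    element_drift = torch.randn(prev_gt_points.shape[0], 1, 2) * drift_std
    return prev_gt_points + point_noise + element_drift

# Toy usage: 5 lane polylines with 20 points each.
prev_gt = torch.rand(5, 20, 2) * 60 - 30        # BEV coordinates in [-30, 30] m
noised_queries = make_denoising_queries(prev_gt)
print(noised_queries.shape)
```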


# 235
Strong Double Blind
Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention

Xunjiang Gu · Guanyu Song · Igor Gilitschenski · Marco Pavone · Boris Ivanovic

Understanding road geometry is a critical component of the autonomous vehicle (AV) stack. While high-definition (HD) maps can readily provide such information, they suffer from high labeling and maintenance costs. Accordingly, many recent works have proposed methods for estimating HD maps online from sensor data. The vast majority of recent approaches encode multi-camera observations into an intermediate representation, e.g., a bird's eye view (BEV) grid, and produce vector map elements via a decoder. While this architecture is performant, it decimates much of the information encoded in the intermediate representation, preventing downstream tasks (e.g., behavior prediction) from leveraging it. In this work, we propose exposing the rich internal features of online map estimation methods and show how they enable more tightly integrating online mapping with trajectory forecasting. In doing so, we find that directly accessing internal BEV features yields up to 73% faster inference speeds and up to 29% more accurate predictions on the real-world nuScenes dataset.


# 234
Strong Double Blind
Early Anticipation of Driving Maneuvers

Abdul Wasi Lone · Shankar Gangisetty · Shyam Nandan · C. V. Jawahar

Prior works have addressed the problem of driver intention prediction (DIP) by identifying maneuvers after their onset. However, early anticipation is equally important in scenarios that demand a preemptive response before a maneuver begins. No prior work addresses driver action anticipation before the onset of the maneuver, limiting the ability of advanced driver assistance systems (ADAS) to anticipate maneuvers early. In this work, we introduce Anticipating Driving Maneuvers (ADM), a new task that enables driver action anticipation before the onset of the maneuver. To initiate research on the ADM task, we curate the Driving Action Anticipation Dataset (DAAD), which is multi-view (in- and out-cabin views in dense and heterogeneous scenarios) and multimodal (egocentric view and gaze information). The dataset captures sequences both before the initiation and during the execution of a maneuver, and covers a wide diversity of traffic scenarios, weather and illumination, and driveway conditions. Next, we propose a strong baseline based on a transformer architecture to effectively model multiple views and modalities over longer video lengths. We benchmark the existing DIP methods on DAAD and related datasets. Finally, we perform an ablation study showing the effectiveness of multiple views and modalities in maneuver anticipation. Project Page: https://cvit.iiit.ac.in/research/projects/cvit-projects/daad.


# 231
Adaptive Human Trajectory Prediction via Latent Corridors

Neerja Thakkar · Karttikeya Mangalam · Andrea Bajcsy · Jitendra Malik

Human trajectory prediction is typically posed as a zero-shot generalization problem: a predictor is learnt on a dataset of human motion in training scenes, and then deployed on unseen test scenes. While this paradigm has yielded tremendous progress, it fundamentally assumes that trends in human behavior within the deployment scene are constant over time. As such, current prediction models are unable to adapt to scene-specific transient human behaviors, such as crowds temporarily gathering to see buskers, pedestrians hurrying through the rain and avoiding puddles, or a protest breaking out. We formalize the problem of scene-specific adaptive trajectory prediction and propose a new adaptation approach inspired by prompt tuning called latent corridors. By augmenting the input of any pre-trained human trajectory predictor with learnable image prompts, the predictor can improve in the deployment scene by inferring trends from extremely small amounts of new data (e.g., 2 humans observed for 30 seconds). With less than 0.1% additional model parameters, we see up to 23.9% ADE improvement on MOTSynth simulated data and 16.4% ADE improvement on MOT and Wildtrack real pedestrian data. Qualitatively, we observe that latent corridors imbue predictors with an awareness of scene geometry and scene-specific human behaviors that non-adaptive predictors struggle to capture.
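
The latent-corridor idea is closely related to prompt tuning: a tiny set of learnable parameters is prepended to the input of a frozen predictor and optimized on the deployment scene. The sketch below illustrates that general pattern under assumed interfaces (a generic `predictor` module and feature shapes of our choosing); it is not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentCorridorAdapter(nn.Module):
    """Prompt-tuning-style adapter: a frozen trajectory predictor is augmented
    with a small learnable prompt concatenated to its input features.
    A minimal sketch; `predictor`, shapes, and names are assumptions."""

    def __init__(self, predictor: nn.Module, feat_dim: int, prompt_len: int = 4):
        super().__init__()
        self.predictor = predictor
        for p in self.predictor.parameters():
            p.requires_grad_(False)                      # keep the base model frozen
        # A tiny learnable prompt per deployment scene (<0.1% extra parameters).
        self.prompt = nn.Parameter(torch.zeros(prompt_len, feat_dim))

    def forward(self, scene_feats: torch.Tensor) -> torch.Tensor:
        # scene_feats: (B, T, feat_dim) observed-trajectory / scene features.
        batch = scene_feats.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.predictor(torch.cat([prompt, scene_feats], dim=1))
```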


# 232
Strong Double Blind
Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model

Guanren Qiao · Guiliang Liu · Guorui Quan · Rongxiao Qu

Modeling the trajectories of intelligent vehicles is an essential component of a traffic-simulating system. However, such trajectory predictors are typically trained to imitate the movements of human drivers. These imitation models often fall short of capturing safety-critical events residing in the long-tail end of the data distribution, especially in complex environments involving multiple drivers. In this paper, we propose a game-theoretic perspective to resolve this challenge by modeling the competitive interactions of vehicles in a general-sum Markov game and characterizing these safety-critical events with the correlated equilibrium. To achieve this goal, we pretrain a generative world model to predict the environmental dynamics of self-driving scenarios. Based on this world model, we probe the action predictor to identify the Coarse Correlated Equilibrium (CCE) by incorporating both an optimistic Bellman update and magnetic mirror descent into the objective function of the Multi-Agent Reinforcement Learning (MARL) algorithm. We conduct extensive experiments to demonstrate that our algorithm outperforms other baselines in terms of efficiently closing the CCE-gap and generating meaningful trajectories in competitive autonomous driving environments.


# 230
Probabilistic Weather Forecasting with Deterministic Guidance-based Diffusion Model

Donggeun Yoon · Minseok Seo · Doyi Kim · Yeji Choi · Donghyeon Cho

Weather forecasting requires both deterministic outcomes for immediate decision-making and probabilistic results for assessing uncertainties. However, deterministic models may not fully capture the spectrum of weather possibilities, and probabilistic forecasting can lack the precision needed for specific planning, presenting significant challenges as the field aims to enhance accuracy and reliability. In this paper, we propose the Deterministic Guidance-based Diffusion Model (DGDM) to exploit the benefits of both deterministic and probabilistic weather forecasting models. DGDM integrates a deterministic branch and a diffusion model as a probabilistic branch to improve forecasting accuracy while providing probabilistic forecasts. In addition, we introduce a sequential variance schedule that predicts from the near future to the distant future. Moreover, we present a truncated diffusion process that uses the result of the deterministic branch to truncate the reverse process of the diffusion model and control uncertainty. We conduct extensive analyses of DGDM on Moving MNIST. Furthermore, we evaluate the effectiveness of DGDM on the Pacific Northwest Windstorm (PNW)-Typhoon satellite dataset for regional extreme weather forecasting, as well as on the WeatherBench dataset for global weather forecasting. Experimental results show that DGDM achieves state-of-the-art performance not only in global forecasting but also in regional forecasting scenarios. The code is available at: https://github.com/DongGeun-Yoon/DGDM.
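
The truncated diffusion described above can be pictured as starting the reverse process from the deterministic forecast rather than from pure noise. Below is a minimal sketch under assumed interfaces (a hypothetical `denoiser`, an illustrative DDPM-style schedule, and a chosen truncation step); it is not the released DGDM code.

```python
import torch

def truncated_reverse_process(denoiser, det_forecast, cond,
                              num_steps=1000, truncate_t=200):
    """Sketch of truncated diffusion: instead of denoising from pure noise at
    t = num_steps, diffuse the deterministic branch's forecast forward to an
    intermediate step truncate_t and run only the remaining reverse steps.
    `denoiser`, `cond`, and the schedule are illustrative assumptions."""
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Forward-diffuse the deterministic forecast to step truncate_t.
    noise = torch.randn_like(det_forecast)
    x_t = torch.sqrt(alpha_bars[truncate_t]) * det_forecast + \
          torch.sqrt(1.0 - alpha_bars[truncate_t]) * noise

    # Reverse only the truncated tail of the chain.
    for t in reversed(range(truncate_t)):
        eps = denoiser(x_t, t, cond)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x_t = (x_t - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x_t = x_t + torch.sqrt(betas[t]) * torch.randn_like(x_t)
    return x_t
```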


# 229
Strong Double Blind
Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation

Friedhelm Hamann · Ziyun Wang · Ioannis Asmanis · Kenneth Chaney · Guillermo Gallego · Kostas Daniilidis

Current optical flow and point-tracking methods rely heavily on synthetic datasets. Event cameras are novel vision sensors with advantages in challenging visual conditions, but state-of-the-art frame-based methods cannot be easily adapted to event data due to the limitations of current event simulators. We introduce a novel self-supervised loss combining the Contrast Maximization framework with a non-linear motion prior in the form of pixel-level trajectories and propose an efficient solution to solve the high-dimensional assignment problem between non-linear trajectories and events. Their effectiveness is demonstrated in two scenarios: In dense continuous-time motion estimation, our method improves the zero-shot performance of a synthetically trained model on the real-world dataset EVIMO2 by 29%. In optical flow estimation, our method elevates a simple UNet to achieve state-of-the-art performance among self-supervised methods on the DSEC optical flow benchmark.


# 227
Strong Double Blind
Temporal Event Stereo via Joint Learning with Stereoscopic Flow

Hoonhee Cho · Jae-young Kang · Kuk-Jin Yoon

Event cameras are dynamic vision sensors inspired by the biological retina, characterized by their high dynamic range, high temporal resolution, and low power consumption. These features make them capable of perceiving 3D environments even in extreme conditions. Event data is continuous across the time dimension, which allows a detailed description of each pixel's movements. To fully utilize the temporally dense and continuous nature of event cameras, we propose a novel temporal event stereo, a framework that continuously uses information from previous time steps. This is accomplished through the simultaneous training of an event stereo matching network alongside stereoscopic flow, a new concept that captures all pixel movements from stereo cameras. Since obtaining ground truth for optical flow during training is challenging, we propose a method that uses only disparity maps to train the stereoscopic flow. Ultimately, we enhance the performance of event-based stereo matching by temporally aggregating information using the flows. We have achieved state-of-the-art performance on the MVSEC and DSEC datasets. Our method is computationally efficient as it stacks previous information in a cascading manner.


# 223
Strong Double Blind
FARSE-CNN: Fully Asynchronous, Recurrent and Sparse Event-Based CNN

Riccardo Santambrogio · Marco Cannici · Matteo Matteucci

Event cameras are neuromorphic image sensors that respond to per-pixel brightness changes, producing a stream of asynchronous and spatially sparse events. Currently, the most successful algorithms for event cameras convert batches of events into dense image-like representations that are synchronously processed by deep learning models of frame-based computer vision. These methods discard the inherent properties of events, leading to high latency and computational costs. Following a recent line of works, we propose a model for efficient asynchronous event processing that exploits sparsity. We design the Fully Asynchronous, Recurrent and Sparse Event-Based CNN (FARSE-CNN), a novel multi-layered architecture which combines the mechanisms of recurrent and convolutional neural networks. To build efficient deep networks, we propose compression modules that allow learning hierarchical features both in space and time. We theoretically derive the complexity of all components in our architecture, and experimentally validate our method on tasks for object recognition, object detection and gesture recognition. FARSE-CNN achieves similar or better performance than the state-of-the-art among asynchronous methods, with low computational complexity and without relying on a fixed-length history of events. Our code will be released on GitHub.


# 226
Strong Double Blind
Event-Adapted Video Super-Resolution

Zeyu Xiao · Dachun Kai · Yueyi Zhang · Zheng-Jun Zha · Xiaoyan Sun · Zhiwei Xiong

Introducing event cameras into video super-resolution (VSR) shows great promise. In practice, however, integrating event data as a new modality necessitates a laborious model architecture design. This not only consumes substantial time and effort but also disregards valuable insights from successful existing VSR models. Furthermore, the resource-intensive process of retraining these newly designed structures exacerbates the challenge. In this paper, inspired by recent success of parameter-efficient tuning in reducing the number of trainable parameters of a pre-trained model for downstream tasks, we introduce the Event AdapTER (EATER) for VSR. EATER efficiently utilizes pre-trained VSR model knowledge at the feature level through two lightweight and trainable components: the event-adapted alignment (EAA) unit and the event-adapted fusion (EAF) unit. The EAA unit aligns multiple frames based on the event stream in a coarse-to-fine manner, while the EAF unit efficiently fuses frames with the event stream through a multi-scaled design. Thanks to both units, EATER outperforms the full fine-tuning paradigm. Comprehensive experiments demonstrate the effectiveness of EATER, achieving superior results with parameter efficiency.


# 233
Strong Double Blind
Diffusion Models as Optimizers for Efficient Planning in Offline RL

Renming Huang · Yunqiang Pei · Guoqing Wang · Yangming Zhang · Yang Yang · Peng Wang · Heng Tao Shen

Diffusion models have shown strong competitiveness in offline reinforcement learning tasks by formulating decision-making as sequential generation. However, the practicality of these methods is limited by the lengthy inference processes they require. In this paper, we address this problem by decomposing the sampling process of diffusion models into two decoupled subprocesses: 1) generating a feasible trajectory, which is the time-consuming part, and 2) optimizing the trajectory. With this decomposition, we are able to partially separate the efficiency and quality factors, enabling us to gain efficiency while ensuring quality. We propose the Trajectory Diffuser, which utilizes a faster autoregressive model to handle the generation of feasible trajectories while retaining the trajectory optimization process of diffusion models. This allows us to achieve more efficient planning without sacrificing capability. To evaluate the effectiveness and efficiency of the Trajectory Diffuser, we conduct experiments on the D4RL benchmarks. The results demonstrate that our method achieves 3-10x faster inference speed compared to previous sequence modeling methods, while also outperforming them in terms of overall performance. Code will be publicly available.
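
The decomposition can be sketched as a fast autoregressive proposal followed by a short diffusion-style refinement. The toy code below only illustrates that control flow; the `ar_model` and `diffusion_refiner` interfaces are assumptions, not the Trajectory Diffuser implementation.

```python
import torch

@torch.no_grad()
def plan_with_trajectory_refinement(ar_model, diffusion_refiner, obs,
                                    horizon=32, refine_steps=10):
    """Two-stage planning sketch: (1) an autoregressive model quickly proposes
    a feasible trajectory, (2) a diffusion-style refiner optimizes it for a few
    steps. All module interfaces here are assumptions, not the paper's code."""
    # Stage 1: fast autoregressive proposal, one state at a time.
    states = [obs]                                         # obs: (B, state_dim)
    for _ in range(horizon):
        states.append(ar_model(torch.stack(states, dim=1)))  # predict next state
    proposal = torch.stack(states[1:], dim=1)              # (B, horizon, state_dim)

    # Stage 2: short diffusion-style refinement of the whole trajectory.
    traj = proposal
    for t in reversed(range(refine_steps)):
        traj = diffusion_refiner(traj, t, obs)              # denoise / optimize
    return traj
```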


# 252
Scene-aware Human Motion Forecasting via Mutual Distance Prediction

Chaoyue Xing · Wei Mao · Miaomiao LIU

In this paper, we tackle the problem of scene-aware 3D human motion forecasting. A key challenge of this task is to predict future human motions that are consistent with the scene by modeling the human-scene interactions. While recent works have demonstrated that explicit constraints on human-scene interactions can prevent the occurrence of ghost motion, they only provide constraints on partial human motion, e.g., the global motion of the human or a few joints contacting the scene, leaving the rest of the motion unconstrained. To address this limitation, we propose to model the human-scene interaction with the mutual distance between the human body and the scene. Such mutual distances constrain both the local and global human motion, resulting in whole-body constrained motion prediction. In particular, mutual distance constraints consist of two components: the signed distance of each vertex on the human mesh to the scene surface and the distance of basis scene points to the human mesh. We further introduce a global scene representation learned from a signed distance function (SDF) volume to ensure coherence between the global scene representation and the explicit constraint from the mutual distance. We develop a pipeline with two sequential steps: predicting the future mutual distances first, followed by forecasting future human motion. During training, we explicitly encourage consistency between predicted poses and mutual distances. Extensive evaluations on existing synthetic and real datasets demonstrate that our approach consistently outperforms the state-of-the-art methods.


# 242
CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion

Jiarui Sun · Girish Chowdhary

Stochastic Human Motion Prediction (HMP) aims to predict multiple possible future pose sequences from observed ones. Most prior works learn motion distributions through encoding-decoding in latent space, which does not preserve motion's spatial-temporal structure. While effective, these methods often require complex, multi-stage training and yield predictions that are inconsistent with the provided history and can be physically unrealistic. To address these issues, we propose CoMusion, a single-stage, end-to-end diffusion-based stochastic HMP framework. CoMusion is inspired by the insight that a smooth future pose initialization improves prediction performance, a strategy not previously utilized in stochastic models but evidenced in deterministic works. To generate such initialization, CoMusion's motion predictor starts with a Transformer-based network for initial reconstruction of corrupted motion. Then, a graph convolutional network (GCN) is employed to refine the prediction considering past observations in the discrete cosine transformation (DCT) space. Our method, facilitated by the Transformer-GCN module design and a proposed variance scheduler, excels in predicting accurate, realistic, and consistent motion, while maintaining an appropriate level of diversity. Experimental results on benchmark datasets demonstrate that CoMusion surpasses prior methods in both accuracy and fidelity, achieving at least a 35% relative improvement in fidelity metrics, while demonstrating superior robustness.


# 211
Strong Double Blind
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

Jie Yang · Xuesong Niu · Nan Jiang · Ruimao Zhang · Siyuan Huang

Existing 3D human object interaction (HOI) datasets and models simply align global descriptions with the long HOI sequence, while lacking a detailed understanding of intermediate states and the transitions between states. In this paper, we argue that fine-grained semantic alignment, which utilizes state-level descriptions, offers a promising paradigm for learning semantically rich HOI representations. To achieve this, we introduce Semantic-HOI, a new dataset comprising over 20K paired HOI states with fine-grained descriptions for each HOI state and the body movements that happen between two consecutive states. Leveraging the proposed dataset, we design three state-level HOI tasks to accomplish fine-grained semantic alignment within the HOI sequence. Additionally, we propose a unified model called F-HOI, designed to leverage multimodal instructions and empower the Multi-modal Large Language Model to efficiently handle diverse HOI tasks. F-HOI offers multiple advantages: (1) It employs a unified task formulation that supports the use of versatile multimodal inputs. (2) It maintains consistency in HOI across 2D, 3D, and linguistic spaces. (3) It utilizes fine-grained textual supervision for direct optimization, avoiding intricate modeling of HOI states. Extensive experiments reveal that F-HOI effectively aligns HOI states with fine-grained semantic descriptions, adeptly tackling understanding, reasoning, generation, and reconstruction tasks.


# 213
Bridging the Gap Between Human Motion and Action Semantics via Kinematics Phrases

Xinpeng Liu · Yong-Lu Li · AILING ZENG · Zizheng Zhou · Yang You · Cewu Lu

The goal of motion understanding is to establish a reliable mapping between motion and action semantics, yet it is a challenging many-to-many problem. An abstract action semantic (e.g., walk forwards) can be conveyed by perceptually diverse motions (walking with arms up or swinging), while a motion can carry different semantics depending on its context and intention. This makes an elegant mapping between them difficult. Previous attempts adopted direct-mapping paradigms with limited reliability. Also, current automatic metrics fail to provide reliable assessments of the consistency between motions and action semantics. We identify the source of these problems as the significant gap between the two modalities. To alleviate this gap, we propose Kinematic Phrases (KP), which capture the objective kinematic facts of human motion with proper abstraction, interpretability, and generality. Based on KP, we can unify a motion knowledge base and build a motion understanding system. Meanwhile, KP can be automatically converted from motions to text descriptions with no subjective bias, inspiring Kinematic Prompt Generation (KPG) as a novel white-box motion generation benchmark. In extensive experiments, our approach shows superiority over other methods. Our code and data will be made publicly available.


# 216
CoMo: Controllable Motion Generation through Language Guided Pose Code Editing

Yiming Huang · WEILIN WAN · Yue Yang · Chris Callison-Burch · Mark Yatskar · Lingjie Liu

Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as "left knee slightly bent". Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models while, in human studies, CoMo substantially surpasses previous work in motion editing abilities.


# 215
Strong Double Blind
Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Peng Jin · Hao Li · Zesen Cheng · Kehan Li · Runyi Yu · Chang Liu · Xiangyang Ji · Li Yuan · Jie Chen

Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals. Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process for synthesizing global motion, we calculate the local-action gradient to provide conditional guidance. This local-to-global paradigm reduces the complexity associated with direct global motion generation and promotes motion diversity via sampling diverse actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method provides flexibility in seamlessly combining various local actions and continuous guiding weight adjustment, accommodating diverse user preferences, which may hold potential significance for the community.


# 343
Strong Double Blind
Co-speech Gesture Video Generation with 3D Human Meshes

Aniruddha Mahapatra · Richa Mishra · Ziyi Chen · Boyang Ding · Renda Li · Shoulei Wang · Jun-Yan Zhu · Peng Chang · Mei Han · Jing Xiao

Co-speech gesture video generation is an enabling technique for numerous digital human applications in the post-ChatGPT era. Substantial progress has been made in creating high-quality talking head videos. However, existing hand gesture video generation methods are largely limited by the widely adopted 2D skeleton-based gesture representation, and still struggle to generate realistic hands. We propose a novel end-to-end audio-driven co-speech video generation pipeline to synthesize human speech videos leveraging 3D human mesh-based representations. By adopting a 3D human mesh-based gesture representation, we present a mesh-grounded video generator that includes a mesh texture-map optimization step followed by a new conditional GAN-based network, and outputs photorealistic gesture videos with realistic hands. Our experiments on the TalkSHOW dataset demonstrate the effectiveness of our method over a baseline that uses 2D skeleton-based representation.


# 281
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Muyao Niu · Xiaodong Cun · Xintao Wang · Yong Zhang · Ying Shan · zheng yinqiang

We present MOFA-Video, an advanced controllable image animation method that generates video from a given image using various additional control signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations. This differs from previous methods, which can only work in a specific motion domain or show weak control ability with a diffusion prior. To achieve our goal, we design several domain-aware motion field adapters (i.e., MOFA-Adapters) to control the generated motions in the video generation pipeline. For MOFA-Adapters, we consider the temporal motion consistency of the video and first generate dense motion flow from the given sparse control conditions; then, the multi-scale features of the given image are warped as guidance features for stable video diffusion generation. We naively train two motion adapters for the manual trajectories and the human landmarks individually, since they both contain only sparse control information. After training, the MOFA-Adapters in different domains can also work together for more controllable video generation.


# 183
MEVG : Multi-event Video Generation with Text-to-Video Models

Gyeongrok Oh · Jaehwan Jeong · Sieun Kim · Wonmin Byeon · Jinkyu Kim · Sungwoong Kim · Sangpil Kim

We introduce a novel diffusion-based video generation method that generates a video showing multiple events given multiple individual sentences from the user. Our method does not require a large-scale video dataset, since it uses a pre-trained diffusion-based text-to-video generative model without a fine-tuning process. Specifically, we propose a last-frame-aware diffusion process to preserve visual coherence between consecutive videos, where each video depicts a different event, by initializing the latent and simultaneously adjusting the noise in the latent to enhance the motion dynamics of the generated video. Furthermore, we find that iteratively updating the latent vectors by referring to all the preceding frames maintains the global appearance across the frames in a video clip. To handle dynamic text input for video generation, we utilize a novel prompt generator that converts coarse text messages from the user into multiple optimized prompts for the text-to-video diffusion model. Extensive experiments and user studies show that our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
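
A simple way to picture the last-frame-aware initialization is to seed the next clip's latents from the previous clip's final-frame latent and then add fresh noise to keep the motion dynamic. The sketch below is illustrative only; the shapes and the `noise_scale` knob are our assumptions, not the paper's exact procedure.

```python
import torch

def init_next_clip_latents(prev_clip_latents, num_frames, noise_scale=0.5):
    """Sketch of a last-frame-aware initialization: the latents of a new clip
    are seeded from the last frame of the previously generated clip, with fresh
    noise injected per frame. Shapes and noise_scale are illustrative."""
    # prev_clip_latents: (B, F, C, H, W) latents of the already generated clip.
    last = prev_clip_latents[:, -1:]                      # (B, 1, C, H, W)
    init = last.repeat(1, num_frames, 1, 1, 1)            # carry appearance forward
    init = init + noise_scale * torch.randn_like(init)    # inject per-frame noise
    return init
```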


# 98
Strong Double Blind
HARIVO: Harnessing Text-to-Image Models for Video Generation

Mingi Kwon · Seoung Wug Oh · Yang Zhou · Joon-Young Lee · Difan Liu · Haoran Cai · Baqiao Liu · Feng Liu · Youngjung Uh

We present a method to create diffusion-based Video models from pretrained Text-to-Image (T2I) models, overcoming limitations of existing methods. We propose a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. We demonstrate superior performance through extensive experiments and comparisons.


# 83
Strong Double Blind
WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing

Yutang Feng · Sicheng Gao · Yuxiang Bao · Xiaodi Wang · Shumin Han · Juan Zhang · Baochang Zhang · Angela Yao

Text-driven video editing has emerged as a prominent application built on the breakthroughs of image diffusion models. Existing state-of-the-art methods focus on zero-shot frameworks due to limited training data and computational expense. To preserve structure consistency, previous frameworks usually employ Denoising Diffusion Implicit Model (DDIM) inversion to provide inverted noise latents as guidance. The key challenge lies in limiting the errors caused by the randomness and inaccuracy in each step of the naive DDIM inversion process, which can lead to temporal inconsistency in video editing tasks. Our observation indicates that incorporating temporal keyframe information can alleviate the accumulated error during inversion. In this paper, we propose an effective warping strategy in the feature domain to obtain high-quality DDIM inverted noise latents. Specifically, we shuffle the editing frames randomly in each timestep and use optical flow extracted from the source video to propagate the latent features of the first keyframe to subsequent keyframes. Moreover, we develop a comprehensive zero-shot framework that adopts this strategy in both the inversion and denoising processes, thereby facilitating the generation of consistent edited videos.


# 85
Strong Double Blind
RegionDrag: Fast Region-Based Image Editing with Diffusion Models

Jingyi Lu · Xinghui Li · Kai Han

Point-drag-based image editing methods, like DragDiffusion, have attracted significant attention. However, point-drag-based approaches suffer from computational overhead and misinterpretation of user intentions, due to the sparsity of point-based editing instructions. In this paper, we propose a region-based copy-and-paste dragging method, RegionDrag, to overcome these limitations. RegionDrag allows users to express their editing instructions in the form of handle and target regions, enabling more precise control and alleviating ambiguity. In addition, region-based operations complete editing in one iteration and are much faster than point-drag-based methods. We also incorporate the attention-swapping technique for enhanced stability during editing. To validate our approach, we extend existing point-drag-based datasets with region-based dragging instructions. Experimental results demonstrate that RegionDrag outperforms existing point-drag-based approaches in terms of speed, accuracy, and alignment with user intentions. Remarkably, RegionDrag completes the edit on an image with a resolution of 512x512 in less than 2 seconds, which is more than 100x faster than DragDiffusion, while achieving better performance. Project page: https://visual-ai.github.io/regiondrag.


# 86
Strong Double Blind
TurboEdit: Real-time text-based disentangled real image editing

Zongze Wu · Nicholas I Kolkin · Jonathan Brandt · Richard Zhang · Eli Shechtman

This paper presents a novel approach for real-time image editing leveraging few-step diffusion models. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on a detailed text prompt. Our method involves generating a source image by fixing the random seed and utilizing a lengthy text prompt, followed by modifying one attribute in the text prompt to regenerate the target image. We observe that the source and target images are nearly identical, differing only in the modified attribute. Additionally, we introduce an iterative image inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for the correction of the reconstructed image towards the input image. The information of the input image is preserved in the detailed text prompt and four levels of noise maps. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt, resulting in the generation of a new image similar to the input image with only one attribute changed. Furthermore, our method achieves real-time performance, running in milliseconds for both the inversion and editing processes.


# 78
Factorized Diffusion: Perceptual Illusions by Noise Decomposition

Daniel Geng · Inbum Park · Andrew Owens

Given a factorization of an image into various components, we present a method to independently control these components through diffusion model sampling. For example, decomposing an image into low and high spatial frequencies allows us to produce images whose low frequencies align with one prompt, and whose high frequencies align with another prompt. That is, we are able to produce hybrid images. We also explore a decomposition into Lab color space, allowing us to produce images that appear to be one thing when viewed in greyscale, but change appearance when color is added back. Our method is simple and only modifies the sampling procedure of a pretrained text-conditional image diffusion model. It works by denoising with a composite noise estimate, where each component of the estimate comes from a noise estimate conditioned on a different prompt. We provide qualitative results showing that this method is effective, give intuition for why this approach succeeds, and derive conditions on the image decomposition for the method to work. In addition, we provide quantitative evaluations demonstrating that our method is better than prior work on hybrid image generation, and we generate hybrid images with three different contents.
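
The composite noise estimate is concrete enough to sketch: take two conditional noise predictions, keep the low-frequency band of one and the high-frequency band of the other, and sum them before the usual sampling update. The code below is a minimal illustration in which the band split via a Gaussian blur and all tensor shapes are our assumptions, not the authors' exact decomposition.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(x, ksize=9, sigma=3.0):
    """Separable depthwise Gaussian low-pass used to split frequency bands.
    x: (B, C, H, W)."""
    half = ksize // 2
    t = torch.arange(ksize, dtype=x.dtype, device=x.device) - half
    k = torch.exp(-(t ** 2) / (2 * sigma ** 2))
    k = (k / k.sum()).view(1, 1, 1, ksize)
    c = x.shape[1]
    x = F.conv2d(x, k.expand(c, 1, 1, ksize), padding=(0, half), groups=c)
    x = F.conv2d(x, k.transpose(2, 3).expand(c, 1, ksize, 1),
                 padding=(half, 0), groups=c)
    return x

def composite_noise_estimate(eps_low_prompt, eps_high_prompt):
    """Combine two conditional noise estimates so that low frequencies follow
    one prompt and high frequencies follow another (hybrid images)."""
    low = gaussian_blur(eps_low_prompt)                     # low band of prompt A
    high = eps_high_prompt - gaussian_blur(eps_high_prompt)  # high band of prompt B
    return low + high
```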


# 84
Strong Double Blind
DiffusionPen: Towards Controlling the Style of Handwritten Text Generation

KONSTANTINA NIKOLAIDOU · George Retsinas · Giorgos Sfikas · Marcus Liwicki

Handwritten text generation (HTG) conditioned on content and style is a challenging task due to the variability of inter-user characteristics and the unlimited combinations of characters that form new words unseen during training. Diffusion Models have recently shown promising results in HTG, however they are still under-explored. We present DiffusionPen (DiffPen), a 5-shot style handwritten text generation approach based on Latent Diffusion Models. By utilizing a hybrid style extractor that combines the power of metric learning and classification, our approach manages to capture both textual and stylistic characteristics of seen and unseen words and styles and generate realistic handwritten samples. We perform experiments on the IAM offline handwriting database to evaluate the generated style and content and compare them with other state-of-the-art (SotA) methods. Our method outperforms existing approaches qualitatively and quantitatively, and additional data generated by our method can improve the performance of a Handwriting Text Recognition (HTR) system. The code is available at: (code repository will be released in case of acceptance to maintain anonymity).


# 80
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs

Viraj Shah · Nataniel Ruiz · Forrester Cole · Erika Lu · Svetlana Lazebnik · Yuanzhen Li · Varun Jampani

Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and subjects, existing techniques do not reliably address the problem, so either subject fidelity or style fidelity is compromised. We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. Experiments on a wide range of subject and style combinations show that ZipLoRA can generate compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize.
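
The core merging idea can be sketched as learning per-column coefficients for two precomputed LoRA weight deltas, with a penalty that discourages the scaled updates from interfering. The code below is a rough single-layer illustration under assumed shapes and names; it is not the ZipLoRA release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMerger(nn.Module):
    """Sketch of merging two LoRA weight deltas for one layer with learnable
    per-column coefficients. delta_subject / delta_style are precomputed B @ A
    updates of shape (out_dim, in_dim); all names here are our assumptions."""

    def __init__(self, delta_subject, delta_style):
        super().__init__()
        self.register_buffer("d_sub", delta_subject)
        self.register_buffer("d_sty", delta_style)
        in_dim = delta_subject.shape[1]
        self.m_sub = nn.Parameter(torch.ones(in_dim))   # per-column merger coeffs
        self.m_sty = nn.Parameter(torch.ones(in_dim))

    def merged_delta(self):
        # Broadcast the coefficients over the rows (columns are rescaled).
        return self.d_sub * self.m_sub + self.d_sty * self.m_sty

    def interference_loss(self):
        # Penalize overlap between the two scaled updates, column by column.
        a = self.d_sub * self.m_sub
        b = self.d_sty * self.m_sty
        return F.cosine_similarity(a, b, dim=0).abs().mean()
```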


# 76
Strong Double Blind
Scaling Up Personalized Image Aesthetic Assessment via Task Vector Customization

Jooyeol Yun · Choo Jaegul

The task of personalized image aesthetic assessment seeks to tailor aesthetic score prediction models to match individual preferences with just a few user-provided inputs. However, the scalability and generalization capabilities of current approaches are considerably restricted by their reliance on an expensive curated database. To overcome this long-standing scalability challenge, we present a unique approach that leverages readily available databases for general image aesthetic assessment and image quality assessment. Specifically, we view each database as a distinct image score regression task that exhibits varying degrees of personalization potential. By determining optimal combinations of task vectors, known to represent specific traits of each database, we successfully create personalized models for individuals. This approach of integrating multiple models allows us to harness a substantial amount of data. Our extensive experiments demonstrate the effectiveness of our approach in generalizing to previously unseen domains---a challenge previous approaches have struggled to achieve---making it highly applicable to real-world scenarios. Our novel approach significantly advances the field by offering scalable solutions for personalized aesthetic assessment and establishing high standards for future research.


# 92
FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation

Xinzhi MU · Li Chen · Bohan CHEN · Shuyang Gu · Jianmin Bao · DONG CHEN · Ji Li · Yuhui Yuan

Recently, the application of modern diffusion-based text-to-image generation models for creating artistic fonts, traditionally the domain of professional designers, has garnered significant interest. Diverging from the majority of existing studies that concentrate on generating artistic typography, our research aims to tackle a novel and more demanding challenge: the generation of text effects for multilingual fonts. This task essentially requires generating coherent and consistent visual content within the confines of a font-shaped canvas, as opposed to a traditional rectangular canvas. To address this task, we introduce a novel shape-adaptive diffusion model capable of interpreting the given shape and strategically planning pixel distributions within the irregular canvas. To achieve this, we curate a high-quality shape-adaptive image-text dataset and incorporate the segmentation mask as a visual condition to steer the image generation process within the irregular canvas. This approach enables the traditional rectangular-canvas-based diffusion model to produce the desired concepts in accordance with the provided geometric shapes. Second, to maintain consistency across multiple letters, we also present a training-free, shape-adaptive effect transfer method for transferring textures from a generated reference letter to others. The key insights are building a font effect noise prior and propagating the font effect information in a concatenated latent space. The efficacy of our FontStudio system is confirmed through user preference studies, which show a marked preference (78% win rates on aesthetics) for our system even when compared to the latest unrivaled commercial product, Adobe Firefly.


# 94
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

Sun Yanan · Yanchen Liu · Yinhao Tang · Wenjie Pei · Kai Chen

The field of text-to-image (T2I) generation has made significant progress in recent years, thanks to diffusion models. Linguistic control enables effective content creation, but falls short in fine-grained control over image generation. This challenge has been addressed to a great extent by incorporating additional user-supplied spatial conditions, such as depth maps and edge maps, into pre-trained T2I models via extra encoding. However, multi-control image synthesis still struggles with input flexibility, handling the relationships among spatial conditions, and maintaining compatibility with text inputs. To address these challenges, we propose AnyControl, a controllable image synthesis framework that supports any combination of various forms of control signals. AnyControl develops a novel multi-control encoder to extract a unified multi-modal embedding for diverse control signals used to guide the generation process. We achieve this by employing an alternating multi-control encoding scheme and a multi-control alignment scheme, with learnable queries as a bridge to unite them seamlessly and gradually distill compatible information from spatial conditions guided by textual prompts. This approach enables a holistic understanding of user inputs, and produces harmonious results of high quality and fidelity under versatile control signals, as demonstrated by extensive quantitative and qualitative results.


# 99
Strong Double Blind
Training-free Composite Scene Generation for Layout-to-Image Synthesis

Jiaqi Liu · Tao Huang · Chang Xu

Recent breakthroughs in text-to-image diffusion models have significantly advanced the generation of high-fidelity, photo-realistic images from textual descriptions. Yet, these models often struggle with interpreting spatial arrangements from text, hindering their ability to produce images with precise spatial configurations. To bridge this gap, layout-to-image generation has emerged as a promising direction. However, training-based approaches are limited by the need for extensively annotated datasets, leading to high data acquisition costs and a constrained conceptual scope. Conversely, training-free methods face challenges in accurately locating and generating semantically similar objects within complex compositions. This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase. By refining intra-token loss with selective sampling and enhancing the diffusion process with attention redistribution, we propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships, enhancing the fidelity and complexity of generated images. Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.


# 90
Strong Double Blind
Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Fabio Quattrini · Vittorio Pippi · Silvia Cascianelli · Rita Cucchiara

Diffusion models have become the state of the art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which recent works have tackled by combining independent diffusion paths over overlapping latent features, a setting referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantic coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrate that our method maintains compatibility with the input prompt and the visual quality of the generated images while increasing their semantic coherence.
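
As background, the joint-diffusion setting the paper improves on can be sketched as running the denoiser on overlapping windows of a wide panorama latent and averaging the overlapping noise estimates. The code below illustrates only that baseline step, with assumed interfaces; the proposed Merge-Attend-Diffuse operator additionally merges the attention computation across the paths, which is not shown here.

```python
import torch

def joint_diffusion_step(unet, latent, t, cond, window=64, stride=48):
    """One denoising step over a wide panorama latent: run the UNet on
    overlapping windows and average the per-window noise estimates back into
    place. Interfaces and window sizes are illustrative assumptions."""
    _, _, _, width = latent.shape                 # assumes width >= window
    starts = list(range(0, width - window + 1, stride))
    if starts[-1] != width - window:              # make sure the right edge is covered
        starts.append(width - window)

    eps = torch.zeros_like(latent)
    count = torch.zeros_like(latent)
    for x0 in starts:
        crop = latent[:, :, :, x0:x0 + window]
        eps[:, :, :, x0:x0 + window] += unet(crop, t, cond)
        count[:, :, :, x0:x0 + window] += 1.0
    return eps / count                            # averaged noise estimate
```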


# 91
Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Yasi Zhang · Peiyu Yu · Ying Nian Wu

Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a $z$-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects' attention towards their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models.


# 89
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

Omer Dahary · Or Patashnik · Kfir Aberman · Danny Cohen-Or

Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model’s attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts. Our code will be available upon publication.
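
The bounding idea can be pictured as a mask on the attention scores: a query pixel assigned to one subject may only attend to that subject's tokens (plus shared background tokens). The sketch below shows such a masked cross-attention under assumed inputs; the subject assignments, shapes, and names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bounded_cross_attention(q, k, v, pixel_to_subject, token_to_subject):
    """Sketch of bounding cross-attention: a query pixel assigned to subject i
    may only attend to text tokens of subject i or to shared/background tokens
    (marked with -1). Subject assignments would come from the layout; all
    names and shapes here are assumptions."""
    # q: (B, Nq, d); k, v: (B, Nk, d)
    # pixel_to_subject: (Nq,) long; token_to_subject: (Nk,) long; -1 = shared.
    scores = q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5)        # (B, Nq, Nk)
    same_subject = pixel_to_subject[:, None] == token_to_subject[None, :]
    shared_token = token_to_subject[None, :] == -1
    allowed = same_subject | shared_token                        # (Nq, Nk) boolean
    scores = scores.masked_fill(~allowed[None], float("-inf"))   # block leakage
    return F.softmax(scores, dim=-1) @ v
```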


# 81
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

Kong Zhe · Yong Zhang · Tianyu Yang · Tao Wang · Kaihao Zhang · Bizhu Wu · Guanying Chen · Wei Liu · Wenhan Luo

Personalization is an important topic in text-to-image generation, especially the challenging multi-concept personalization. Current multi-concept methods struggle with identity preservation, occlusion, and the harmony between foreground and background. In this work, we propose OMG, an occlusion-friendly personalized generation framework designed to seamlessly integrate multiple concepts within a single image. We propose a novel two-stage sampling solution. The first stage handles layout generation and the collection of visual comprehension information for handling occlusions. The second stage utilizes the acquired visual comprehension information and the designed noise blending to integrate multiple concepts while considering occlusions. We also observe that the initial denoising timestep for noise blending is the key to identity preservation and layout. Moreover, our method can be combined with various single-concept models, such as LoRA and InstantID, without additional tuning. In particular, LoRA models from civitai.com can be used directly. Extensive experiments demonstrate that OMG exhibits superior performance in multi-concept personalization.


# 101
Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation

Yingshan Chang · Yasi Zhang · Zhiyuan Fang · Ying Nian Wu · Yonatan Bisk · Feng Gao

The literature on text-to-image generation is plagued by issues of faithfully composing entities with relations, yet a formal understanding of how entity-relation compositions can be effectively learned is lacking. Moreover, the underlying phenomenon space that meaningfully reflects the problem structure is not well-defined, leading to an arms race for larger quantities of data in the hope that generalization emerges out of large-scale pretraining. We hypothesize that the underlying phenomenological coverage has not been proportionally scaled up, leading to a skew of the presented phenomena which harms generalization. We introduce statistical metrics that quantify both the linguistic and visual skew of a dataset for relational learning, and show that generalization failures of text-to-image generation are a direct result of incomplete or unbalanced phenomenological coverage. We first perform experiments in a synthetic domain and demonstrate that systematically controlled metrics are strongly predictive of generalization performance. Then we move to natural images and show that simple distribution perturbations, in light of our theory, boost generalization without enlarging the absolute data size. This work points to an important direction: enhancing data diversity or balance, orthogonal to scaling up the absolute size. Our discussion raises important open questions on 1) the evaluation of generated entity-relation compositions and 2) better models for reasoning with abstract relations.


# 95
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion

Bo-Kyeong Kim · Hyoung-Kyu Song · Thibault Castells · Shinkook Choi

Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves high computing demands due to billion-scale parameters. To enhance efficiency, recent studies have reduced sampling steps and applied network quantization while retaining the original architectures. The lack of architectural reduction attempts may stem from worries over expensive retraining for such massive models. In this work, we uncover the surprising potential of block pruning and feature distillation for low-cost general-purpose T2I. By removing several residual and attention blocks from the U-Net of SDMs, we achieve 30%~50% reduction in model size, MACs, and latency. We show that distillation retraining is effective even under limited resources: using only 13 A100 days and a tiny dataset, our compact models can imitate the original SDMs (v1.4 and v2.1-base with over 6,000 A100 days). Benefiting from the transferred knowledge, our BK-SDMs deliver competitive results on zero-shot MS-COCO against larger multi-billion parameter models. We further demonstrate the applicability of our lightweight backbones in personalized generation and image-to-image translation. Deployment of our models on edge devices attains 4-second inference. We hope this work can help build small yet powerful diffusion models with feasible training budgets.


# 97
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

Xiaoshi Wu · Yiming Hao · Manyuan Zhang · Keqiang Sun · Zhaoyang Huang · Guanglu Song · Yu Liu · Hongsheng LI

Optimizing a text-to-image diffusion model with a given reward function is an important but underexplored research area. In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-to-image diffusion model and back-propagates through the iterative sampling process to the input noise. We find that training earlier steps in the sampling process is crucial for low-level rewards, and deep supervision can be achieved efficiently and effectively by stopping the gradient of the denoising network input. DRTune is extensively evaluated on various reward models. It consistently outperforms other algorithms, particularly for low-level control signals, where all shallow supervision methods fail. Additionally, we fine-tune Stable Diffusion XL 1.0 (SDXL 1.0) model via DRTune to optimize Human Preference Score v2.1, resulting in the Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances image quality compared to SDXL 1.0 and reaches comparable quality compared with Midjourney v5.2.
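
The gradient-stopping trick can be sketched directly: sample through the reverse process as usual, but detach the denoiser input at every step so that the reward gradient reaches the model parameters through each noise prediction rather than through the entire unrolled chain. The code below is a rough illustration with assumed schedules and interfaces, not the released DRTune implementation.

```python
import torch

def deep_reward_loss(denoiser, reward_fn, x_T, cond,
                     num_steps, alphas, alpha_bars, betas):
    """Sketch of deep reward supervision through the sampling process.

    At every step the denoiser *input* is detached, so gradients from the
    reward on the final image flow into the model parameters via each eps
    prediction instead of through the whole chain of latents. The schedules
    (alphas, alpha_bars, betas) and call signatures are illustrative."""
    x = x_T
    for t in reversed(range(num_steps)):
        eps = denoiser(x.detach(), t, cond)        # stop-grad on the network input
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
    return -reward_fn(x)                           # minimize => maximize the reward
```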


# 60
Strong Double Blind
MONTRAGE: Monitoring Training for Attribution of Generative Diffusion Models

Jonathan Brokman · Omer Hofman · Roman Vainshtein · Amit Giloni · Toshiya Shimizu · Inderjeet Singh · Oren Rachmil · Alon Zolfi · Asaf Shabtai · Yuki Unno · Hisashi Kojima

Diffusion models, which revolutionized image generation, are facing challenges related to intellectual property. These challenges arise when a generated image is influenced by one or more copyrighted images from the training data. Hence, pinpointing influential images from the training dataset, a task known as data attribution, becomes crucial for the clarity of content origins. We introduce MONTRAGE, a pioneering data attribution method. Unlike existing approaches that overlook the internal workings of the training process, MONTRAGE integrates a novel technique to monitor generations throughout training via internal model representations. It is tailored for customized diffusion models, where training access is a practical assumption. This approach, coupled with a new loss function, enables enhanced accuracy as well as granularity of the attributions. The advantage of MONTRAGE is evaluated at two granularity levels: semantic concept (including mix-concept images) and individual image, showing promising results. This underlines MONTRAGE's role in addressing copyright concerns in AI-generated digital art and media while enriching the understanding of the generative process.


# 96
ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation

Yi Zhang · Yun Tang · Wenjie Ruan · Xiaowei Huang · Siddartha Khastgir · Paul A Jennings · Xingyu Zhao

Text-to-Image (T2I) Diffusion Models (DMs) excel at creating high-quality images from text descriptions but, like many deep learning models, suffer from robustness issues. While there are attempts to evaluate the robustness of T2I DMs as a binary or worst-case problem, they cannot answer how robust the model is in general whenever an adversarial example (AE) can be found. In this study, we first formalise a probabilistic notion of T2I DMs' robustness, and then devise an efficient framework, ProTIP, to evaluate it with statistical guarantees. The main challenges stem from: i) the high computational cost of the image generation process; and ii) identifying whether a perturbed input is an AE, which involves comparing two output distributions and is fundamentally harder than in other DL tasks like classification, where an AE is identified upon misprediction of labels. To tackle these challenges, we employ sequential analysis with efficacy and futility early stopping rules in the statistical testing for identifying AEs, and adaptive concentration inequalities to dynamically determine the “just-right” number of stochastic perturbations whenever the verification target is met. Empirical experiments validate ProTIP's effectiveness and efficiency, and showcase its application in ranking common defence methods.
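
As a rough analogy for the statistical machinery, one can keep drawing stochastic perturbations and stop once an anytime-valid confidence radius around the empirical adversarial-example rate is small enough. The toy code below illustrates that generic sequential-estimation pattern with a Hoeffding-style bound; it is not the ProTIP procedure, which additionally uses efficacy/futility stopping rules for the per-input hypothesis tests.

```python
import math

def estimate_ae_rate(is_adversarial, sample_perturbation,
                     eps=0.05, delta=0.05, max_samples=10_000):
    """Toy sequential estimation of an adversarial-example rate.

    Keep drawing stochastic perturbations, track the empirical AE rate, and
    stop once a Hoeffding-style confidence radius (with a union bound over
    the growing number of looks) falls below eps. Purely illustrative."""
    hits, n = 0, 0
    radius = float("inf")
    while n < max_samples:
        x = sample_perturbation()              # user-supplied perturbation sampler
        hits += int(is_adversarial(x))         # user-supplied AE oracle
        n += 1
        radius = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))
        if radius < eps:                       # confidence interval tight enough
            break
    return hits / n, radius
```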


# 79
Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

Amandeep Kumar · Muhammad Awais · Sanath Narayan · Hisham Cholakkal · Salman Khan · Rao Muhammad Anwer

Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require attribute-specific classifiers, learning separate model weights for each attribute, and are not scalable to novel attributes. In this work, we propose an efficient, plug-and-play, 3D-aware face editing framework based on attribute-specific prompt learning, enabling the generation of facial images with controllable attributes across various target poses. To this end, we introduce a text-driven learnable style token-based latent attribute editor (LAE). The LAE harnesses a pre-trained vision-language model to find a text-guided attribute-specific editing direction in the latent space of any pre-trained 3D-aware GAN. It utilizes learnable style tokens and style mappers to learn and transform this editing direction to the 3D latent space. To train the LAE with multiple attributes, we use a directional contrastive loss and a style token loss. Furthermore, to ensure view consistency and identity preservation across different poses and attributes, we employ several 3D-aware identity and pose preservation losses. Our experiments show that our proposed framework generates high-quality images with 3D awareness and view consistency while maintaining attribute-specific features. We demonstrate the effectiveness of our method on different facial attributes, including hair color and style, expression, and others. Our source code and models will be publicly released.


# 63
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now

Yimeng Zhang · jinghan jia · Xin Chen · Aochuan Chen · Yihua Zhang · Jiancheng Liu · Ke Ding · Sijia Liu

The recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigate the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic classification abilities of DMs to simplify the creation of adversarial prompts, thereby eliminating the need for auxiliary classification or diffusion models. Through extensive benchmarking, we evaluate the robustness of five widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable concepts, styles, or objects) across a variety of tasks. Our results demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the state-of-the-art adversarial prompt generation method and reveal the lack of robustness of current safety-driven unlearning techniques when applied to DMs.


# 67
Strong Double Blind
The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations

Anselm Haselhoff · Kevin Trelenberg · Fabian Küppers · Jonas Schneider

Visual counterfactual explanation (CF) methods modify image concepts, e.g., shape, to change a prediction to a predefined outcome while closely resembling the original query image. Unlike self-explainable models (SEMs) and heatmap techniques, they grant users the ability to examine hypothetical "what-if" scenarios. Previous CF methods either entail post-hoc training, limiting the balance between transparency and CF quality, or demand optimization during inference. To bridge the gap between transparent SEMs and CF methods, we introduce the GdVAE, a self-explainable model based on a conditional variational autoencoder (CVAE), featuring a Gaussian discriminant analysis (GDA) classifier and integrated CF explanations. Full transparency is achieved through a generative classifier that leverages class-specific prototypes for the downstream task and a closed-form solution for CFs in the latent space. The consistency of CFs is improved by regularizing the latent space with the explainer function. Extensive comparisons with existing approaches affirm the effectiveness of our method in producing high-quality CF explanations while preserving transparency.
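
The closed-form counterfactual mentioned above can be illustrated for a two-class Gaussian discriminant with shared covariance: the decision function is linear in the latent code, so a latent can be shifted along the discriminant direction until it reaches a desired logit. The sketch below is a generic illustration under those assumptions (equal priors, shared covariance); variable names are not the paper's notation.

```python
# Closed-form latent counterfactual for a shared-covariance, two-class GDA classifier.
import numpy as np

def gda_counterfactual(z, mu0, mu1, sigma, target_logit=2.0):
    w = np.linalg.solve(sigma, mu1 - mu0)      # linear discriminant direction
    b = -0.5 * (mu0 + mu1) @ w                 # bias term under equal class priors
    logit = w @ z + b                          # >0 means class 1
    # shift z along w so the classifier's logit lands exactly on target_logit
    z_cf = z + (target_logit - logit) / (w @ w) * w
    return z_cf                                # decode z_cf to obtain the counterfactual image
```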


# 73
Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution

Fengyuan Liu · Haochen Luo · Yiming Li · Philip Torr · Jindong Gu

Recent progress in visual generative models enables the generation of high-quality images. To prevent the misuse of generated images, it is important to identify the origin model that generates them. In this work, we study the origin attribution of generated images in a practical setting where only a few images generated by a source model are available and the source model cannot be accessed. The goal is to check if a given image is generated by the source model. We first formulate this problem as a few-shot one-class classification task. To solve the task, we propose OCC-CLIP, a CLIP-based framework for few-shot one-class classification, enabling the identification of an image's source model, even among multiple candidates. Extensive experiments corresponding to various generative models verify the effectiveness of our OCC-CLIP framework. Furthermore, an experiment based on the recently released DALL·E-3 API verifies the real-world applicability of our solution.
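
A stripped-down version of few-shot one-class attribution can be written as a centroid-plus-threshold rule over frozen image embeddings (e.g., CLIP features computed elsewhere). This is a hedged simplification of the framework above, not its actual training procedure.

```python
# Few-shot one-class attribution on precomputed embeddings (toy baseline sketch).
import torch
import torch.nn.functional as F

def fit_one_class(support_feats):                       # (k, d) features of the few source-model images
    center = F.normalize(support_feats.mean(dim=0), dim=0)
    sims = F.normalize(support_feats, dim=1) @ center
    return center, sims.min().item()                    # loosest support similarity as threshold

def from_source_model(query_feat, center, threshold):
    return (F.normalize(query_feat, dim=0) @ center).item() >= threshold
```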


# 57
Strong Double Blind
DomainFusion: Generalizing To Unseen Domains with Latent Diffusion Models

Yuyang Huang · Yabo Chen · Yuchen Liu · Xiaopeng Zhang · Wenrui Dai · Hongkai Xiong · Qi Tian

Latent Diffusion Models (LDMs) are powerful and promising tools for facilitating generation-based methods for domain generalization. However, existing diffusion-based DG methods are restricted to offline augmentation using LDM and suffer from degraded performance and prohibitive computational costs. To address these challenges, we propose DomainFusion to simultaneously achieve knowledge extraction in the latent space and augmentation in the pixel space of the Latent Diffusion Model (LDM), exploiting LDM efficiently and thoroughly. We develop a Latent Distillation module that distills gradient priors from LDM to guide the optimization of DG models. Moreover, we design an online lightweight augmentation method that decomposes candidate images into styles and contents, enabling LDM to be used in a fast and online fashion. Experimental results demonstrate that DomainFusion outperforms diffusion-based methods by a large margin and achieves SOTA performance on existing DG benchmark datasets. Remarkably, DomainFusion can significantly reduce the number of generated images (e.g., by more than 97% on DomainNet) without finetuning LDM.


# 77
Strong Double Blind
AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation

Ri-Zhao Qiu · Yu-Xiong Wang · Kris Hauser

Text-to-image diffusion models have shown remarkable success in synthesizing photo-realistic images. Apart from creative applications, can we use such models to synthesize samples that aid the few-shot training of discriminative models? In this work, we propose AlignDiff, a general framework for synthesizing training images and masks for few-shot segmentation. We identify two crucial misalignments that arise when utilizing pre-trained diffusion models in segmentation tasks, which need to be addressed to create realistic training samples and align the synthetic data distribution with the real training distribution: 1) instance-level misalignment, where generated samples of rare categories are often misaligned with target tasks; and 2) annotation-level misalignment, where diffusion models are limited to generating images without pixel-level annotations. AlignDiff overcomes both challenges by leveraging a few real samples to guide the generation, thus improving novel IoU over baseline methods in few-shot segmentation and generalized few-shot segmentation on Pascal-5i and COCO-20i by up to 80%. Notably, AlignDiff is capable of augmenting the learning of out-of-distribution uncommon categories on FSS-1000, whereas a naive diffusion model generates samples that diminish segmentation performance.


# 55
Memory-Efficient Fine-Tuning for Quantized Diffusion Model

Hyogon Ryu · Seohyun Lim · Hyunjung Shim

The rise of billion-parameter diffusion models such as Stable Diffusion XL, Imagen, and Dall-E3 significantly propels the domain of generative AI. However, their large-scale architecture presents challenges in fine-tuning and deployment due to high resource demands and slow inference speed. This paper delves into the relatively unexplored yet promising realm of fine-tuning quantized diffusion models. Our analysis identifies that the baseline neglects the distinct patterns in model weights and the different roles of time steps when fine-tuning the diffusion model. To address these limitations, we introduce a novel memory-efficient fine-tuning framework directly applicable to quantized diffusion models, dubbed TuneQDM. Our approach introduces quantization scales as separable functions to account for inter-channel weight patterns and optimizes the scales in a time-step-specific manner to effectively reflect the role of each time step. TuneQDM demonstrates performance on par with its full-precision counterpart while offering a substantial advantage in memory efficiency. The experimental results demonstrate that our efficient framework consistently outperforms the baseline in single-/multi-subject generation, exhibiting high subject fidelity and prompt fidelity comparable to the full-precision model.
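
One plausible reading of "separable" scales is a per-channel factor multiplied by a per-timestep factor. The sketch below illustrates that reading with a fake-quantized linear layer whose scales are the only trainable parameters; the module structure and names are assumptions, not TuneQDM's actual design.

```python
# Fake-quantized linear layer with separable (per-channel x per-timestep) learnable scales.
import torch
import torch.nn as nn

class TimestepAwareQuantLinear(nn.Module):
    def __init__(self, weight, n_timesteps, n_bits=4):
        super().__init__()
        self.register_buffer("weight", weight)                  # frozen pre-trained weight
        self.qmax = 2 ** (n_bits - 1) - 1
        self.channel_scale = nn.Parameter(weight.abs().amax(dim=1) / self.qmax)  # per output channel
        self.step_scale = nn.Parameter(torch.ones(n_timesteps))                   # per diffusion timestep

    def forward(self, x, t):
        scale = (self.channel_scale * self.step_scale[t]).clamp_min(1e-8).unsqueeze(1)
        w = self.weight / scale
        # straight-through estimator so gradients reach the scales through the rounding step
        w_int = w + (torch.round(w).clamp(-self.qmax - 1, self.qmax) - w).detach()
        return x @ (w_int * scale).t()
```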


# 52
Strong Double Blind
SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow

Yuanzhi Zhu · Xingchao Liu · Qiang Liu

Diffusion and flow-based models nowadays have notable success in generating diverse and high-quality images. However, their iterative sampling process and substantial model size pose challenges for fast generation and downstream applications. In this paper, we propose SlimFlow, a new method for obtaining efficient one-step diffusion models as an extension of Rectified Flows. We first propose Annealing Rectifying to avoid the training of 1-flows, which yields 2-flow models directly from the data pairs generated by pre-trained diffusion models. Then we introduce a new distillation loss with additional supervision from the 2-flow models to obtain better one-step distilled flows. We illustrate the versatility of our method by applying it to various diffusion and flow-based models, including Rectified Flows and EDM, through the data pairs they generate. Through extensive examination of various model sizes and dataset choices, we demonstrate that our approach can significantly lower the number of parameters in the models while maintaining the quality of one-step image generation. Our method achieves an FID of 5.02 with 15.7M parameters and 4.53 with 27.9M parameters on CIFAR-10 32×32 in one-step generation. On the FFHQ 64×64 dataset, we record FIDs of 7.70 and 7.21 with 15.7M and 27.9M parameters, respectively. Additionally, on the ImageNet 64×64 dataset, our method secures an FID of 12.34 using only 80.7M parameters.


# 87
HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

Shen Zhang · Zhaowei CHEN · Zhenyu Zhao · Yuhao Chen · Yao Tang · Jiajun Liang

Diffusion models have become a mainstream approach for high-resolution image synthesis. However, directly generating higher-resolution images from pretrained diffusion models encounters unreasonable object duplication and exponentially increased generation time. In this paper, we discover that object duplication arises from feature duplication in the deep blocks of the U-Net. Concurrently, we pinpoint the extended generation time to self-attention redundancy in the U-Net's top blocks. To address these issues, we propose a tuning-free higher-resolution framework named HiDiffusion. Specifically, HiDiffusion contains a Resolution-Aware U-Net (RAU-Net) that dynamically adjusts the feature map size to resolve object duplication and a Modified Shifted Window Multi-head Self-Attention (MSW-MSA) that utilizes optimized window attention to reduce computation. We can integrate HiDiffusion into various pretrained diffusion models to scale image generation resolutions up to 4096×4096 at 1.5-6× the inference speed of previous methods. Extensive experiments demonstrate that our approach can address object duplication and heavy computation issues, achieving state-of-the-art performance on higher-resolution image synthesis tasks.


# 17
EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation

Nikolai Körber · Eduard Kromer · Andreas Siebert · Sascha Hauke · Daniel Mueller-Gritschneder · Björn Schuller

We introduce EGIC, an enhanced generative image compression method that allows traversing the distortion-perception curve efficiently from a single model. EGIC is based on two novel building blocks: i) OASIS-C, a conditional pre-trained semantic segmentation-guided discriminator, which provides both spatially and semantically aware gradient feedback to the generator, conditioned on the latent image distribution, and ii) Output Residual Prediction (ORP), a retrofit solution for multi-realism image compression that allows control over the synthesis process by adjusting the impact of the residual between an MSE-optimized and a GAN-optimized decoder output on the GAN-based reconstruction. Together, these components make EGIC a powerful codec that outperforms state-of-the-art diffusion- and GAN-based methods (e.g., HiFiC, MS-ILLM, and DIRAC-100), while performing almost on par with VTM-20.0 on the distortion end. EGIC is simple to implement, very lightweight, and provides excellent interpolation characteristics, which makes it a promising candidate for practical applications targeting the low bit range.


# 18
Diffusion for Natural Image Matting

Yihan Hu · Yiheng Lin · Wei Wang · Yao Zhao · Yunchao Wei · Humphrey Shi

Existing natural image matting algorithms inevitably have flaws in their predictions on difficult cases, and their one-step prediction manner cannot further correct these errors. In this paper, we investigate a multi-step iterative approach for the first time to tackle the challenging natural image matting task, and achieve excellent performance by introducing a pixel-level denoising diffusion method (DiffMatte) for alpha matte refinement. To improve iteration efficiency, we design a lightweight diffusion decoder as the only iterative component to directly denoise the alpha matte, saving the huge computational overhead of repeatedly encoding matting features. We also propose an ameliorated self-aligned strategy to consolidate the performance gains brought about by the iterative diffusion process. This allows the model to adapt to various types of errors by aligning the noisy samples used in training and inference, mitigating performance degradation caused by sampling drift. Extensive experimental results demonstrate that DiffMatte not only reaches the state-of-the-art level on the mainstream Composition-1k test set, surpassing the previous best methods by 8% and 15 in the SAD and MSE metrics respectively, but also shows stronger generalization ability on other benchmarks. The code will be open-sourced for future research and applications.


# 22
Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

Byeongjun Park · Hyojun Go · Jin-Young Kim · Sangmin Woo · Seokil Ham · Changick Kim

Diffusion models have achieved remarkable success across a range of generative tasks. Recent efforts to enhance diffusion model architectures have reimagined them as a form of multi-task learning, where each task corresponds to a denoising task at a specific noise level. While these efforts have focused on parameter isolation and task routing, they fall short of capturing detailed inter-task relationships and risk losing semantic information, respectively. In response, we introduce the Switch Diffusion Transformer (Switch-DiT), which establishes inter-task relationships between conflicting tasks without compromising semantic information. To achieve this, we employ a sparse mixture-of-experts within each transformer block to utilize semantic information and to handle task conflicts through parameter isolation. Additionally, we propose a diffusion prior loss, encouraging similar tasks to share their denoising paths while isolating conflicting ones. Through these, each transformer block contains a shared expert across all tasks, and the common and task-specific denoising paths enable the diffusion model to construct beneficial ways of synergizing denoising tasks. Extensive experiments validate the effectiveness of our approach in improving both image quality and convergence rate, and further analysis demonstrates that Switch-DiT constructs tailored denoising paths across various generation scenarios.
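
The shared-expert-plus-sparse-experts block can be sketched as below. This is a generic toy illustration of that structure (routing on token plus timestep embedding, top-k gating, an always-on shared expert), not Switch-DiT's exact block; for clarity the experts are evaluated densely here, whereas a real implementation dispatches each token only to its selected experts.

```python
# Toy sparse-MoE feed-forward with an always-active shared expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    def __init__(self, dim, hidden, n_experts, top_k=1):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.shared = make_ffn()                                   # common denoising path
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x, t_emb):                                   # x: (tokens, dim), t_emb: (dim,)
        gate = F.softmax(self.gate(x + t_emb), dim=-1)             # route on token + timestep embedding
        top_w, top_i = gate.topk(self.top_k, dim=-1)
        sparse_w = torch.zeros_like(gate).scatter(1, top_i, top_w) # zero weight for unselected experts
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        return self.shared(x) + (sparse_w.unsqueeze(-1) * expert_out).sum(dim=1)
```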


# 12
Strong Double Blind
MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

Yulin Ren · Xin Li · Bingchen Li · Xingrui Wang · Mengxi China Guo · Shijie Zhao · Li Zhang · Zhibo Chen

We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. It addresses two pivotal challenges in existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, MoE-DiffIR develops a powerful mixture-of-experts (MoE) prompt module, in which several basic prompts cooperate to excavate task-customized diffusion priors from Stable Diffusion (SD) for each compression task. Moreover, a degradation-aware routing mechanism is proposed to enable the flexible assignment of basic prompts. To activate and reuse the cross-modality generation prior of SD, we design a visual-to-text adapter for MoE-DiffIR, which adapts the embedding of low-quality images from the visual domain to the textual domain as textual guidance for SD, enabling more consistent and reasonable texture generation. We also construct a comprehensive benchmark dataset for universal CIR, covering 21 types of degradations from 7 popular traditional and learned codecs. Extensive experiments on universal CIR demonstrate the excellent robustness and texture restoration capability of our proposed MoE-DiffIR.


# 20
Strong Double Blind
TTT-MIM: Test-Time Training with Masked Image Modeling for Denoising Distribution Shifts

Youssef Mansour · Xuyang Zhong · Serdar Caglar · Reinhard Heckel

Neural networks trained end-to-end give state-of-the-art performance for image denoising. However, when applied to an image outside of the training distribution, the performance often degrades significantly. In this work, we propose a test-time training (TTT) method based on masked image modeling (MIM) to improve denoising performance for out-of-distribution images. The method, termed TTT-MIM, consists of a training stage and a test time adaptation stage. At training, we minimize a standard supervised loss and a self-supervised loss aimed at reconstructing masked image patches. At test-time, we minimize a self-supervised loss to fine-tune the network to adapt to a single noisy image. Experiments show that our method can improve performance under natural distribution shifts, in particular it adapts well to real-world camera and microscope noise. A competitor to our method of training and finetuning is to use a zero-shot denoiser that does not rely on training data. However, compared to state-of-the-art zero-shot denoisers, our method shows superior performance, and is much faster, suggesting that training and finetuning on the test instance is a more efficient approach to image denoising than zero-shot methods in setups where little to no data is available.
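
The test-time stage described above (fine-tuning the denoiser on a single noisy image with a masked-reconstruction objective) can be sketched roughly as follows. Assumptions: a denoiser `model` pre-trained with the same self-supervised loss, and placeholder values for patch size, step count, learning rate, and mask ratio.

```python
# Rough sketch of test-time adaptation on one noisy image via masked-patch reconstruction.
import torch
import torch.nn.functional as F

def test_time_adapt(model, noisy, steps=20, lr=1e-5, mask_ratio=0.3, patch=8):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        b, c, h, w = noisy.shape
        grid = torch.rand(b, 1, h // patch, w // patch, device=noisy.device) < mask_ratio
        mask = F.interpolate(grid.float(), scale_factor=patch, mode="nearest")
        recon = model(noisy * (1 - mask))                              # reconstruct the masked patches
        loss = ((recon - noisy) ** 2 * mask).sum() / mask.sum().clamp_min(1.0)
        opt.zero_grad(); loss.backward(); opt.step()
    return model(noisy)                                                # denoise with the adapted model
```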


# 14
Strong Double Blind
Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration

Chu Jie Qin · Ruiqi Wu · Zikun Liu · Xin Lin · Chun-Le Guo · Hyun Hee Park · Chongyi Li

All-in-one image restoration aims to handle multiple degradation types with one model. We propose a simple pipeline for all-in-one blind image restoration to Restore Anything with Masks (RAM). We focus on the image content itself, utilizing MIM to extract intrinsic image information rather than distinguishing degradation types like other methods. Our pipeline consists of two stages: masked image pre-training and fine-tuning with mask attribute conductance. We design a simple masking pre-training approach tailored to all-in-one image restoration, encouraging networks to focus on extracting image content priors from any degradation, which yields more balanced performance across restoration tasks and stronger overall results. To bridge the gap in input integrity while preserving the learned image priors as much as possible, we selectively fine-tune a small portion of the layers. Specifically, the importance of each layer is ranked by the proposed Mask Attribute Conductance (MAC), and the layers with higher contributions are selected for fine-tuning. Extensive quantitative and qualitative experiments demonstrate that our method achieves state-of-the-art performance. Our code and model will be released.


# 10
Strong Double Blind
Confidence-Based Iterative Generation for Real-World Image Super-Resolution

Jialun Peng · Xin Luo · Jingjing Fu · Dong Liu

Real-world image super-resolution deals with complex and unknown degradations, making it challenging to produce plausible results in a single step. In this work, we propose a transformer model with an iterative generation process that iteratively refines the results based on predicted confidences. It allows the model to focus on regions with low confidences and generate more confident and accurate results. Specifically, our model learns to predict the visual tokens of the high-resolution image and their corresponding confidence scores, conditioned on the low-resolution image. By keeping only the most confident tokens at each iteration and re-predicting the other tokens in the next iteration, our model generates all high-resolution tokens within a few steps. To ensure consistency with the low-resolution input image, we further propose a conditional controlling module that utilizes the low-resolution image to control the decoding process from high-resolution tokens to image pixels. Experiments demonstrate that our model achieves state-of-the-art performance on real-world datasets while requiring fewer iteration steps compared to recent diffusion models.
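
The "keep the most confident tokens, re-predict the rest" loop can be illustrated with a short decoding sketch. It is a hedged toy version of the idea: the `token_model(lr_image, tokens)` interface and the `MASK_ID` placeholder are assumptions, and the commit schedule is simply linear.

```python
# Confidence-based iterative token decoding (illustrative sketch).
import torch

MASK_ID = -1  # placeholder id for not-yet-committed tokens

def iterative_decode(token_model, lr_image, n_tokens, steps=8):
    tokens = torch.full((n_tokens,), MASK_ID, dtype=torch.long)
    for s in range(steps):
        probs = token_model(lr_image, tokens).softmax(-1)            # (n_tokens, vocab)
        conf, pred = probs.max(-1)
        conf = torch.where(tokens == MASK_ID, conf, torch.ones_like(conf))  # committed tokens stay
        n_keep = int(n_tokens * (s + 1) / steps)                     # commit progressively more tokens
        keep = conf.topk(n_keep).indices
        new_tokens = torch.full_like(tokens, MASK_ID)
        new_tokens[keep] = torch.where(tokens[keep] == MASK_ID, pred[keep], tokens[keep])
        tokens = new_tokens
    return tokens                                                    # decoded to HR pixels downstream
```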


# 13
Strong Double Blind
Efficient Frequency-Domain Image Deraining with Contrastive Regularization

Ning Gao · xingyu jiang · Xiuhui Zhang · Yue Deng

Most current single image deraining (SID) methods are based on the Transformer, which brings global modeling capabilities that are critical for high-quality reconstruction. However, their architectures only construct long-range dependencies in the spatial domain, which incurs a significant computational burden to remain effective. Besides, these methods either overlook negative-sample information in the optimization pipeline or underutilize the rain streak characteristics present in the negative samples. To tackle these problems, we propose a Frequency-Aware Deraining Transformer Framework (FADformer) that fully captures frequency-domain features to achieve efficient rain removal. Specifically, we first construct the FADBlock, which comprises the Fused Fourier Convolution Mixer (FFCM) and the Prior-Gated Feed-forward Network (PGFN). Unlike self-attention mechanisms, the FFCM exclusively conducts convolution operations in both the spatial and frequency domains, endowing it with local-global capturing capabilities and efficiency. Simultaneously, the PGFN introduces a residue channel prior in a gate-controlled manner to enhance local details while retaining the structure of features. Furthermore, we introduce a Frequency-domain Contrastive Regularization (FCR) during the training phase. The FCR facilitates contrastive learning in the frequency domain, enhancing the contribution of rain streak patterns in negative samples to improve performance. Extensive experiments on synthetic and real-world datasets show that the proposed method significantly outperforms state-of-the-art approaches. We will release the code soon after the paper is accepted.


# 19
Strong Double Blind
Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding

Jiangtao Zhang · Zongsheng Yue · Hui Wang · Qian Zhao · Deyu Meng

Blind image deconvolution (BID) is a classic yet challenging problem in the field of image processing. Recent advances in deep image prior (DIP) have motivated a series of DIP-based approaches, demonstrating remarkable success in BID, particularly in scenarios with motion blurring. However, due to the high non-convexity of the inherent optimization process, these methods are notorious for their sensitivity to the initialized kernel. To alleviate this issue and further improve their performance, we propose a new framework for BID that better considers the prior modeling and the initialization for blur kernels, in particular of motion blur, leveraging a deep generative model. The proposed approach pre-trains a generative adversarial network-based kernel generator that aptly characterizes the kernel priors and a kernel initializer that facilitates a well-informed initialization for the blur kernel through latent space encoding. With the pre-trained kernel generator and initializer, one can obtain a high-quality initialization of the blur kernel, and enable optimization within a compact latent kernel manifold. Such a framework results in an evident performance improvement over existing DIP-based BID methods. Extensive experiments on different datasets demonstrate the effectiveness of the proposed method. Notably, it achieves state-of-the-art performance on the challenging benchmark by Lai et al.


# 313
Strong Double Blind
SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging

Lingtong Kong · Bo Li · Yike Xiong · Hao Zhang · Hong Gu · Jinwei Chen

Reconstructing a High Dynamic Range (HDR) image from multiple Low Dynamic Range (LDR) images with different exposures is a challenging task when facing truncated texture and complex motion. Existing deep learning-based methods have achieved great success by either following the alignment-and-fusion pipeline or utilizing attention mechanisms. However, their large computation cost and inference delay hinder deployment on resource-limited devices. In this paper, to achieve better efficiency, a novel Selective Alignment Fusion Network (SAFNet) for HDR imaging is proposed. After extracting pyramid features, it jointly refines valuable area masks and cross-exposure motion in selected regions with shared decoders, and then fuses a high-quality HDR image in an explicit way. This approach focuses the model on finding valuable regions while estimating their easily detectable and meaningful motion. For further detail enhancement, a lightweight refine module is introduced which benefits from the previous optical flow, selection masks, and initial prediction. Moreover, to facilitate learning on samples with large motion, a new window-partition cropping method is presented during training. Experiments on public and newly developed challenging datasets show that the proposed SAFNet not only exceeds previous SOTA competitors quantitatively and qualitatively, but also runs an order of magnitude faster. Code and dataset will be available upon publication.


# 9
Strong Double Blind
Rethinking Image Super Resolution from Training Data Perspectives

Go Ohtani · Ryu Tadokoro · Ryosuke Yamada · Yuki M Asano · Iro Laina · Christian Rupprecht · Nakamasa Inoue · Rio Yokota · Hirokatsu Kataoka · Yoshimitsu Aoki

In this work, we investigate the understudied effect of the training data used for image super-resolution (SR). Most commonly, novel SR methods are developed and benchmarked on common training datasets such as DIV2K and DF2K. However, we investigate and rethink the training data from the perspectives of diversity and quality, thereby addressing the question of "How important is the training data for SR models?". To this end, we propose an automated image evaluation pipeline. With this, we stratify existing high-resolution image datasets and larger-scale image datasets such as ImageNet and PASS to compare their performances. We find that datasets with (i) low compression artifacts, (ii) high within-image diversity as judged by the number of different objects, and (iii) a large number of images, as in ImageNet or PASS, all positively affect SR performance. We hope that the proposed simple-yet-effective dataset curation pipeline will inform the construction of SR datasets in the future and yield overall better models.


# 5
Strong Double Blind
Accelerating Image Super-Resolution Networks with Pixel-Level Classification

Jinho Jeong · Jinwoo Kim · Younghyun Jo · Seon Joo Kim

Single Image Super-Resolution (SISR) plays a vital role in various applications, driven by advancements in Deep Neural Networks (DNNs). However, increasing model complexity raises computational costs, necessitating efficient solutions. Existing patch-based approaches aiming at efficient SR encounter challenges in adapting to varying pixel difficulties and suffer from decreased efficiency with larger patches. To address these limitations, we propose the Pixel-level Classifier for Single Image Super-Resolution (PCSR), a novel method that distributes computation at the pixel level for efficient SISR. Our approach optimizes computational resource allocation at the pixel level, achieving better efficiency than patch-based methods, provides user tunability without re-training, and mitigates artifacts through post-processing techniques. Experimental results demonstrate the effectiveness of PCSR across diverse SISR models and benchmarks, surpassing existing approaches in terms of the PSNR-FLOPs trade-off.
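
Pixel-level routing between a cheap and an expensive branch, with a user-tunable hard-pixel ratio, can be sketched as below. The module names, the thresholding rule, and the difficulty classifier are placeholders; for clarity the heavy branch is evaluated everywhere here, whereas an efficient implementation would run it only on the selected hard pixels.

```python
# Toy pixel-routed super-resolution: easy pixels from a light branch, hard pixels from a heavy one.
import torch
import torch.nn as nn

class PixelRoutedSR(nn.Module):
    def __init__(self, light_branch, heavy_branch, classifier):
        super().__init__()
        self.light, self.heavy, self.classifier = light_branch, heavy_branch, classifier

    def forward(self, lr, hard_ratio=0.3):
        easy = self.light(lr)                                      # cheap prediction for every pixel
        diff = self.classifier(lr)                                 # (b, 1, H, W) per-pixel difficulty
        thresh = torch.quantile(diff.flatten(1), 1 - hard_ratio, dim=1).view(-1, 1, 1, 1)
        hard_mask = (diff >= thresh).float()                       # user-tunable compute/quality knob
        hard = self.heavy(lr)                                      # in practice, computed only where needed
        return easy * (1 - hard_mask) + hard * hard_mask
```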


# 7
Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks

Cheeun Hong · Kyoung Mu Lee

Quantization is a promising approach to reduce the high computational complexity of image super-resolution (SR) networks. However, low-bit quantization leads to severe accuracy loss in SR networks compared to high-level tasks such as image classification. This is because the feature distributions of SR networks diverge significantly across channels and input images, making it difficult to determine a quantization range. Existing SR quantization works approach this distribution mismatch problem by dynamically adapting quantization ranges to the varying distributions at test time. However, such dynamic adaptation incurs additional computational costs that limit the benefits of quantization. Instead, we propose a new quantization-aware training framework that effectively overcomes the distribution mismatch problem in SR networks without the need for dynamic adaptation. Intuitively, the mismatch can be reduced by directly regularizing the distance between the feature to be quantized and the quantization grids during training. However, we observe that mismatch regularization can collide with the reconstruction loss during training and adversely affect SR accuracy. Thus, we avoid the conflict between the two losses by regularizing the mismatch only when the gradients of the mismatch regularization are cooperative with those of the reconstruction loss. Additionally, we introduce a layer-wise weight-clipping correction scheme to find a better quantization range for the weights of each layer. Experimental results show that our algorithm effectively reduces the distribution mismatch, achieving state-of-the-art performance with minimal computational overhead. Our code will be released.
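
The "apply the regularizer only when its gradients cooperate" idea can be illustrated with a per-step gradient dot-product test. This is a hedged simplification of the training scheme described above, not the paper's exact criterion or granularity.

```python
# Apply the mismatch regularizer only when its gradient agrees with the reconstruction gradient.
import torch

def cooperative_step(model, optimizer, recon_loss, mismatch_loss):
    params = [p for p in model.parameters() if p.requires_grad]
    g_rec = torch.autograd.grad(recon_loss, params, retain_graph=True)
    g_mis = torch.autograd.grad(mismatch_loss, params, retain_graph=True, allow_unused=True)
    dot = sum((a * b).sum() for a, b in zip(g_rec, g_mis) if b is not None)
    optimizer.zero_grad()
    for p, gr, gm in zip(params, g_rec, g_mis):
        p.grad = gr + (gm if (gm is not None and dot > 0) else 0)   # drop the regularizer when it conflicts
    optimizer.step()
```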


# 6
Strong Double Blind
Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model

Zhening Liu · XINJIE ZHANG · Jiawei Shao · Zehong Lin · Jun Zhang

With the rapid advancement of stereo vision technologies, stereo image compression has emerged as a crucial field that continues to draw significant attention. Previous approaches have primarily employed a unidirectional paradigm, where the compression of one view is dependent on the other, resulting in imbalanced compression. To address this issue, we introduce a symmetric bidirectional stereo image compression architecture, named BiSIC. Specifically, we propose a 3D convolution based codec backbone to capture local features and incorporate bidirectional attention blocks to exploit global features. Moreover, we design a novel cross-dimensional entropy model that integrates various conditioning factors, including the spatial context, channel context, and stereo dependency, to effectively estimate the distribution of latent representations for entropy coding. Extensive experiments demonstrate that our proposed BiSIC outperforms conventional image/video compression standards, as well as state-of-the-art learning-based methods, in terms of both PSNR and MS-SSIM.


# 2
Strong Double Blind
Uncertainty-Driven Spectral Compressive Imaging with Spatial-Frequency Transformer

Lintao Peng · Siyu Xie · Liheng Bian

Recently, learning-based hyperspectral image (HSI) reconstruction methods have demonstrated promising performance and dominated the mainstream research direction. However, existing learning-based methods still have two issues: 1) they do not jointly consider the spatial sparsity and the inter-spectral similarity priors of HSI; and 2) they treat all regions equally, ignoring that texture-rich and edge regions are more difficult to reconstruct than smooth regions. To address these issues, we propose an uncertainty-driven HSI reconstruction method termed Specformer. Specifically, we first introduce a frequency-wise self-attention (FWSA) and combine it with a spatial-wise local-window self-attention (LWSA) in a parallel design to form a Spatial-Frequency (SF) block. LWSA guides the network to focus on regions with dense spectral information, and FWSA captures the inter-spectral similarity. The parallel design helps the network model cross-window connections and expand its receptive field while maintaining linear complexity. We use the SF block as the main building block in a multi-scale U-shaped network to form our Specformer. In addition, we introduce an uncertainty-driven self-adaptive loss function, which reinforces the network's attention to challenging regions with rich textures and edges. Comprehensive experiments show that our Specformer significantly outperforms state-of-the-art methods on simulated and real HSI datasets while requiring lower computational and memory costs. The code will be publicly available.


# 1
Strong Double Blind
Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing

Seongmin Hong · Jaehyeok Bae · Jongho Lee · Se Young Chun

Compressed sensing (CS) has emerged to overcome the inefficiency of Nyquist sampling. However, traditional optimization-based reconstruction is slow and cannot yield an exact image in practice. Deep learning-based reconstruction has been a promising alternative to optimization-based reconstruction, outperforming it in accuracy and computation speed. Finding an efficient sampling method with deep learning-based reconstruction, especially for Fourier CS, remains a challenge. Existing works on joint optimization of sampling and reconstruction (H1) optimize the sampling mask but have limited potential because the mask is not adaptive to each data point. Adaptive sampling (H2) also has the disadvantages of difficult optimization and Pareto sub-optimality. Here, we propose a novel adaptive selection of sampling-reconstruction (H1.5) framework that selects the best sampling mask and reconstruction network for each input data point. We provide theorems showing that our method has a higher potential than H1 and effectively solves the Pareto sub-optimality problem in sampling-reconstruction by using separate reconstruction networks for different sampling masks. To select the best sampling mask, we propose to quantify the high-frequency Bayesian uncertainty of the input using a super-resolution space generation model. Our method outperforms joint optimization of sampling-reconstruction (H1) and adaptive sampling (H2), achieving significant improvements on several Fourier CS problems.
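
At a high level, the selection step pairs each candidate sampling mask with its own reconstruction network and picks the pair that best matches the input's uncertainty. The sketch below is a deliberately abstract illustration of that flow; `residual_uncertainty` and `acquire` are black-box assumptions, not the paper's actual components.

```python
# Abstract sketch of adaptive selection among (mask, reconstruction-network) pairs.
def select_and_reconstruct(x_low, candidates, residual_uncertainty, acquire):
    # candidates: list of (sampling_mask, recon_net) pairs trained jointly;
    # residual_uncertainty(x_low, mask): how much input uncertainty the mask leaves unsampled.
    scores = [residual_uncertainty(x_low, mask) for mask, _ in candidates]
    mask, recon_net = candidates[min(range(len(candidates)), key=scores.__getitem__)]
    return recon_net(acquire(mask))      # reconstruct with the network paired to the chosen mask
```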


# 306
Strong Double Blind
Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers

Zhao Yutian · Tianjing Zhang · Hui JI

Image reconstruction from incomplete measurements is a basic task in medical imaging. While supervised deep learning proves to be a powerful tool for image reconstruction, it demands a substantial number of latent images for training. To extend the application of deep learning to medical imaging where collecting latent images poses challenges, this paper introduces an unsupervised test-time adaptation approach. The proposed approach leverages a model pre-trained on an external dataset and efficiently adapts it to each test sample so that the model performs optimally on each specific sample. Model adaptation is done by introducing an unrolling network with additional lightweight adaptive linear convolution layers, enabling efficient alignment of test samples with the distribution targeted by the pre-trained model. This approach is inspired by the connection between linear convolutional networks and Wiener filtering. Extensive experiments show a significant performance gain of the proposed method over other unsupervised methods and model adaptation techniques in two medical imaging tasks.


# 300
RadEdit: stress-testing biomedical vision models via diffusion image editing

Fernando Pérez-García · Sam Bond-Taylor · Pedro Sanchez · Boris van Breugel · Daniel Coelho de Castro · Harshita Sharma · Valentina Salvatelli · Maria Teodora A Wetscherek · Hannah CM Richardson · Lungren Matthew · Aditya Nori · Javier Alvarez-Valle · Ozan Oktay · Maximilian Ilse

Biomedical imaging datasets are often small and biased, meaning that real-world performance of predictive models can be substantially lower than expected from internal testing. This work proposes using generative image editing to simulate dataset shifts and diagnose failure modes of biomedical vision models; this can be used in advance of deployment to assess readiness, potentially reducing cost and patient harm. Existing editing methods can produce undesirable changes, with spurious correlations learned due to the co-occurrence of disease and treatment interventions, limiting practical applicability. To address this, we train a text-to-image diffusion model on multiple chest X-ray datasets and introduce a new editing method, RadEdit, that uses multiple image masks, if present, to constrain changes and ensure consistency in the edited images, minimising bias. We consider three types of dataset shifts: acquisition shift, manifestation shift, and population shift, and demonstrate that our approach can diagnose failures and quantify model robustness without additional data collection, complementing more qualitative tools for explainable AI.


# 8
Strong Double Blind
Rate-Distortion-Cognition Controllable Versatile Neural Image Compression

Jinming Liu · Ruoyu Feng · Yunpeng Qi · Qiuyu Chen · Zhibo Chen · Wenjun Zeng · Xin Jin

Recently, the field of Image Coding for Machines (ICM) has garnered heightened interest and significant advances thanks to the rapid progress of learning-based techniques for image compression and analysis. Previous studies often require training separate codecs to support various bitrate levels, machine tasks, and networks, thus lacking both flexibility and practicality. To address these challenges, we propose a rate-distortion-cognition controllable versatile image compression method, which allows users to adjust the bitrate (i.e., Rate), image reconstruction quality (i.e., Distortion), and machine task accuracy (i.e., Cognition) with a single neural model, achieving ultra-controllability. Specifically, we first introduce a cognition-oriented loss in the primary compression branch to train a codec for diverse machine tasks. This branch attains variable bitrate by regulating the quantization degree through the latent code channels. To further enhance the quality of the reconstructed images, we employ an auxiliary branch to supplement residual information with a scalable bitstream. Ultimately, the two branches use a $\beta x + (1 - \beta) y$ interpolation strategy to achieve a balanced cognition-distortion trade-off. Extensive experiments demonstrate that our method yields satisfactory ICM performance and flexible Rate-Distortion-Cognition control.


# 4
Strong Double Blind
Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design

Li · zhihao shu · Jie Ji · Minghai Qin · Fatemeh Afghah · Wei Niu · Xiaolong Ma

Deep neural networks (DNNs) are frequently employed in a variety of computer vision applications. Nowadays, an emerging trend in video distribution systems is to take advantage of DNNs' overfitting properties to perform video resolution upscaling. By splitting videos into chunks and applying a super-resolution (SR) model to overfit each chunk, this scheme of SR models plus video chunks is able to replace traditional video transmission to enhance video quality and transmission efficiency. However, many models and chunks are needed to guarantee high performance, which leads to tremendous overhead from model switching and memory footprints at the user end. To resolve such problems, we propose a Dynamic Deep neural network assisted by a Content-Aware data processing pipeline that reduces the number of models to one (Dy-DCA), which helps promote performance while conserving computational resources. Additionally, to achieve real acceleration on the user end, we design a framework that optimizes dynamic features (e.g., dynamic shapes, sizes, and control flow) in Dy-DCA to enable a series of compilation optimizations, including fused code generation, static execution planning, etc. By employing such techniques, our method achieves better PSNR and real-time performance (33 FPS) on an off-the-shelf mobile phone. Meanwhile, assisted by our compilation optimization, we achieve a 1.7× speedup while saving up to 1.61× memory consumption.


# 179
Strong Double Blind
Fast Encoding and Decoding for Implicit Video Representation

Hao Chen · Saining Xie · Ser-Nam Lim · Abhinav Shrivastava

Despite the abundant availability and content richness of video data, its high dimensionality poses challenges for video research. Recent advancements have explored implicit representations for videos using deep neural networks, demonstrating strong performance in applications such as video compression and enhancement. However, the prolonged encoding time remains a persistent challenge for video Implicit Neural Representations (INRs). In this paper, we focus on improving the speed of video encoding and decoding within implicit representations. We introduce two key components: NeRV-Enc, a transformer-based hyper-network for fast encoding, and NeRV-Dec, an efficient video loader designed to streamline video research. NeRV-Enc achieves an impressive speed-up by eliminating gradient-based optimization. Meanwhile, NeRV-Dec simplifies video decoding, outperforming conventional codecs in loading speed and surpassing RAM loading of pre-decoded videos while being smaller in size.


# 15
Implicit Steganography Beyond the Constraints of Modality

Sojeong Song · Seoyun Yang · Chang Yoo · Junmo Kim

Cross-modal steganography is committed to hiding secret information of one modality in another modality. Despite the advances brought to steganography by deep learning, cross-modal steganography still remains a challenge for the field. The incompatibility between different modalities not only complicates the hiding process but also results in increased vulnerability to detection. To rectify these limitations, we present INRSteg, an innovative cross-modal steganography framework based on Implicit Neural Representations (INRs). We introduce a novel network allocation framework with a masked parameter update, which facilitates hiding multiple data items and enables cross-modal hiding across image, audio, video, and 3D shapes. Moreover, we eliminate the necessity of training a deep neural network and therefore substantially reduce the memory and computational cost and avoid domain adaptation issues. To our knowledge, in the field of steganography, this is the first work to introduce diverse modalities for both the secret and cover data. Detailed experiments in extreme modality settings demonstrate the flexibility, security, and robustness of INRSteg.


# 56
Strong Double Blind
Certifiably Robust Image Watermark

Zhengyuan Jiang · Moyang Guo · Yuepeng Hu · Jinyuan Jia · Neil Zhenqiang Gong

Generative AI raises many societal concerns such as boosting disinformation and propaganda campaigns. Watermarking AI-generated content is a key technology to address these concerns and has been widely deployed in industry. However, watermarking is vulnerable to removal attacks and forgery attacks. In this work, we propose the first image watermarks with certified robustness guarantees against removal and forgery attacks. Our method leverages randomized smoothing, a popular technique to build certifiably robust classifiers and regression models. Our major technical contributions include extending randomized smoothing to watermarking by considering its unique characteristics, deriving the certified robustness guarantees, and designing algorithms to estimate them. Moreover, we extensively evaluate our image watermarks in terms of both certified and empirical robustness.
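
Randomized smoothing applied to watermark detection can be pictured with a simple vote-based sketch: decode the watermark from many Gaussian-noised copies and detect only when a confident majority of copies matches the embedded bits. The decoder interface, the bit-accuracy threshold, and the plain Hoeffding bound below are generic placeholders, standing in for the paper's certified analysis.

```python
# Smoothed watermark detection via majority vote over Gaussian-noised copies (generic sketch).
import math
import torch

def smoothed_detect(decode_bits, image, true_bits, sigma=0.1, n=1000, tau=0.8, delta=0.001):
    votes = 0
    for _ in range(n):
        noisy = image + sigma * torch.randn_like(image)
        bit_acc = (decode_bits(noisy) == true_bits).float().mean().item()
        votes += int(bit_acc >= tau)                      # this noisy copy looks watermarked
    p_hat = votes / n
    lower = p_hat - math.sqrt(math.log(1.0 / delta) / (2.0 * n))   # lower confidence bound on the vote
    return lower > 0.5                                    # detect only if the majority is statistically confident
```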


# 130
Strong Double Blind
DSA: Discriminative Scatter Analysis for Early Smoke Segmentation

Lujian Yao · Haitao Zhao · Jingchao Peng · Zhongze Wang · Kaijie Zhao

Early smoke segmentation (ESS) plays a crucial role in accurately locating the source of smoke, facilitating prompt fire rescue operations and gas leak detection. Unlike regular objects, which are typically rigid, opaque, and have clear boundaries, ESS presents challenges due to the large areas of high transparency in early smoke. This leads to a significant similarity between smoke features and the surrounding background features. The key solution is to obtain a discriminative embedding space. Some distance-based methods have pursued this goal by using specific loss functions (e.g., pair-based Triplet loss and proxy-based NCA loss) to constrain the feature extractor. In this paper, we propose a novel approach called discriminative scatter analysis (DSA). Instead of solely measuring Euclidean distance, DSA assesses the compactness and separation of the embedding space from a sample-scatter perspective. DSA is performed on both pixel-proxy scatter (IOS) and proxy-proxy scatter (OOS), and a unified loss function is designed to optimize the feature extractor. DSA can be easily integrated with regular segmentation methods. It is applied only during training and incurs no additional computational cost during inference. Extensive experiments have demonstrated that DSA can consistently improve the performance of various models in ESS.


# 21
Strong Double Blind
AdaIFL: Adaptive Image Forgery Localization via a Dynamic and Importance-aware Transformer Network

Yuxi Li · Fuyuan Cheng · Wangbo Yu · Guangshuo Wang · Guibo Luo · Yuesheng Zhu

The rapid development of image processing and manipulation techniques poses unprecedented challenges in multimedia forensics, especially in Image Forgery Localization (IFL). This paper addresses two key challenges in IFL: (1) Various forgery techniques leave distinct forensic traces. However, existing models overlook variations among forgery patterns. The diversity of forgery techniques makes it challenging for a single static detection method and network structure to be universally applicable. To address this, we propose AdaIFL, a dynamic IFL framework that customizes various expert groups for different network components, constructing multiple distinct feature subspaces. By leveraging adaptively activated experts, AdaIFL can capture discriminative features associated with forgery patterns, thereby enhancing the model's generalization ability. (2) Many forensic traces and artifacts are located at the boundaries of the forged region. Existing models either ignore the differences in discriminative information or use edge supervision loss to force the model to focus on the region boundaries. This hard-constrained approach is prone to attention bias, causing the model to be overly sensitive to image edges or fail to finely capture all forensic traces. To address this, we propose a feature importance-aware attention, a flexible approach that adaptively perceives the importance of different regions and aggregates region features into variable-length tokens, directing the model's attention towards more discriminative and informative regions. Extensive experiments on benchmark datasets demonstrate that AdaIFL outperforms state-of-the-art image forgery localization methods.


# 23
Strong Double Blind
DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Xinxu Ge · Xin Liu · Zitong Yu · Jingang Shi · Chun Qi · Jie Li · Heikki Kälviäinen

Face anti-spoofing (FAS) plays a vital role in preventing face recognition (FR) systems from presentation attacks. Nowadays, FAS systems face the challenge of domain shift, which impacts the generalization performance of existing FAS methods. In this paper, we rethink the nature of domain shift and deconstruct it into two factors: image style and image quality. Quality influences the purity of the presentation of spoof information, while style affects the manner in which spoof information is presented. Based on our analysis, we propose the DiffFAS framework, which quantifies quality as prior information input into the network to counter image-quality shift, and performs diffusion-based high-fidelity cross-domain and cross-attack-type generation to counter image-style shift. DiffFAS transforms easily collectible live faces into high-fidelity attack faces with precise labels while maintaining consistency between live and spoof face identities, which also alleviates the scarcity of labeled data with novel attack types faced by today's FAS systems. We demonstrate the effectiveness of our framework on challenging cross-domain and cross-attack FAS datasets, achieving state-of-the-art performance. The code will be released on GitHub.


# 25
Strong Double Blind
Face Reconstruction Transfer Attack as Out-of-Distribution Generalization

Yoon Gyo Jung · Jaewoo Park · Xingbo Dong · Hojin Park · Andrew Beng Jin Teoh · Octavia Camps

Understanding the vulnerability of face recognition system to malicious attacks is of critical importance. Previous works have focused on reconstructing face images that can penetrate a targeted verification system. Even in the white-box scenario, however, naively reconstructed images misrepresent the identity information, hence the attacks are easily neutralized once the face system is updated or changed. In this paper, we aim to reconstruct face images which are capable of transferring face attacks on unseen networks. We term this problem as Face Reconstruction Transfer Attack (FRTA) and show that it can be formulated as an out-of-distribution (OOD) generalization problem. Inspired by its OOD nature, we propose to solve FRTA by Averaged Latent Search and Unsupervised Validation with pseudo target (ALSUV). To strengthen the reconstruction attack on OOD unseen networks, ALSUV reconstructs the face by searching the latent of amortized generator StyleGAN2 through multiple latent optimization, latent optimization trajectory averaging, and unsupervised validation with a pseudo target. We demonstrate the efficacy and generalization of our method on widely used face datasets, accompanying it with extensive ablation studies and analyses visually, qualitatively, and quantitatively. The source code will be released.


# 30
Toward Tiny and High-quality Facial Makeup with Data Amplify Learning

Qiaoqiao Jin · Xuanhong Chen · Meiguang Jin · Ying Chen · Rui Shi · Yucheng Zheng · Yupeng Zhu · Bingbing Ni

Contemporary makeup approaches primarily hinge on unpaired learning paradigms, yet they grapple with the challenges of inaccurate supervision (e.g., face misalignment) and sophisticated facial prompts (including face parsing and landmark detection). These challenges prohibit low-cost deployment of facial makeup models, especially on mobile devices. To solve the above problems, we propose a brand-new learning paradigm, termed "Data Amplify Learning (DAL)," alongside a compact makeup model named "TinyBeauty." The core idea of DAL lies in employing a Diffusion-based Data Amplifier (DDA) to "amplify" limited images for model training, thereby enabling accurate pixel-to-pixel supervision with merely a handful of annotations. Two pivotal innovations in DDA facilitate this training approach: (1) a Residual Diffusion Model (RDM) is designed to generate high-fidelity detail and circumvent the detail-vanishing problem of vanilla diffusion models; (2) a Fine-Grained Makeup Module (FGMM) is proposed to achieve precise makeup control and combination while retaining face identity. Coupled with DAL, TinyBeauty requires merely 80K parameters to achieve state-of-the-art performance without intricate face prompts. Meanwhile, TinyBeauty achieves a remarkable inference speed of up to 460 fps on the iPhone 13. Extensive experiments show that DAL can produce highly competitive makeup models using only 5 image pairs.


# 40
Facial Affective Behavior Analysis with Instruction Tuning

Yifan Li · Anh Dao · Wentao Bao · Zhen Tan · Tianlong Chen · Huan Liu · Yu Kong

Facial affective behavior analysis (FABA) is crucial for understanding human mental states from images. However, traditional approaches primarily deploy models to discriminate among discrete emotion categories and lack the fine granularity and reasoning capability needed for complex facial behaviors. The advent of Multi-modal Large Language Models (MLLMs) has proven successful in general visual understanding tasks. However, directly harnessing MLLMs for FABA is challenging due to the scarcity of datasets and benchmarks, the neglect of facial prior knowledge, and low training efficiency. To address these challenges, we introduce (i) an instruction-following dataset for two FABA tasks, i.e., emotion and action unit recognition, (ii) a benchmark FABA-Bench with a new metric considering both recognition and generation ability, and (iii) a new MLLM, "EmoLA," as a strong baseline for the community. Our initiative on the dataset and benchmarks reveals the nature and rationale of facial affective behaviors, i.e., fine-grained facial movement, interpretability, and reasoning. Moreover, to build an effective and efficient FABA MLLM, we introduce a facial prior expert module with face structure knowledge and a low-rank adaptation module into the pre-trained MLLM. We conduct extensive experiments on FABA-Bench and four commonly-used FABA datasets. The results demonstrate that the proposed facial prior expert can boost performance and that EmoLA achieves the best results on our FABA-Bench. On commonly-used FABA datasets, EmoLA is competitive, rivaling task-specific state-of-the-art models.


# 196
Strong Double Blind
VideoClusterNet: Self-Supervised and Adaptive Face Clustering for Videos

Devesh Bilwakumar Walawalkar · Pablo Garrido

With the rise of digital media content production, the need for analyzing movies and TV series episodes to precisely locate the main cast of characters is gaining importance. Specifically, Video Face Clustering aims to group together detected video face tracks with common facial identities. This problem is very challenging due to the large range of pose, expression, appearance, and lighting variations of a given face across video frames. Generic pre-trained Face Identification (ID) models fail to adapt well to the video production domain, given its high dynamic range content and unique cinematic style. Furthermore, traditional clustering algorithms depend on hyperparameters requiring individual tuning across datasets. In this paper, we present a novel video face clustering approach that learns to adapt a generic face ID model to new video face tracks in a fully self-supervised fashion. We also propose a parameter-free clustering algorithm that is capable of automatically adapting to the fine-tuned model's embedding space for any input video. Due to the lack of comprehensive movie face clustering benchmarks, we also present a first-of-its-kind movie dataset: MovieFaceCluster. Our dataset is handpicked by film industry professionals and contains extremely challenging face ID scenarios. Experiments show our method's effectiveness in handling difficult mainstream movie scenes on our benchmark dataset and state-of-the-art performance on traditional TV series datasets.


# 266
When Do We Not Need Larger Vision Models?

Baifeng Shi · Ziyang Wu · Maolin Mao · Xin Wang · Trevor Darrell

Scaling up the size of vision models has been the de facto standard for obtaining more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S^2), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation. Notably, S^2 achieves state-of-the-art performance in detailed MLLM understanding on the V* benchmark, surpassing models such as GPT-4V. We examine the conditions under which S^2 is a preferred scaling approach compared to scaling model size. While larger models have the advantage of better generalization on hard examples, we show that features of larger vision models can be well approximated by those of multi-scale smaller models. This suggests that most, if not all, of the representations learned by current large pre-trained models can also be obtained from multi-scale smaller models. Our results confirm that a multi-scale smaller model has comparable learning capacity to a larger model, and show that pre-training smaller models with S^2 can match or even exceed the advantage of larger models.
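As a hedged illustration of the multi-scale mechanism this abstract describes, the sketch below runs a frozen backbone at its native resolution over tiles of progressively enlarged versions of the image, pools every scale back to a common spatial size, and concatenates the features channel-wise. The `backbone` callable, the scale set, and the pooling choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def s2_features(backbone, image, scales=(1, 2), base_size=224):
    """Run a frozen backbone at several scales and concatenate features channel-wise.

    backbone: maps a (N, 3, base_size, base_size) batch to (N, C, h, w) feature maps.
    image:    (B, 3, H, W) input batch.
    """
    feats, base_hw = [], None
    for s in scales:
        size = base_size * s
        x = F.interpolate(image, size=(size, size), mode="bilinear", align_corners=False)
        # Tile the enlarged image into s*s crops of the base size, so the frozen
        # backbone always sees its native input resolution.
        tiles = x.unfold(2, base_size, base_size).unfold(3, base_size, base_size)
        tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, base_size, base_size)
        with torch.no_grad():
            f = backbone(tiles)                              # (B*s*s, C, h, w)
        B = image.shape[0]
        C, h, w = f.shape[1:]
        f = f.reshape(B, s, s, C, h, w).permute(0, 3, 1, 4, 2, 5).reshape(B, C, s * h, s * w)
        if base_hw is None:
            base_hw = (h, w)                                 # feature resolution of the first scale
        feats.append(F.adaptive_avg_pool2d(f, base_hw))      # pool every scale to a common size
    return torch.cat(feats, dim=1)                           # (B, C * len(scales), h, w)

# Toy usage with a tiny stand-in "backbone" (a single frozen conv):
backbone = torch.nn.Conv2d(3, 64, kernel_size=16, stride=16).eval()
out = s2_features(backbone, torch.randn(2, 3, 224, 224))
print(out.shape)                                             # torch.Size([2, 128, 14, 14])
```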


# 264
Strong Double Blind
Open Panoramic Segmentation

Junwei Zheng · Ruiping Liu · Yufan Chen · Kunyu Peng · Chengzhi Wu · Kailun Yang · Jiaming Zhang · Rainer Stiefelhagen

Panoramic images, capturing a 360° field of view (FoV), encompass omnidirectional spatial information crucial for scene understanding. However, obtaining densely annotated panoramas sufficient for training is costly, and training models in a closed-vocabulary setting restricts their applicability. To tackle this problem, we define a new task termed Open Panoramic Segmentation (OPS), where models are trained on labeled, FoV-restricted pinhole images in an open-vocabulary setting and evaluated on FoV-open panoramic images, enabling zero-shot open panoramic segmentation. Moreover, we propose a model named OOOPS with a Deformable Adapter Network (DAN), which significantly improves zero-shot panoramic semantic segmentation performance. To further enhance the distortion-aware modeling ability learned from the pinhole source domain, we propose a novel data augmentation method called Random Equirectangular Projection (RERP), which is specifically designed to address object deformations in advance. Surpassing other state-of-the-art open-vocabulary semantic segmentation approaches, a remarkable performance boost on three panoramic datasets, WildPASS, Stanford2D3D, and Matterport3D, proves the effectiveness of our proposed OOOPS model with RERP on the OPS task, especially +2.2% on the outdoor WildPASS and +2.4% mIoU on the indoor Stanford2D3D. The source code will be made publicly available.


# 220
Strong Double Blind
PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking

Jiahuan Long · Tingsong Jiang · Wen Yao · Shuai Jia · Weijia Zhang · Weien Zhou · Chao Ma · Xiaoqian Chen

Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. Nevertheless, all of these are digital attacks that inject pixel-level noise into input images and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as new targets to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.


# 214
Strong Double Blind
Self-Supervised Any-Point Tracking by Contrastive Random Walks

Ayush Shrivastava · Andrew Owens

We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle-consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform "all pairs" comparisons between points allows the model to achieve high spatial precision and obtain a strong contrastive learning signal, while avoiding the complexities of recent approaches (such as coarse-to-fine matching). To do this, we propose a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions and propose a data augmentation scheme to address them. Our method achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods.
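The training signal described here can be sketched as follows: pairwise similarities between point descriptors in consecutive frames define row-stochastic transition matrices, the walk is chained forward and then backward through the clip, and a cross-entropy loss asks each point to return to itself. Feature extraction, the transformer, and the shortcut-avoiding augmentations are omitted; the temperature and toy dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def transition(feat_a, feat_b, temperature=0.07):
    """Row-stochastic transition matrix between two frames' point descriptors (N, D) -> (N, N)."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    return F.softmax(a @ b.t() / temperature, dim=-1)

def cycle_consistency_loss(frame_feats, temperature=0.07):
    """Walk forward through the frames, then backward, and ask each point to land on itself."""
    steps = list(zip(frame_feats[:-1], frame_feats[1:]))          # forward pairs (t, t+1)
    steps += [(b, a) for a, b in reversed(steps)]                 # backward pairs (t+1, t)
    walk = None
    for fa, fb in steps:
        step = transition(fa, fb, temperature)
        walk = step if walk is None else walk @ step
    targets = torch.arange(walk.shape[0], device=walk.device)     # each point should return to itself
    return F.nll_loss(torch.log(walk + 1e-8), targets)

# Toy usage: 4 frames, 32 tracked points, 128-d descriptors.
feats = [torch.randn(32, 128, requires_grad=True) for _ in range(4)]
loss = cycle_consistency_loss(feats)
loss.backward()
```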


# 228
WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing

Shuokang Huang · Kaihan Li · Di You · Yichong Chen · Arvin Lin · Siying Liu · Xiaohui Li · Julie A. McCann

WiFi-based human sensing has exhibited remarkable potential to analyze user behaviors in a non-intrusive and device-free manner, benefiting applications as diverse as smart homes and healthcare. However, most previous works focus on single-user sensing, which has limited practicability in scenarios involving multiple users. Although recent studies have begun to investigate WiFi-based multi-user sensing, there remains a lack of benchmark datasets to facilitate reproducible and comparable research. To bridge this gap, we present WiMANS, to our knowledge, the first dataset for multi-user sensing based on WiFi. WiMANS contains over 9.4 hours of dual-band WiFi Channel State Information (CSI), as well as synchronized videos, monitoring the simultaneous activities of multiple users. We exploit WiMANS to benchmark the performance of state-of-the-art WiFi-based human sensing models and video-based models, posing new challenges and opportunities for future work. We believe WiMANS can push the boundaries of current studies and catalyze the research on WiFi-based multi-user sensing.


# 59
Strong Double Blind
Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition

Lilang Lin · Lehong Wu · Jiahang Zhang · Jiaying Liu

Generative models, as a powerful technique for generation, have also gradually become a critical tool for recognition tasks. However, in skeleton-based action recognition, the features obtained from existing pre-trained generative methods still contain redundant information unrelated to recognition, which contradicts the skeleton's spatially sparse and temporally consistent nature and leads to undesirable performance. To address this challenge, we bridge the gap in theory and methodology and propose a novel skeleton-based idempotent generative model (IGM) for unsupervised representation learning. More specifically, we first theoretically demonstrate the equivalence between generative models and maximum entropy coding, which reveals a potential route for making the features of generative models more compact by introducing contrastive learning. To this end, we introduce an idempotency constraint that forms a stronger consistency regularization in the feature space, pushing the features to retain only the critical motion semantics needed for the recognition task. In addition, to avoid the dimensional collapse caused by the generative model's feature space being spanned by only principal components, we design an adapter to fuse the features from the encoder and generator complementarily, boosting the effective dimension of the feature space. This enriches the feature space, enabling it to capture more comprehensive information. Our extensive experiments on the benchmark datasets NTU RGB+D and PKUMMD demonstrate the effectiveness of our proposed method. On the NTU 60 xsub dataset, we observe a performance improvement from 84.6% to 86.2%. Furthermore, in zero-shot adaptation scenarios, our model demonstrates significant efficacy by achieving promising results in cases that were previously unrecognizable.
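The idempotency constraint mentioned above can be illustrated with a toy mapping f trained so that f(f(x)) stays close to f(x). This is only a schematic of the constraint on a flattened skeleton vector; the paper's IGM couples it with a generative encoder-decoder and an adapter, which are not reproduced here.

```python
import torch
import torch.nn as nn

class IdempotentMapper(nn.Module):
    """Toy network f: R^d -> R^d, regularised so that f(f(x)) stays close to f(x)."""
    def __init__(self, dim=150, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def idempotency_loss(f, x):
    fx = f(x)
    ffx = f(fx)
    # Pull f(f(x)) towards f(x); the stop-gradient keeps the first-pass output as a fixed target.
    return torch.mean((ffx - fx.detach()) ** 2)

f = IdempotentMapper()
skeleton = torch.randn(8, 150)          # e.g. a flattened 25-joint x 3-coordinate x 2-frame snippet
loss = idempotency_loss(f, skeleton)    # would be added alongside the usual generative/recognition losses
loss.backward()
```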


# 221
EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

Yuanming Li · Wei-Jin Huang · An-Lan Wang · Ling-An Zeng · Jing-Ke Meng · WEISHI ZHENG

We present EgoExo-Fitness, a new full-body action understanding dataset featuring fitness sequence videos recorded from synchronized egocentric and fixed exocentric (third-person) cameras. Compared with existing full-body action understanding datasets, EgoExo-Fitness not only contains videos from first-person perspectives but also provides rich annotations. Specifically, two-level temporal boundaries are provided to localize single action videos along with the sub-steps of each action. More importantly, EgoExo-Fitness introduces innovative annotations for interpretable action judgement, including technical keypoint verification, natural language comments on action execution, and action quality scores. Combining all of these, EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding across the dimensions of "what", "when", and "how well". To facilitate research on egocentric and exocentric full-body action understanding, we construct benchmarks on a suite of tasks (i.e., action recognition, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task of guidance-based execution verification), together with detailed analysis. Data and code are available at https://github.com/iSEE-Laboratory/EgoExo-Fitness/tree/main.


# 203
Strong Double Blind
Trajectory-aligned Space-time Tokens for Few-shot Action Recognition

Pulkit Kumar · Namitha Padmanabhan · Luke Luo · Sai Saketh Rambhatla · Abhinav Shrivastava

We propose a simple yet effective approach for few-shot action recognition, emphasizing the disentanglement of motion and appearance representations. By harnessing recent progress in tracking, specifically point trajectories, and self-supervised representation learning, we build trajectory-aligned tokens (TATs) that capture motion and appearance information. This approach significantly reduces the data requirements while retaining essential information. To process these representations, we use a Masked Space-time Transformer that effectively learns to aggregate information to facilitate few-shot action recognition. We demonstrate state-of-the-art results on few-shot action recognition across multiple datasets.


# 202
Strong Double Blind
ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos

Kang Hyolim · Jeongseok Hyun · Joungbin An · Youngjae Yu · Seon Joo Kim

Online Temporal Action Localization (On-TAL) is a critical task that aims to instantaneously identify action instances in untrimmed streaming videos as soon as an action concludes, a major leap from frame-based Online Action Detection (OAD). Yet, the challenge of detecting overlapping actions is often overlooked, even though it is a common scenario in streaming videos. Current methods that can address concurrent actions depend heavily on class information, limiting their flexibility. This paper introduces ActionSwitch, the first class-agnostic On-TAL framework capable of detecting overlapping actions. By obviating the reliance on class information, ActionSwitch provides wider applicability to various situations, including overlapping actions of the same class or scenarios where class information is unavailable. This approach is complemented by the proposed "conservativeness loss", which directly embeds a conservative decision-making principle into the loss function for On-TAL. ActionSwitch achieves state-of-the-art performance on complex datasets, including Epic-Kitchens 100, which targets the challenging egocentric view, and FineAction, which consists of fine-grained atomic actions. Code will be made available.


# 200
Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning

Sanjoy Kundu · Shubham Trehan · Sathyanarayanan Aakur

Learning to infer labels in an open world, i.e., in an environment where the target "labels" are unknown, is an important characteristic for achieving autonomy. Foundation models, pre-trained on enormous amounts of data, have shown remarkable generalization skills through prompting, particularly in zero-shot inference. However, their performance is restricted to the correctness of the target label's search space, i.e., candidate labels provided in the prompt. This target search space can be unknown or exceptionally large in an open world, severely restricting their performance. To tackle this challenging problem, we propose a two-step, neuro-symbolic framework called ALGO - Action Learning with Grounded Object recognition that uses symbolic knowledge stored in large-scale knowledge bases to infer activities in egocentric videos with limited supervision. First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video through evidence-based reasoning. Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework and learn to ground knowledge-based action (verb) concepts in the video. Extensive experiments on four publicly available datasets (EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego) demonstrate its performance on open-world activity inference. We also show that ALGO can be extended to zero-shot inference and demonstrate its competitive performance. Code and additional qualitative analysis are provided as part of the supplementary and will be publicly available after review.


# 224
Strong Double Blind
OMR: Occlusion-Aware Memory-Based Refinement for Video Lane Detection

Dongkwon Jin · Chang-Su Kim

A novel algorithm for video lane detection is proposed in this paper. First, we extract a feature map for a current frame and detect a latent mask for obstacles occluding lanes. Then, we enhance the feature map by developing an occlusion-aware memory-based refinement (OMR) module. It takes the obstacle mask and feature map from the current frame, previous output, and memory information as input, and processes them recursively in a video. Moreover, we apply a novel data augmentation scheme for training the OMR module effectively. Experimental results show that the proposed algorithm outperforms existing techniques on video lane datasets. The source codes will be made publicly available.


# 207
Improving Video Segmentation via Dynamic Anchor Queries

Yikang Zhou · Tao Zhang · Xiangtai Li · Shunping Ji · Shuicheng Yan

Modern video segmentation methods adopt object queries to perform inter-frame association and demonstrate satisfactory performance in tracking continuously appearing objects despite large-scale motion and transient occlusion. However, they all underperform on newly emerging and disappearing objects, which are common in the real world, because they attempt to model object emergence and disappearance through feature transitions between background and foreground queries that have significant feature gaps. We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries by dynamically generating anchor queries based on the features of potential candidates. Furthermore, we introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes the potential of DAQ without any additional cost. Finally, we combine our proposed DAQ and EDS with DVIS to obtain DVIS-DAQ. Extensive experiments demonstrate that DVIS-DAQ achieves new SOTA performance on five mainstream video segmentation benchmarks. Code and models will be available for further study.


# 201
VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement

Hanjung Kim · Jaehyun Kang · Miran Heo · Sukjun Hwang · Seoung Wug Oh · Seon Joo Kim

In recent years, online Video Instance Segmentation (VIS) methods have shown remarkable advancement with their powerful query-based detectors. Utilizing the output queries of the detector at the frame level, these methods achieve high accuracy on challenging benchmarks. However, our observations demonstrate that these methods heavily rely on location information, which often causes incorrect associations between objects. This paper shows that appearance information is a key axis of object matching in trackers, becoming especially instructive when positional cues are insufficient to distinguish identities. Therefore, we propose a simple yet powerful extension to object decoders that explicitly extracts embeddings from backbone features and drives queries to capture the appearances of objects, which greatly enhances instance association accuracy. Furthermore, recognizing the limitations of existing benchmarks in fully evaluating appearance awareness, we have constructed a synthetic dataset to rigorously validate our method. By effectively resolving the over-reliance on location information, we achieve state-of-the-art results on YouTube-VIS 2019/2021 and Occluded VIS (OVIS). Code is available at https://github.com/KimHanjung/VISAGE.
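As a generic, hedged sketch of why appearance cues help association: cosine similarity between appearance embeddings of previous tracks and current detections scores candidate pairs, and the Hungarian algorithm picks a one-to-one assignment. This illustrates the matching step only; it is not the paper's decoder extension, and the threshold is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_emb, curr_emb, sim_threshold=0.5):
    """Match previous-frame instances to current detections by appearance.

    prev_emb: (M, D) embeddings of tracked instances from the previous frame.
    curr_emb: (N, D) embeddings of current-frame detections.
    Returns (prev_idx, curr_idx) pairs whose cosine similarity clears the threshold.
    """
    p = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    c = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    sim = p @ c.T                                   # (M, N) cosine similarities
    rows, cols = linear_sum_assignment(-sim)        # Hungarian matching, maximising total similarity
    return [(r, k) for r, k in zip(rows, cols) if sim[r, k] >= sim_threshold]

matches = associate(np.random.randn(5, 64), np.random.randn(6, 64))
```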


# 222
Merlin: Empowering Multimodal LLMs with Foresight Minds

En Yu · liang zhao · YANA WEI · Jinrong Yang · Dongming Wu · Lingyu Kong · Haoran Wei · Tiancai Wang · Zheng Ge · Xiangyu Zhang · Wenbing Tao

Humans can foresee the future based on present observations, a skill we term foresight minds. However, this capability remains under-explored within existing MLLMs, hindering their capacity to understand the intentions behind subjects. To address this, we integrate future modeling into MLLMs. By utilizing the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. Inspired by the learning paradigm of LLMs, we first propose Foresight Pre-Training (FPT), which jointly learns various tasks centered on trajectories, enabling MLLMs to predict entire trajectories from a given initial observation. Then, we propose Foresight Instruction-Tuning (FIT), which requires MLLMs to reason about potential future events based on predicted trajectories. Aided by FPT and FIT, we build a unified MLLM named Merlin that supports complex future reasoning. Experiments show Merlin's foresight minds, with impressive performance on both future reasoning and visual comprehension tasks.


# 195
Strong Double Blind
STSP: Spatial-Temporal Subspace Projection for Video Class-incremental Learning

Hao CHENG · SIYUAN YANG · Chong Wang · Joey Tianyi Zhou · Alex Kot · Bihan Wen

Video class-incremental learning (VCIL) aims to learn discriminative and generalized feature representations for video frames to mitigate catastrophic forgetting. Conventional VCIL approaches often retain a subset of frames or features from prior tasks as exemplars for subsequent incremental learning stages. However, these strategies overlook the connection between base and novel classes and sometimes even lead to privacy leakage. To overcome this challenge, we introduce a Spatial-Temporal Subspace Projection (STSP) scheme for VCIL. Specifically, we propose a discriminative Temporal-based Subspace Classifier (TSC) that represents each class with an orthogonal subspace basis and adopts a subspace projection loss for classification. Differing from typical classification methods that rely on fully connected layers, our TSC is designed to discern the spatial-temporal dynamics in video content, thereby enhancing the representation of each video sample. Additionally, we impose inter- and intra-class orthogonality constraints on TSC, ensuring that each class occupies a unique orthogonal subspace defined by its basis. To prevent catastrophic forgetting, we further employ a Spatial-based Gradient Projection (SGP) strategy. SGP adjusts the gradients of the network parameters to align with the approximate null space of the spatial feature set from previous tasks. Extensive experiments conducted on three benchmarks, namely HMDB51, UCF101, and Something-Something V2, demonstrate that our STSP method outperforms state-of-the-art methods.
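A hedged sketch of classification by orthogonal subspace projection, the mechanism the TSC above is built around: each class keeps an orthonormal basis, and a feature is scored by the energy of its projection onto each class subspace. Dimensions, the random bases, and the cross-entropy form of the projection loss are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def class_bases(num_classes=10, feat_dim=128, rank=8, seed=0):
    """One orthonormal basis (feat_dim x rank) per class; random here, learned in practice."""
    g = torch.Generator().manual_seed(seed)
    bases = [torch.linalg.qr(torch.randn(feat_dim, rank, generator=g))[0] for _ in range(num_classes)]
    return torch.stack(bases)                       # (C, D, r)

def subspace_logits(features, bases):
    """Score each class by the squared norm of the feature's projection onto its subspace."""
    coeffs = torch.einsum("bd,cdr->bcr", features, bases)   # projection coefficients per class
    return (coeffs ** 2).sum(dim=-1)                # (B, C): energy preserved by each class subspace

bases = class_bases()
feats = F.normalize(torch.randn(32, 128), dim=-1)
labels = torch.randint(0, 10, (32,))
loss = F.cross_entropy(subspace_logits(feats, bases), labels)   # projection-based classification loss
```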


# 191
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Yingsen Zeng · Yujie Zhong · Chengjian Feng · Lin Ma

Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify events described by open-ended natural language within untrimmed videos. Although they focus on different events, we observe that they share a significant connection. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Secondly, we explore the efficacy of two task-fusion learning approaches, pre-training and co-training, to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task-fusion learning scheme enables the two tasks to help each other and outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet.


# 185
Strong Double Blind
Contextual Correspondence Matters: Bidirectional Graph Matching for Video Summarization

yunzuo zhang · Yameng Liu

Video summarization plays a vital role in improving video browsing efficiency and has various applications in action recognition and information retrieval. To generate summaries that provide key information, existing works simultaneously explore the contributions of both long-range and short-range temporal cues. However, they rarely consider the potential correspondence between temporal cues of different granularity within video sequences, which is insufficient to ensure detailed video understanding. To solve this issue, we propose a novel video summarization framework, namely Bgm4Video, based on a graph-matching mechanism, which models the potential contextualized relationships across multi-granularity temporal cues. The proposed framework is composed of two dominant components: (i) a temporal encoder (TE) that explores both coarse-grained and fine-grained contextual information within videos and (ii) a bidirectional graph transmission (BGT) module that models the interrelationship across multi-granularity temporal cues. By grasping the contextual correspondence, our method allows for further refining temporal representations to precisely pinpoint valuable segments. We demonstrate the advantage of our components through an extensive ablation study. We also evaluate our full approach on the video summarization task and demonstrate improvements over the state-of-the-art on popular benchmarks.


# 186
Strong Double Blind
Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment

Yang Jin · Yadong Mu

This paper explores the spatio-temporal video grounding (STVG) task, which aims at localizing a particular object corresponding to a given textual description in an untrimmed video. Existing approaches mainly resort to object-level manual annotations as the supervision for addressing this challenging task. Such a paradigm heavily constrains the scalability of processing large-scale unlabeled data. To this end, we present a novel framework that is capable of grounding the target object relying only on video-sentence correspondence. Specifically, our model reformulates the original STVG task as two cross-modal alignment sub-problems: region-phrase and frame-sentence. Since ground-truth alignments are absent during the training stage, we treat them as latent variables and learn to model the joint conditional distribution by reconstructing the interactions of entities in the video. The entire framework can be effectively optimized by the variational Expectation-Maximization (EM) algorithm, which alternates between two updating steps to progressively maximize the likelihood of the query sentence, thereby approximating the real cross-modal assignment. Extensive experiments on two video benchmarks (VidSTG and HC-STVG) further show the effectiveness of the proposed method.


# 193
Strong Double Blind
AMEGO: Active Memory from long EGOcentric videos

Gabriele Goletto · Tushar Nagarajan · Giuseppe Averta · Dima Damen

Egocentric videos provide a unique perspective into individuals' daily experiences, yet their unstructured nature presents challenges for perception. In this paper, we introduce AMEGO, a novel approach aimed at enhancing the comprehension of very long egocentric videos. Inspired by humans' ability to memorise information from a single viewing, our method focuses on constructing self-contained representations from the egocentric video, capturing key locations and object interactions. This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content. Additionally, to evaluate the understanding of very long egocentric videos, we introduce the new Active Memories Benchmark (AMB), composed of more than 20K highly challenging visual queries from EPIC-KITCHENS. These queries cover different levels of video reasoning (sequencing, concurrency and temporal grounding) to assess detailed video understanding capabilities. We showcase improved performance of AMEGO on AMB, surpassing other video QA baselines by a substantial margin.


# 184
Strong Double Blind
Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

Xiang Fang · Zeyu Xiong · Wanlong Fang · Xiaoye Qu · Chen Chen · Jianfeng Dong · Keke Tang · Pan Zhou · Yu Cheng · Daizong Liu

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on a moment candidate selection pipeline that utilizes contrastive learning and a reconstruction paradigm for scoring pre-defined moments. Although they have achieved significant progress, we argue that their current frameworks overlook two indispensable issues: (1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding moment boundaries. (2) Complex moment candidates: the performance of these methods relies heavily on the quality of moment candidates, which are also time-consuming and complicated to select. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each frame-word pair with diverse granularity and flexible combination for fine-grained cross-modal interaction. Specifically, we model each video frame and query word as game players within multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for fine-grained moment boundary grounding. Experiments show that our method achieves superior performance on both the Charades-STA and ActivityNet Captions datasets.


# 192
Strong Double Blind
TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning

Huabin Liu · Xiao Ma · Cheng Zhong · Yang Zhang · Weiyao Lin

Video reasoning typically operates within the Video Question-Answering (VQA) paradigm, which demands that the models understand and reason about video content from temporal and causal perspectives. Traditional supervised VQA methods gain this capability through meticulously annotated QA datasets, while advanced visual-language models exhibit remarkable performance due to large-scale visual-text pretraining data. Nevertheless, due to potential language bias and spurious visual-text correlations in cross-modal learning, concerns about the reliability of their answers persist in real-world applications. In this paper, we focus on the grounded VQA task, which necessitates models to provide answers along with explicit visual evidence, i.e., certain video segments. As temporal annotation is not available during training, we propose a novel bi-directional reasoning framework to perform grounded VQA in a weakly-supervised setting. Specifically, our framework consists of two parallel but dual reasoning paths. They conduct temporal grounding and answering based on the video content, approaching it from two dual directions that are symmetrical in terms of temporal order or causal relationships. By constructing a cycle-consistency relationship between these two branches, the model is prompted to provide self-guidance supervision for both temporal grounding and answering. Experiments conducted on the Next-GQA and Env-QA datasets demonstrate that our framework achieves superior performance in grounded VQA and can provide reasonable temporal locations that validate the answers.


# 187
Strong Double Blind
Delving Deep into Engagement Prediction of Short Videos

dasong Li · Wenjie Li · Baili Lu · Hongsheng LI · Sizhuo Ma · Gurunandan Krishnan · Jian Wang

Understanding and modeling the popularity of User Generated Content (UGC) short videos on social media platforms presents a critical challenge with broad implications for content creators and recommendation systems. This study delves deep into the intricacies of predicting engagement for newly published videos with limited user interactions. Surprisingly, our findings reveal that Mean Opinion Scores from previous video quality assessment datasets do not strongly correlate with video engagement levels. To address this, we introduce a substantial dataset comprising 90,000 real-world UGC short videos from Snapchat. Rather than relying on view count, average watch time, or rate of likes, we propose two metrics, normalized average watch percentage (NAWP) and engagement continuation rate (ECR), to describe the engagement levels of short videos. Comprehensive multi-modal features, including visual content, background music, and text data, are investigated to enhance engagement prediction. With the proposed dataset and two key metrics, our method demonstrates the ability to predict the engagement of short videos purely from video content.


# 188
LITA: Language Instructed Temporal-Localization Assistant

De-An Huang · Shijia Liao · Subhashree Radhakrishnan · (Danny) Hongxu Yin · Pavlo Molchanov · Zhiding Yu · Jan Kautz

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction-following capabilities. However, an important missing piece is temporal localization: these models cannot accurately answer "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing the Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset ActivityNet-RTL for learning and evaluating this task. Reasoning temporal localization requires both reasoning and temporal localization from Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding.
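Relative time tokens, as described in feature (1), can be sketched as a simple quantisation of timestamps by video length. The token naming (`<t36>`) and the number of tokens are assumptions for illustration; LITA's actual tokenizer integration is not shown.

```python
def to_time_token(timestamp_s, video_length_s, num_tokens=100):
    """Quantise a timestamp into one of `num_tokens` relative time tokens, e.g. "<t36>"."""
    frac = min(max(timestamp_s / video_length_s, 0.0), 1.0)
    idx = min(int(frac * num_tokens), num_tokens - 1)
    return f"<t{idx}>"

def from_time_token(token, video_length_s, num_tokens=100):
    """Map a time token back to the approximate centre of its interval, in seconds."""
    idx = int(token.strip("<t>"))
    return (idx + 0.5) / num_tokens * video_length_s

print(to_time_token(33.0, 90.0))            # "<t36>" for second 33 of a 90 s clip
print(from_time_token("<t36>", 90.0))       # ~32.85 s, the centre of that token's interval
```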


# 182
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Faegheh Sardari · Armin Mustafa · Philip JB Jackson · Adrian Hilton

Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging both unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering them out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method's capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively. Our code will be released upon paper publication.


# 169
Siamese Vision Transformers are Scalable Audio-visual Learners

Yan-Bo Lin · Gedas Bertasius

Traditional audio-visual methods rely on independent audio and visual backbones, which is costly and not scalable. In this work, we investigate using an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining. Our framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing us to scale our method to larger datasets and model sizes. We pretrain our model using a contrastive audio-visual matching objective with a multi-ratio random masking scheme, which further speeds up the audio-visual pretraining process and enables our model to process larger audio-visual instance batches, helpful for contrastive learning. Unlike prior audio-visual methods, our method can robustly handle audio-only, visual-only, and audio-visual inputs with a single shared ViT backbone. Furthermore, despite using the shared backbone for both modalities, AVSiam achieves competitive or even better results than prior methods on AudioSet and VGGSound for audio-visual classification and audio-visual retrieval. Our code and models will be made publicly available.
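A minimal sketch of the shared-backbone idea, assuming separate patch embedders for RGB frames and audio spectrograms feeding one shared transformer encoder, trained with a symmetric contrastive matching loss. Layer sizes are toy values, and the paper's pretrained ViT weights and multi-ratio masking scheme are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAVEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        self.img_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # RGB frame -> patch tokens
        self.aud_embed = nn.Conv2d(1, dim, kernel_size=16, stride=16)   # spectrogram -> patch tokens
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, depth)               # one backbone for both modalities

    def _encode(self, tokens):
        return self.shared(tokens).mean(dim=1)                          # mean-pooled clip embedding

    def forward(self, image, spectrogram):
        v = self.img_embed(image).flatten(2).transpose(1, 2)            # (B, Nv, dim)
        a = self.aud_embed(spectrogram).flatten(2).transpose(1, 2)      # (B, Na, dim)
        return self._encode(v), self._encode(a)

def contrastive_matching_loss(v, a, temperature=0.07):
    v, a = F.normalize(v, dim=-1), F.normalize(a, dim=-1)
    logits = v @ a.t() / temperature                                    # sample i pairs with sample i
    targets = torch.arange(v.shape[0], device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

model = SharedAVEncoder()
v, a = model(torch.randn(4, 3, 224, 224), torch.randn(4, 1, 128, 128))
loss = contrastive_matching_loss(v, a)
```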


# 197
Strong Double Blind
EvSign: Sign Language Recognition and Translation with Streaming Events

Pengyu Zhang · Hao Yin · Zeren Wang · Wenyue Chen · Sheng Ming Li · Dong Wang · Huchuan Lu · XU JIA

Sign language is one of the most effective communication tools for people with hearing difficulties. Most existing works focus on improving the performance of sign language tasks on RGB videos, which may suffer from degraded recording conditions, such as fast hand movements with motion blur and the signer's textured appearance. The bio-inspired event camera, which asynchronously captures brightness changes at high speed, can naturally perceive dynamic hand movements, providing rich manual clues for sign language tasks. In this work, we explore the potential of event cameras in continuous sign language recognition (CSLR) and sign language translation (SLT). To promote this research, we first collect an event-based benchmark, EvSign, for these tasks with both gloss and spoken language annotations. The EvSign dataset offers a substantial amount of high-quality event streams and an extensive vocabulary of glosses and words, thereby facilitating the development of sign language tasks. In addition, we propose an efficient transformer-based framework for event-based SLR and SLT tasks, which fully leverages the advantages of streaming events. A sparse backbone is employed to extract visual features from sparse events. Then, temporal coherence is effectively utilized through the proposed local token fusion and gloss-aware temporal aggregation modules. Extensive experimental results are reported on both the simulated (PHOENIX14T) and EvSign datasets. Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost (0.84G FLOPs per video) and 44.2% of the network parameters. The project is available at https://zhang-pengyu.github.io/EVSign.


# 225
Strong Double Blind
WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Quan Kong · Yuki Kawana · Rajat Saini · Ashutosh Kumar · Jingjing Pan · Ta Gu · Yohei Ozao · Balazs Opra · Yoichi Sato · Norimasa Kobori

In this paper, we address the challenge of fine-grained video event understanding in traffic scenarios, which is vital for autonomous driving and safety. Traditional datasets focus on driver or vehicle behavior, often neglecting pedestrian perspectives. To fill this gap, we introduce the WTS dataset, highlighting detailed behaviors of both vehicles and pedestrians across over 1.2k video events in hundreds of traffic scenarios. WTS integrates diverse perspectives from vehicle ego and fixed overhead cameras in a vehicle-infrastructure cooperative environment, enriched with comprehensive textual descriptions and unique 3D Gaze data for a synchronized 2D/3D view, focusing on pedestrian analysis. We also provide annotations for 5k publicly sourced pedestrian-related traffic videos. Additionally, we introduce LLMScorer, an LLM-based evaluation metric to align inference captions with ground truth. Using WTS, we establish a benchmark for dense video-to-text tasks, exploring state-of-the-art Vision-Language Models with an instance-aware VideoLLM method as a baseline. WTS aims to advance fine-grained video event understanding, enhancing traffic safety and autonomous driving development. Dataset page: https://woven-visionai.github.io/wts-dataset-homepage/.


# 144
Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Jiangming Shi · Xiangbo Yin · Yeyun Chen · Yachao Zhang · Zhizhong Zhang · Yuan Xie · Yanyun Qu

Unsupervised visible-infrared person re-identification (USL-VI-ReID) is a promising yet highly challenging retrieval task. The key challenges in USL-VI-ReID are to accurately generate pseudo-labels and establish pseudo-label correspondences across modalities without relying on any prior annotations. Recently, clustered pseudo-label methods have gained more attention in USL-VI-ReID. However, most existing methods do not fully exploit intra-class nuances, as they simply utilize a single memory representing an identity to establish cross-modality correspondences, resulting in noisy cross-modality correspondences. To address this problem, we propose a Multi-Memory Matching (MMM) framework for USL-VI-ReID. We first design a simple yet effective Cross-Modality Clustering (CMC) module to generate pseudo-labels by clustering samples from both modalities together. To associate cross-modality clustered pseudo-labels, we design a Multi-Memory Learning and Matching (MMLM) module, ensuring that optimization explicitly focuses on the nuances of individual perspectives and establishes reliable cross-modality correspondences. Finally, we design a Soft Cluster-level Alignment (SCA) loss to narrow the modality gap while mitigating the effect of noisy pseudo-labels through a soft many-to-many alignment strategy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the reliability of the established cross-modality correspondences and the effectiveness of MMM. The source codes will be released.


# 3
Strong Double Blind
Masked Angle-Aware Autoencoder for Remote Sensing Images

Zhihao Li · Biao Hou · Siteng Ma · zitong wu · Xianpeng Guo · bo ren · Licheng Jiao

To overcome the inherent domain gap between remote sensing (RS) images and natural images, some self-supervised representation learning methods have made promising progress. However, they have overlooked the diverse angles present in RS objects. This paper proposes the Masked Angle-Aware Autoencoder (MA3E) to perceive and learn angles during pre-training. We design a scaling center-crop operation to create a rotated crop with random orientation on each original image, introducing explicit angle variation. MA3E takes this composite image as input while reconstructing the original image, aiming to learn rotation-invariant representations by restoring the angle variation introduced on the rotated crop. To avoid biases caused by directly reconstructing the rotated crop, we propose an Optimal Transport (OT) loss that automatically assigns similar original image patches to each rotated crop patch for reconstruction. MA3E demonstrates more competitive performance than existing pre-training methods on seven different RS image datasets across three downstream tasks.
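The OT loss above assigns original-image patches to rotated-crop patches as soft reconstruction targets; a generic entropy-regularised (Sinkhorn) assignment can sketch the idea. The cost definition, regularisation strength, and iteration count are assumptions, and the paper's exact OT formulation may differ.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, n_iters=20, eps=0.05):
    """Entropy-regularised transport plan for a (R, C) cost matrix with uniform marginals."""
    R, C = cost.shape
    K = torch.exp(-cost / eps)
    r = torch.full((R,), 1.0 / R)
    c = torch.full((C,), 1.0 / C)
    v = torch.ones(C) / C
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return u[:, None] * K * v[None, :]              # plan whose rows/columns match the marginals

# Cost: 1 - cosine similarity between 16 rotated-crop patch features and 49 original patch features.
crop = F.normalize(torch.randn(16, 128), dim=-1)
orig = F.normalize(torch.randn(49, 128), dim=-1)
plan = sinkhorn(1.0 - crop @ orig.t())
# Row-normalise the plan to obtain a soft assignment of original patches to each crop patch,
# which can then serve as that crop patch's reconstruction target.
targets = (plan / plan.sum(dim=1, keepdim=True)) @ orig
```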


# 244
Strong Double Blind
Revisit Anything: Visual Place Recognition via Image Segment Retrieval

Kartik Garg · Sai Shubodh Puligilla · Shishir N Y Kolathaya · Madhava Krishna · Sourav Garg

Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the 'whole' image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: "the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap". We address this by encoding and searching for 'image segments' instead of the whole images. We propose to use open-set image segmentation to decompose an image into 'meaningful' entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole-image-based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to 'both' generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to "revisit anything" by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. We will make the source code publicly available.


# 206
Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL

Fangwei Zhong · Kui Wu · Hai Ci · Chu-ran Wang · Hao Chen

Embodied visual tracking means following a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied vision agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online agent-environment interactions. To improve the training efficiency, robustness, and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train the tracker within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. Such efficiency is unprecedented for RL-based visual tracking methods. We evaluate our tracker in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets.


# 190
Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang · Junpeng Yue · Hao Luo · gang ding · Zongqing Lu

One of the essential missions in the AI research community is to build an autonomous embodied agent that can achieve high-level performance across a wide spectrum of tasks. However, acquiring or manually designing rewards for all open-ended tasks is unrealistic. In this paper, we propose a novel cross-modal contrastive learning framework, CLIP4MC, aiming to learn a reinforcement learning (RL) friendly vision-language model (VLM) that serves as an intrinsic reward function for open-ended tasks. Simply utilizing the similarity between the video snippet and the language prompt is not RL-friendly, since standard VLMs may only capture the similarity at a coarse level. To achieve RL-friendliness, we incorporate the task completion degree into the VLM training objective, as this information can assist agents in distinguishing the importance of different states. Moreover, we provide neat YouTube datasets based on the large-scale YouTube database provided by MineDojo. Specifically, two rounds of filtering operations guarantee that the dataset covers enough essential information and that the video-text pairs are highly correlated. Empirically, we demonstrate that the proposed method achieves better performance on RL tasks compared with baselines.


# 210
Strong Double Blind
DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control

Xinyu Xu · Shengcheng Luo · Yanchao Yang · Yong-Lu Li · Cewu Lu

Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term blueprint of embodied AI research, which poses requirements on task planning, environment modeling, and object interaction. In this work, we study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair. We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient control. In particular, DISCO incorporates differentiable scene representations of rich semantics in objects and affordances, which are dynamically explored on the fly and facilitate navigation planning. Besides, we propose dual-level coarse-to-fine action controls leveraging both global and local cues to accomplish mobile manipulation tasks efficiently. DISCO easily integrates into embodied tasks such as embodied instruction following. To validate our approach, we take the ALFRED benchmark of large-scale, long-horizon vision-language navigation and interaction tasks as a test bed. In extensive experiments, we make comprehensive evaluations and demonstrate that DISCO outperforms the state of the art by a sizable +8.6% success rate margin in unseen scenes, even without step-by-step instructions. Our code and model will be made publicly available.


# 194
See and Think: Embodied Agent in Virtual Environment

Zhonghan Zhao · Xuan Wang · Wenhao Chai · Boyi Li · Shengyu Hao · Shidong Cao · Tian Ye · Gaoang Wang

Large language models (LLMs) have achieved impressive progress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. In this paper, we propose STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE consists of three key components: vision perception, language instruction, and code action. Vision perception involves the interpretation of visual information in the environment, which is then integrated into the LLM component together with the agent state and task instruction. Language instruction is responsible for iterative reasoning and decomposing complex tasks into manageable guidelines. Code action generates executable skill actions based on retrieval from a skill database, enabling the agent to interact effectively within the Minecraft environment. We also collect the STEVE-21K dataset, which includes 600+ vision-environment pairs, 20K knowledge question-answering pairs, and 200+ skill-code pairs. We conduct continuous block search, knowledge question answering, and tech tree mastery experiments to evaluate the performance. Extensive experiments show that STEVE unlocks key tech trees up to 1.5× faster and completes block search tasks up to 2.5× quicker compared to previous state-of-the-art methods.


# 212
Strong Double Blind
PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

Ginger Delmas · Philippe Weinzaepfel · Francesc Moreno · Grégory Rogez

Aligning multiple modalities in a latent space, such as images and texts, has been shown to produce powerful semantic visual representations, fueling tasks like image captioning, text-to-image generation, or image grounding. In the context of human-centric vision, although these CLIP-like representations encode most standard human poses relatively well (such as standing or sitting), they lack sufficient acuteness to discern detailed or uncommon ones. Indeed, while 3D human poses have often been associated with images (e.g., to perform pose estimation or pose-conditioned image generation), or more recently with text (e.g., for text-to-pose generation), they have seldom been paired with both. In this work, we combine 3D poses, people's pictures, and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities. When composing modalities, it outperforms a standard multi-modal alignment retrieval model, making it possible to sort out partial information (e.g., an image with the lower body occluded). We showcase the potential of such an embroidered pose representation on the task of fine-grained instruction generation, which consists of generating a text that describes how to move from one 3D pose to another (as a fitness coach would). Unlike prior works, it can be applied directly to any kind of input (image and/or pose) without retraining.


# 177
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Fucai Ke · Zhixi Cai · Simindokht Jahangard · Weiqing Wang · Pari Delir Haghighi · Hamid Rezatofighi

Recent advances in visual reasoning (VR), particularly with the aid of Large Vision-Language Models (VLMs), show promise but require access to large-scale datasets and face challenges such as high computational costs and limited generalization capabilities. Compositional visual reasoning approaches have emerged as effective strategies; however, they heavily rely on the commonsense knowledge encoded in Large Language Models (LLMs) to perform planning, reasoning, or both, without considering the effect of their decisions on the visual reasoning process, which can lead to errors or failed procedures. To address these challenges, we introduce HYDRA, a multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning. HYDRA integrates three essential modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive controller, and a reasoner. The planner and reasoner modules utilize an LLM to generate instruction samples and executable code from the selected instruction, respectively, while the RL agent dynamically interacts with these modules, making high-level decisions on selection of the best instruction sample given information from the historical state stored through a feedback loop. This adaptable design enables HYDRA to adjust its actions based on previous feedback received during the reasoning process, leading to more reliable reasoning outputs and ultimately enhancing its overall effectiveness. Our framework demonstrates state-of-the-art performance in various VR tasks on four different widely-used datasets.


# 170
Strong Double Blind
Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Mingyu Zhang · Jiting Cai · Mingyu Liu · YUE XU · Cewu Lu · Yong-Lu Li

Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets thus lacking generalization ability. Through rigorous evaluation on diverse benchmarks, we demonstrate the shortcomings of existing methods in achieving cross-domain reasoning and their tendency to data bias fitting. In this paper, we revisit visual reasoning with a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage is better at generalization than symbolization. Thus, it is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks following the separated symbolization and shared reasoning. Our two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning. Our code will be publicly available.


# 163
Strong Double Blind
Multi-Task Domain Adaptation for Language Grounding with 3D Objects

Penglei SUN · Yaoxian Song · Xinglin Pan · Peijie Dong · Xiaofei Yang · Qiang Wang · Zhixu Li · Tiefeng Li · Xiaowen Chu

Existing works on object-level language grounding with 3D objects mostly focus on improving performance by utilizing off-the-shelf pre-trained models to capture features, such as through viewpoint selection or geometric priors. However, they fail to explore cross-modal representations for language-vision alignment in cross-domain settings. To address this problem, we propose a novel method called Domain Adaptation for Language Grounding (DA4LG) with 3D objects. Specifically, the proposed DA4LG consists of a visual adapter module with multi-task learning that realizes vision-language alignment through comprehensive multimodal feature representation. Experimental results demonstrate that DA4LG performs competitively across visual and non-visual language descriptions, independent of the completeness of observation. DA4LG achieves state-of-the-art performance in the single-view and multi-view settings, with accuracies of 83.8% and 86.8%, respectively, on the language grounding benchmark SNARE. Simulation experiments show the practical and generalized performance of DA4LG compared to existing methods. Our project is available anonymously at https://sites.google.com/view/da4lg.


# 172
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang · Dongzhi Jiang · Yichi Zhang · Haokun Lin · Ziyu Guo · Pengshuo Qiu · Aojun Zhou · Pan Lu · Kai-Wei Chang · Peng Gao · Hongsheng LI

The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We observe that current benchmarks incorporate excessive visual content within the textual questions, which may assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging true or false, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then assess each step with error analysis to derive a total score, which can reveal the inner CoT reasoning quality of MLLMs. With MathVerse, we find that most existing MLLMs struggle to understand math diagrams, relying heavily on the textual questions. Surprisingly, some of them even achieve 5%+ higher accuracy without the visual input, e.g., Gemini-Pro and SPHINX-MoE. In contrast, GPT-4V and InternLM-XComposer2 demonstrate relatively better comprehension of the visual content for mathematical reasoning. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs.


# 166
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

Haibo Wang · Weifeng Ge

With the breakthrough of multi-modal large language models, answering complex visual questions that demand advanced reasoning abilities and world knowledge has become an increasingly important testbed for developing AI models. However, equipping AI models with robust cross-modality reasoning ability remains challenging since the cognition scheme of humans has not been understood systematically. In this paper, we argue that if we collect visual clues for each instance in a given image, we can recognize the image more accurately, understand the question better, recall relevant knowledge more easily, and finally reason out the answer. We discover these important and rich visual clues by mining question-answer pairs in images and sending them into multi-modal large language models as prompts. We call the proposed method Q&A Prompts. Specifically, we first use the image-answer pairs and the corresponding questions in the training set as inputs and outputs to train a visual question generation model. Then, we use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers. Finally, we encode these generated question-answer pairs as prompts with a visual-aware prompting module and send them into pre-trained multi-modal large language models to reason out the final answers. Experimental results show that, compared with state-of-the-art methods, our Q&A Prompts achieves substantial improvements on challenging visual question answering datasets requiring reasoning over diverse world knowledge, such as OK-VQA and A-OKVQA.


# 165
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Hao Zhang · Hongyang Li · Feng Li · Tianhe Ren · Xueyan Zou · Shilong Liu · Shijia Huang · Jianfeng Gao · Lei Zhang · Chunyuan Li · Jianwei Yang

With the recent significant advancements in large multimodal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC). Existing grounding datasets only contain short captions. To address this issue, we have created GVC data that allows for the combination of grounding and chat capabilities. To better evaluate the GVC capabilities, we have introduced a benchmark called Grounding Bench. Additionally, we have proposed a model design that can support GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities.


# 164
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

Haoqin Tu · Chenhang Cui · Zijun Wang · Yiyang Zhou · Bingchen Zhao · Junlin Han · Wangchunshu Zhou · Huaxiu Yao · Cihang Xie

This work focuses on the potential of vision large language models (VLLMs) in visual reasoning. Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, Unicorn, covering out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel visual question-answering (VQA) datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs into producing responses unrelated to the visual input. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or language input of VLLMs. Our evaluation of 22 diverse models, ranging from open-source VLLMs to GPT-4V and Gemini Pro, yields interesting observations: 1) Current VLLMs struggle with OOD texts but not images, unless the visual information is limited; and 2) These VLLMs can be easily misled by deceiving vision encoders only, and their vision-language training often compromises safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark.


# 156
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Xin Liu · Yichen Zhu · Jindong Gu · Yunshi Lan · Chao Yang · Yu Qiao

Warning: This paper contains examples of harmful language and images, and reader discretion is recommended. The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied. In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images, as if the text query itself were malicious. To address this, we introduce MM-SafetyBench, a comprehensive framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. We have compiled a dataset comprising 13 scenarios, resulting in a total of 5,040 text-image pairs. Our analysis across 12 state-of-the-art models reveals that MLLMs are susceptible to breaches instigated by our approach, even when the equipped LLMs have been safety-aligned. In response, we propose a straightforward yet effective prompting strategy to enhance the resilience of MLLMs against these types of attacks. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits.


# 162
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

Sensen Gao · Xiaojun Jia · Xuhong Ren · Ivor Tsang · Qing Guo

Vision-language pre-training (VLP) models exhibit remarkable capabilities in comprehending both images and text, yet they remain susceptible to multimodal adversarial examples (AEs). Strengthening adversarial attacks and uncovering vulnerabilities, especially common issues in VLP models (e.g., highly transferable AEs), can stimulate further research on constructing reliable and practical VLP models. A recent work (i.e., Set-level guidance attack) indicates that augmenting image-text pairs to increase AE diversity along the optimization path enhances the transferability of adversarial examples significantly. However, this approach predominantly emphasizes diversity around the online adversarial examples (i.e., AEs in the optimization period), leading to the risk of overfitting the victim model and affecting the transferability. In this study, we posit that the diversity of adversarial examples around both the clean input and the online AEs is pivotal for enhancing transferability across VLP models. Consequently, we propose using diversification along the intersection region of the adversarial trajectory to expand the diversity of AEs. To fully leverage the interaction between modalities, we introduce text-guided adversarial example selection during optimization. Furthermore, to further mitigate potential overfitting, we direct the adversarial text to deviate from the last intersection region along the optimization path, rather than the adversarial images as in existing methods. Extensive experiments affirm the effectiveness of our method in improving transferability across various VLP models and downstream vision-and-language tasks (e.g., Image-Text Retrieval (ITR), Visual Grounding (VG), Image Captioning (IC)).


# 304
Strong Double Blind
Object-Oriented Anchoring and Modal Alignment in Multimodal Learning

Shibin Mei · Bingbing Ni · Hang Wang · Chenglong Zhao · fengfa hu · Zhiming Pi · BiLian Ke

Modality alignment has been of paramount importance in recent developments of multimodal learning, which has inspired many innovations in multimodal networks and pre-training tasks. Single-stream networks can effectively leverage self-attention mechanisms to facilitate modality interactions but suffer from high computational complexity and limited applicability to downstream retrieval tasks. In contrast, dual-stream networks address these issues but ignore the significance of modality alignment. In this paper, we develop a multimodal learning method that integrates the advantages of modality alignment from single-stream networks into the dual-stream network by introducing object-oriented anchors to bridge alignment between image and text modalities. Object-oriented anchors are generated effectively and circumvent the need for the object detection boxes required by previous region-based approaches, while also preserving explicit semantics for modality interactions. Additionally, we design fine-grained token-level asymmetry alignment between modalities and multiview mining to promote modality alignment. To the best of our knowledge, we are the first to apply object-oriented tokens in multimodal pre-training, yielding significant benefits. Extensive experimental results validate the effectiveness of our method, demonstrating that the proposed method outperforms most prior methods in various downstream tasks, particularly when considering comparable data and model scales.


# 167
Strong Double Blind
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Wei Chen · Long Chen · Yu Wu

Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder to address this issue, which reduces cost in both the language and visual aspects. In the language aspect, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries. This allows fusion to scale linearly with language expression length. In the visual aspect, we introduce a parameter-free approach to reduce computation by eliminating background visual tokens based on attention scores. We then design a lightweight mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach.


# 158
Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models

Minchan Kim · Minyeong Kim · Junik Bae · Suhwan Choi · Sungkyung Kim · Buru Chang

Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens. Initially, ESREAL creates a reconstructed image based on the generated caption and aligns its corresponding regions with those of the original image. This semantic reconstruction aids in identifying both the presence and type of token-level hallucinations within the generated caption. Subsequently, ESREAL computes token-level hallucination scores by assessing the semantic similarity of aligned regions based on the type of hallucination. Finally, ESREAL employs a proximal policy optimization algorithm, where it selectively penalizes hallucinated tokens according to their token-level hallucination scores. Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46% on the CHAIR metric. This improvement is achieved solely through signals derived from the image itself, without the need for any image-text pairs.


# 154
Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

Tingyu Qu · Tinne Tuytelaars · Marie-Francine Moens

Mainstream parameter-efficient fine-tuning (PEFT) methods, such as LoRA or Adapter, project a model's hidden states to a lower dimension, allowing pre-trained models to adapt to new data through this low-rank bottleneck. However, PEFT tasks involving multiple modalities, like vision-language (VL) tasks, require not only adaptation to new data but also learning the relationship between different modalities. Targeting VL PEFT tasks, we propose a family of operations, called routing functions, to enhance VL alignment in the low-rank bottlenecks. The routing functions adopt linear operations and do not introduce new trainable parameters. In-depth analyses are conducted to study their behavior. In various VL PEFT settings, the routing functions significantly improve the performance of the original PEFT methods, achieving over 20% improvement on VQAv2 (RoBERTa-large + ViT-L/16) and 30% on COCO Captioning (GPT2-medium + ViT-L/16). When fine-tuning a pre-trained multimodal model such as CLIP-BART, we also observe smaller but consistent improvements across a range of VL PEFT tasks.
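
The abstract does not spell out the exact form of the routing functions, so the sketch below is only a hypothetical illustration of the general idea: a LoRA-style low-rank bottleneck in which the down-projected vision tokens are modulated, without extra trainable parameters, by a pooled text representation. The class name, the elementwise-product routing, and the shared down-projection are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class LoRAWithRouting(nn.Module):
    """LoRA-style low-rank adapter with a hypothetical 'routing function'.

    Sketch only: the elementwise product in the low-rank space is one plausible
    parameter-free routing operation, not necessarily the authors' exact choice.
    """

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)   # shared down-projection (assumption)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)                     # standard LoRA init: start as identity mapping

    def forward(self, vis_hidden: torch.Tensor, txt_hidden: torch.Tensor) -> torch.Tensor:
        # vis_hidden: (B, N_v, d), txt_hidden: (B, N_t, d)
        z_v = self.down(vis_hidden)                            # (B, N_v, r)
        z_t = self.down(txt_hidden).mean(dim=1, keepdim=True)  # pooled text in the low-rank space
        routed = z_v * z_t                                     # parameter-free routing (assumed form)
        return vis_hidden + self.up(routed)                    # residual adapter output


if __name__ == "__main__":
    adapter = LoRAWithRouting(d_model=768, rank=8)
    v = torch.randn(2, 197, 768)   # e.g. ViT patch tokens
    t = torch.randn(2, 32, 768)    # e.g. text tokens
    print(adapter(v, t).shape)     # torch.Size([2, 197, 768])
```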


# 160
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Bowen Shi · Peisen Zhao · Zichen Wang · Yuhang Zhang · Yaoming Wang · Jin Li · Wenrui Dai · Junni Zou · Hongkai Xiong · Qi Tian · Xiaopeng Zhang

Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularity, encompassing image-level, region-level, and pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities across different levels of detail. With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models.


# 104
ReGround: Improving Textual and Spatial Grounding at No Cost

Phillip (Yuseung) Lee · Minhyuk Sung

When an image generation process is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do these elements work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention. We demonstrate that such bias can be significantly mitigated without sacrificing accuracy in either grounding by simply rewiring the network architecture, changing from sequential to parallel for gated self-attention and cross-attention. This surprisingly simple yet effective solution does not require any fine-tuning of the network but significantly reduces the trade-off between the two groundings. Our experiments demonstrate significant improvements from the original GLIGEN to the rewired version in the trade-off between textual grounding and spatial grounding.
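
As a rough illustration of the rewiring idea (not GLIGEN's actual implementation), the toy block below contrasts the sequential flow, where cross-attention consumes features already modulated by the spatial branch, with a parallel flow where both branches read the same input. All module choices, names, and shapes are placeholders.

```python
import torch
import torch.nn as nn

class ToyGroundedBlock(nn.Module):
    """Toy contrast of sequential vs. parallel wiring for the two grounding branches.

    The attention modules below are generic stand-ins; they only show where each
    branch reads its input from, which is the point of the rewiring.
    """

    def __init__(self, dim: int, parallel: bool = True):
        super().__init__()
        self.parallel = parallel
        self.self_attn = nn.MultiheadAttention(dim, 4, batch_first=True)   # stand-in for self-attention
        self.gated_attn = nn.MultiheadAttention(dim, 4, batch_first=True)  # stand-in for gated self-attn (boxes)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)  # stand-in for cross-attn (text)
        self.gate = nn.Parameter(torch.zeros(1))                           # learnable gate for the spatial branch

    def forward(self, x, box_tokens, text_tokens):
        x = x + self.self_attn(x, x, x)[0]
        if self.parallel:
            # Rewired: both branches read the same features, so the text branch
            # is no longer conditioned on spatially-modulated features.
            g = self.gate.tanh() * self.gated_attn(x, box_tokens, box_tokens)[0]
            c = self.cross_attn(x, text_tokens, text_tokens)[0]
            return x + g + c
        # Original sequential flow: cross-attention sees features already
        # modulated by the spatial branch, which can bias it toward the boxes.
        x = x + self.gate.tanh() * self.gated_attn(x, box_tokens, box_tokens)[0]
        return x + self.cross_attn(x, text_tokens, text_tokens)[0]


if __name__ == "__main__":
    block = ToyGroundedBlock(dim=64, parallel=True)
    out = block(torch.randn(1, 16, 64), torch.randn(1, 4, 64), torch.randn(1, 8, 64))
    print(out.shape)  # torch.Size([1, 16, 64])
```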


# 132
Strong Double Blind
Platypus: A Generalized Specialist Model for Reading Text in Various Forms

Peng Wang · Zhaohai Li · Jun Tang · Humen Zhong · Fei Huang · Zhibo Yang · Cong Yao

Reading text from images (either natural scenes or documents) has been a long-standing research topic for decades, due to the high technical challenge and wide application range. Previously, individual specialist models were developed to tackle the sub-tasks of text reading (e.g., scene text recognition, handwritten text recognition and mathematical expression recognition). However, such specialist models usually cannot effectively generalize across different sub-tasks. Recently, generalist models (such as GPT-4V), trained on tremendous data in a unified way, have shown enormous potential in reading text in various scenarios, but with the drawbacks of limited accuracy and low efficiency. In this work, we propose Platypus, a generalized specialist model for text reading. Specifically, Platypus combines the best of both worlds: being able to recognize text of various forms with a single unified architecture, while achieving excellent accuracy and high efficiency. To better exploit the advantage of Platypus, we also construct a text reading dataset (called Worms), the images of which are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks demonstrate the effectiveness and superiority of the proposed Platypus model. Model and data will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/Platypus.


# 128
Long-CLIP: Unlocking the Long-Text Capability of CLIP

Beichen Zhang · Pan Zhang · Xiaoyi Dong · Yuhang Zang · Jiaqi Wang

Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The text token length is restricted to 77, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and stays aligned with the CLIP latent space, making it a drop-in replacement for CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner. All code and models will be publicly available.
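
A minimal sketch of what a "knowledge-preserved" stretching of the positional embedding could look like, assuming the first few well-trained positions are kept unchanged and only the remainder is linearly interpolated to the longer context. The split point, target length, and interpolation mode are illustrative guesses, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_embed: torch.Tensor,
                                 target_len: int,
                                 keep: int = 20) -> torch.Tensor:
    """Stretch a (orig_len, dim) positional embedding to (target_len, dim).

    The first `keep` positions (roughly the empirically effective text length)
    are copied unchanged; the remaining positions are interpolated.
    """
    head = pos_embed[:keep]                                   # well-trained positions, kept as-is
    tail = pos_embed[keep:].t().unsqueeze(0)                  # (1, dim, orig_len - keep)
    tail = F.interpolate(tail, size=target_len - keep,
                         mode="linear", align_corners=True)   # stretch the rest
    tail = tail.squeeze(0).t()                                # (target_len - keep, dim)
    return torch.cat([head, tail], dim=0)                     # (target_len, dim)


if __name__ == "__main__":
    clip_pos = torch.randn(77, 512)   # stand-in for CLIP's text positional embedding
    long_pos = stretch_positional_embedding(clip_pos, target_len=248)
    print(long_pos.shape)             # torch.Size([248, 512])
```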


# 102
Strong Double Blind
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

Jianjie Luo · Jingwen Chen · Yehao Li · Yingwei Pan · Jianlin Feng · Hongyang Chao · Ting Yao

Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLMs-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at https://jianjieluo.github.io/SynthImgCap.


# 103
RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

Tatiana Gaintseva · Martin Benning · Greg Slabaugh

In this paper we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. Learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that do not have a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from training data, we compute the residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further dramatically reduces training time, stabilizes training and produces high quality enhanced images without artifacts, both in supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
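
The residual vector itself follows directly from the description above: the difference between the mean CLIP embeddings of well-lit and backlit training images. The sketch below shows that computation plus one plausible way to use it as a training signal; the normalisation and the cosine-style guidance loss are assumptions, not the paper's exact formulation.

```python
import torch

def rave_residual(well_lit_emb: torch.Tensor, backlit_emb: torch.Tensor) -> torch.Tensor:
    """Residual guidance vector from (N, d) CLIP image embeddings of training images.

    Simply the difference of the two mean embeddings; unit-normalising it so it
    can be used as a direction is an assumption.
    """
    residual = well_lit_emb.mean(dim=0) - backlit_emb.mean(dim=0)
    return residual / residual.norm()


def guidance_loss(enhanced_emb: torch.Tensor, backlit_emb: torch.Tensor,
                  residual: torch.Tensor) -> torch.Tensor:
    """One plausible training use of the residual (a sketch, not the paper's loss):
    encourage the embedding shift of the enhanced image to align with the residual
    direction, i.e. push backlit inputs towards the well-lit region of CLIP space."""
    shift = enhanced_emb - backlit_emb                      # (B, d)
    shift = shift / shift.norm(dim=-1, keepdim=True)
    return (1.0 - (shift * residual).sum(dim=-1)).mean()    # 1 - cosine alignment
```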


# 137
Tokenize Anything via Prompting

Ting Pan · Lulu Tang · Xinlong Wang · Shiguang Shan

We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 164.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of visual perception tasks.


# 268
Strong Double Blind
FuseTeacher: Modality-fused Encoders are Strong Vision Supervisors

Chen-Wei Xie · Siyang Sun · Liming Zhao · Pandeng Li · Shuailei Ma · Yun Zheng

Learning visual representations from image-text datasets has attracted a lot of attention in recent years. Existing approaches primarily rely on cross-modality supervision, and incorporate intra-modality supervision if necessary. They overlook the potential benefits of modality-fused supervision. Since the modality-fused representation augments the image representation with textual information, we conjecture that it is more discriminative and has the potential to be a strong teacher for visual representation learning. In this paper, we validate this hypothesis by experiments and propose a novel method, FuseTeacher, that learns visual representations with modality-fused supervision. Specifically, we introduce a fusion encoder that encodes image and text into a fusion representation. This representation can be utilized to supervise visual representation learning in two distillation ways: (i) Classification Distillation: we cluster image-text pairs into K clusters using the fusion representation and assign each pair a soft cluster assignment, which serves as a pseudo classification label for supervising the image encoder. (ii) Retrieval Distillation: we calculate the similarities between the fusion representation and all text representations in the same batch. By using this similarity distribution as the pseudo retrieval similarity between the corresponding image and all texts, we can prevent one-to-one contrastive learning from separating relevant but unpaired samples. FuseTeacher is compatible with existing language-supervised visual representation learning methods. Experimental results demonstrate that it brings significant improvements and achieves state-of-the-art performance on various datasets. Our code, datasets and pre-trained models will be released.
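
A minimal sketch of the retrieval-distillation idea, assuming L2-normalised batch embeddings, a softmax temperature, and a KL objective; the temperature value and exact loss form are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def retrieval_distillation_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                fusion_emb: torch.Tensor,
                                tau: float = 0.07) -> torch.Tensor:
    """Distill the fusion encoder's retrieval distribution into the image encoder.

    image_emb, text_emb, fusion_emb: (B, d) L2-normalised embeddings for one batch.
    The fusion-to-text similarity distribution acts as a soft target, instead of
    a hard one-to-one contrastive label, so relevant but unpaired samples are
    not forcibly pushed apart.
    """
    with torch.no_grad():
        teacher = F.softmax(fusion_emb @ text_emb.t() / tau, dim=-1)  # soft retrieval targets
    student_logits = image_emb @ text_emb.t() / tau
    return F.kl_div(F.log_softmax(student_logits, dim=-1), teacher, reduction="batchmean")


if __name__ == "__main__":
    B, d = 16, 256
    img = F.normalize(torch.randn(B, d), dim=-1)
    txt = F.normalize(torch.randn(B, d), dim=-1)
    fus = F.normalize(torch.randn(B, d), dim=-1)
    print(retrieval_distillation_loss(img, txt, fus).item())
```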


# 254
Strong Double Blind
De-confounded Gaze Estimation

Ziyang Liang · Yiwei Bao · Feng Lu

Deep learning-based gaze estimation methods suffer from severe performance degradation in cross-domain settings. One of the primary reasons is that the gaze estimation model is confounded by gaze-irrelevant factors during estimation, such as identity and illumination. In this paper, we propose to tackle this problem by causal intervention, an analytical tool that alleviates the impact of confounding factors by intervening on their distribution. Concretely, we propose the Feature-Separation-based Causal Intervention (FSCI) framework for generalizable gaze estimation. The FSCI framework first separates gaze features from gaze-irrelevant features. To alleviate the impact of gaze-irrelevant factors during training, the FSCI framework further implements causal intervention by averaging gaze-irrelevant features using the proposed Dynamic Confounder Bank strategy. Experiments show that the proposed FSCI framework outperforms SOTA gaze estimation methods in various cross-domain settings, improving the cross-domain accuracy of the baseline by up to 36.2% without touching target domain data.


# 283
Strong Double Blind
GalLoP: Learning global and local prompts for vision-language models

Marc Lafon · Elias Ramzi · Clément Rambour · Nicolas Audebert · Nicolas THOME

Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade off between classification accuracy and robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new "prompt dropout" technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods in accuracy on eleven datasets under different few-shot settings and with various backbones. Furthermore, GalLoP shows strong robustness in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced.


# 155
Strong Double Blind
OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection

Changsheng Lu · Zheyuan Liu · Piotr Koniusz

Exploiting foundation models (e.g., CLIP) to build a versatile keypoint detector has gained increasing attention. Most existing models accept either a text prompt (e.g., "the nose of a cat") or a visual prompt (e.g., a support image with keypoint annotations) to detect the corresponding keypoints in a query image, thereby exhibiting either zero-shot or few-shot detection ability. However, research on taking multimodal prompts is still underexplored, and the prompt diversity in semantics and language is far from being opened up. For example, how should a model handle unseen text prompts for novel keypoint detection, or diverse text prompts like "Can you detect the nose and ears of a cat?" In this work, we open the prompt diversity from three aspects: modality, semantics (seen vs. unseen), and language, to enable a more generalized zero- and few-shot keypoint detection (Z-FSKD). We propose a novel OpenKD model which leverages a multimodal prototype set to support both visual and textual prompting. Further, to infer the keypoint location of unseen texts, we add auxiliary keypoints and texts interpolated from the visual and textual domains into training, which improves the spatial reasoning of our model and significantly enhances zero-shot novel keypoint detection. We also find that a large language model (LLM) is a good parser, achieving over 96% accuracy in parsing keypoints from texts. With the LLM, OpenKD can handle diverse text prompts. Experimental results show that our method achieves state-of-the-art performance on Z-FSKD and initiates new ways to deal with unseen and diverse texts.


# 159
Strong Double Blind
CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection

Shuang Hao · Chunlin Zhong · He Tang

Depth/thermal information is beneficial for detecting salient objects alongside conventional RGB images. However, in dual-modal salient object detection (SOD) models, robustness against noisy inputs and missing modalities is crucial but rarely studied. To tackle this problem, we introduce the Conditional Dropout and LAnguage-driven (CoLA) framework comprising two core components. 1) Language-driven Quality Assessment (LQA): Leveraging a pretrained vision-language model with a prompt learner, the LQA recalibrates image contributions without requiring additional quality annotations. This approach effectively mitigates the impact of noisy inputs. 2) Conditional Dropout (CD): A learning method to strengthen the model's adaptability in scenarios with missing modalities, while preserving its performance under complete modalities. The CD serves as a plug-in training scheme that treats modality missing as a condition, strengthening the overall robustness of various dual-modal SOD models. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art dual-modal SOD models, under both modality-complete and modality-missing conditions.


# 157
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan · Yousong Zhu · Zhiyang Chen · Fan Yang · Ming Tang · Jinqiao Wang

Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Large Vision Language Models (LVLMs). Current LVLMs are predominantly constrained to locating a single, pre-existing object. This limitation leads to a compromise in model design, necessitating the introduction of visual expert models or customized head structures. Beyond these constraints, our research uncovers LVLMs' capability for basic object perception, allowing them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel Language-prompted Localization Dataset to fully unleash the capabilities of LVLMs in fine-grained object perception and precise location awareness. More importantly, we present Griffon, a purely LVLM-based baseline, which does not introduce any special tokens, expert models, or additional detection modules. It simply maintains a consistent structure with popular LVLMs by unifying data formats across various localization-related scenarios and is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that Griffon not only achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities but also approaches the capabilities of the expert model Faster R-CNN on the detection benchmark MSCOCO. The dataset, code, and models will be released.


# 65
Strong Double Blind
Can OOD Object Detectors Learn from Foundation Models?

Jiahui Liu · Xin Wen · Shizhen Zhao · Yingxian Chen · Qi Xiaojuan

Out-of-distribution (OOD) object detection is a challenging task due to the absence of open-set OOD data. Inspired by recent advancements in text-to-image generative models, such as Stable Diffusion, we study the potential of generative models trained on large-scale open-set data to synthesize OOD samples, thereby enhancing OOD object detection. We introduce SyncOOD, a simple data curation method that capitalizes on the capabilities of large foundational models to automatically extract meaningful OOD data from text-to-image generative models. This offers the model access to open-world knowledge encapsulated within off-the-shelf foundational models. The synthetic OOD samples are then employed to augment the training of a lightweight, plug-and-play OOD detector, thus effectively optimizing the in-distribution (ID)/OOD decision boundaries. Extensive experiments across multiple benchmarks demonstrate that SyncOOD significantly outperforms existing methods, establishing new state-of-the-art performance with minimal synthetic data usage. The code and data will be publicly available.


# 238
Strong Double Blind
VEON: Vocabulary-Enhanced Occupancy Prediction

Jilai Zheng · Pin Tang · Zhongdao Wang · Guoqing Wang · Xiangxuan Ren · Bailan Feng · Chao Ma

Perceiving the world as 3D occupancy supports embodied agents in avoiding collision with any type of obstacle. While open-vocabulary image understanding has prospered recently, how to bind the predicted 3D occupancy grids with open-world semantics still remains under-explored due to limited open-world annotations. Hence, instead of building our model from scratch, we try to blend 2D foundation models, specifically a depth model MiDaS and a semantic model CLIP, to lift the semantics to 3D space, thus fulfilling 3D occupancy prediction. However, building upon these foundation models is not trivial. First, MiDaS faces the depth ambiguity problem, i.e., it only produces coarse relative depth and fails to estimate the one-hot bin depth. Second, the CLIP image features lack high-resolution pixel-level information, which limits the 3D occupancy accuracy. Third, open vocabulary is often trapped by the long-tail problem. To address these issues, we propose VEON for Vocabulary-Enhanced Occupancy predictioN by not only assembling but also adapting these foundation models. We first equip MiDaS with a ZoeDepth head and low-rank adaptation (LoRA) for relative-metric-bin depth transformation while preserving its beneficial depth prior. Then, a lightweight side adaptor network is attached to the CLIP vision encoder to generate high-resolution features for fine-grained 3D occupancy prediction. Moreover, we design a class-reweighting strategy to give priority to the tail classes. With only 46.2M trainable parameters and no manual labels, VEON achieves 15.14 mIoU on Occ3D-NuScenes, and also shows the capability of recognizing objects with open-vocabulary categories, demonstrating that our VEON is label- and parameter-efficient, and sufficiently precise.


# 174
Strong Double Blind
Efficient Vision Transformers with Partial Attention

Xuan-Thuy Vo · Duy-Linh Nguyen · Adri Priadana · Kang-Hyun Jo

As the core of the Vision Transformer (ViT), self-attention has high flexibility in modeling long-range dependencies because every query attends to all spatial locations. Although ViT achieves promising performance in visual tasks, self-attention's complexity is quadratic in the token length. This leads to challenging problems when transferring ViT models to dense prediction tasks that require high input resolutions. Previous works have tried to solve this problem by introducing sparse attention, such as spatial-reduction attention and window attention. One common point of these methods is that all image/window tokens are joined during computing attention weights. In this paper, we find that attention weights exhibit high similarity, which incurs computational redundancy. To address this issue, this paper proposes a novel attention mechanism, called partial attention, that learns spatial interactions more efficiently by reducing redundant information in attention maps. Each query in our attention only interacts with a small set of relevant tokens. Based on partial attention, we design an efficient and general vision Transformer, named PartialFormer, that attains good trade-offs between accuracy and computational costs across vision tasks. For example, on ImageNet-1K, PartialFormer-B3 outperforms Swin-T by 1.7% Top-1 accuracy while saving 25% GFLOPs, and Focal-T by 0.8% while saving 30% GFLOPs.
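
Since the abstract does not give the token-selection rule, the sketch below illustrates the general notion of partial attention with a generic top-k sparsification of the attention map: each query keeps only its most relevant keys. It computes the full score matrix purely for clarity, so it does not reflect the method's actual efficiency gains.

```python
import torch
import torch.nn.functional as F

def partial_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      keep: int = 16) -> torch.Tensor:
    """Each query attends to only its `keep` most relevant tokens (generic top-k form).

    q, k, v: (B, N, d). Note: the dense score matrix is built here only to keep
    the example short; an efficient implementation would avoid materialising it.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5      # (B, N, N) full similarity scores
    topk = scores.topk(keep, dim=-1)                 # most relevant keys per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk.indices, topk.values)     # discard the redundant interactions
    attn = F.softmax(mask, dim=-1)                   # sparse attention weights
    return attn @ v


if __name__ == "__main__":
    x = torch.randn(2, 64, 32)
    print(partial_attention(x, x, x, keep=16).shape)  # torch.Size([2, 64, 32])
```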


# 127
Strong Double Blind
SAFARI: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

Sayan Nag · Koustava Goswami · Srikrishna Karanam

Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address the aforementioned issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Figure 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling of unlabeled samples, we introduce a novel Mask Validity Filtering routine based on a spatially aware zero-shot proposal scoring approach. Extensive experiments show that with just 30% of the annotations, our model SAFARI achieves 59.31 and 48.26 mIoU, compared to 58.93 and 48.19 mIoU obtained by the fully-supervised SOTA method SeqTR, on the RefCOCO+ testA and RefCOCO+ testB datasets respectively. SAFARI also outperforms SeqTR by 11.7% (on RefCOCO+ testA) and 19.6% (on RefCOCO+ testB) in a fully-supervised setting and demonstrates strong generalization capabilities in unseen/zero-shot tasks.


# 138
ReMamber: Referring Image Segmentation with Mamba Twister

Yuhuan Yang · Chaofan Ma · Jiangchao Yao · Zhun Zhong · Ya Zhang · Yanfeng Wang

Referring Image Segmentation (RIS) leveraging transformers has achieved great success on the interpretation of complex visual-language tasks. However, the quadratic computation cost makes it difficult to capture long-range visual-language dependencies, which is particularly important in the context of large images with long textual descriptions. Fortunately, Mamba addresses this with efficient linear-complexity processing. However, directly applying Mamba to multi-modal interactions presents challenges, primarily due to inadequate channel interactions for the effective fusion of multi-modal data. In this paper, we propose ReMamber, a novel RIS architecture that integrates the efficiency of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism. We achieve state-of-the-art on all three benchmarks. Moreover, we conduct thorough analyses of ReMamber and discuss other fusion designs using Mamba. These provide valuable perspectives for future research. The code will be released upon publication.


# 126
Strong Double Blind
Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

Zixiao Wang · Hongtao Xie · YuXin Wang · Yadong Qu · Fengjun Guo · Pengwei Liu

The scene text removal (STR) task suffers from insufficient training data due to the expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Different from previous pretraining methods that use indirect auxiliary tasks only to enhance the implicit feature extraction ability, our TMIM first enables the STR task to be directly trained in a weakly supervised manner, which explores the STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model can achieve impressive performance with only the public text detection datasets, which greatly alleviates the limitation of the high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available.


# 54
A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties

Junfei Xiao · Ziqi Zhou · Wenxuan Li · Shiyi Lan · Jieru Mei · Zhiding Yu · Bingchen Zhao · Alan Yuille · Yuyin Zhou · Cihang Xie

This paper introduces ProLab, a novel approach using a property-level label space for creating strong interpretable segmentation models. Instead of relying solely on category-specific annotations, ProLab uses descriptive properties grounded in common sense knowledge for supervising segmentation models. It is based on two core designs. First, we employ Large Language Models (LLMs) and carefully crafted prompts to generate descriptions of all involved categories that carry meaningful common sense knowledge and follow a structured format. Second, we introduce a description embedding model preserving semantic correlation across descriptions and then cluster them into a set of descriptive properties (e.g., 256) using K-Means. These properties are based on interpretable common sense knowledge consistent with theories of human recognition. We empirically show that our approach makes segmentation models perform stronger on five classic benchmarks (ADE20K, COCO-Stuff, Pascal Context, Cityscapes, and BDD). Our method also shows better scalability with extended training steps than category-level supervision. Our interpretable segmentation framework also exhibits emergent generalization ability, segmenting out-of-domain or unknown categories using only in-domain descriptive properties. Codes will be publicly available.


# 105
Strong Double Blind
Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets

Qin Lei · Jiang Zhong · Qizhu Dai

Curvilinear object segmentation plays a crucial role across various applications, yet datasets in this domain often suffer from small scale due to the high costs associated with data acquisition and annotation. To address these challenges, this paper introduces a novel approach for expanding curvilinear object segmentation datasets, focusing on enhancing the informativeness of generated data and the congruence between semantic maps and generated images. Our method enriches synthetic data informativeness by generating curvilinear objects through their multiple textual features. By combining textual features from each sample in the original dataset, we obtain synthetic images that go beyond the original dataset's distribution. This initiative necessitated the creation of the Curvilinear Object Segmentation based on Text Generation (COSTG) dataset. Designed to surpass the limitations of conventional datasets, COSTG incorporates not only standard semantic maps but also textual descriptions of curvilinear object features. To ensure the congruence between synthetic semantic maps and images, we adapted ControlNet with Spatially-Adaptive Normalization (SPADE), allowing it to preserve semantic information that would typically be washed away in normalization layers. This modification facilitates more accurate semantic image synthesis. Experimental results demonstrate the efficacy of our approach across three types of curvilinear objects (angiography, crack and retina) and six public datasets (CHUAC, XCAD, DCA1, DRIVE, CHASEDB1 and Crack500). The synthetic data generated using our method not only expands the dataset but also effectively improves the performance of other curvilinear object segmentation models.


# 124
Strong Double Blind
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation

Seongsu Ha · Chaeyun Kim · Donghwa Kim · Junho Lee · Sangho Lee · Lee Joonseok

TBA


# 263
View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Haodi He · Colton Stearns · Adam Harley · Leonidas Guibas

Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of “coarse” or “fine” granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales by simply using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on a synthetic dataset with multi-view images and multi-granular segmentations, showcasing improved accuracy and viewpoint-consistency. We additionally provide qualitative examples of our model’s 3D hierarchical segmentations in real world scenes.
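
To see why an ultrametric feature space yields a hierarchy by simple thresholding, the toy snippet below uses single-linkage clustering (whose cophenetic distances are ultrametric) on random stand-in features: cuts at different thresholds are automatically nested. This only illustrates the mathematical property, not the paper's feature-field training.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Random 2D features as a stand-in for the learned feature field.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in ((0, 0), (0, 5), (6, 0))])

# Single-linkage cophenetic distances form an ultrametric over the points,
# so thresholding the same structure at different heights gives nested groupings.
tree = linkage(features, method="single")
coarse = fcluster(tree, t=2.5, criterion="distance")   # large threshold -> few, coarse groups
fine = fcluster(tree, t=0.15, criterion="distance")    # small threshold -> many, fine groups

# Hierarchy property: every fine cluster lies entirely inside one coarse cluster.
for f in np.unique(fine):
    assert len(np.unique(coarse[fine == f])) == 1
print(len(np.unique(coarse)), "coarse groups,", len(np.unique(fine)), "fine groups")
```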


# 119
Strong Double Blind
Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

Xi Yang · Songsong Duan · Nannan Wang · Xinbo Gao

Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNNs and the self-attention map of transformers to identify the region of objects. However, both CAM and self-attention maps cannot learn pixel-level fine-grained information about foreground objects, which hinders the further advance of WSOL. To address this problem, we leverage the zero-shot generalization and fine-grained segmentation capabilities of the Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity issue that arises in single-point prompt-based SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Secondly, we deliver grid points as dense prompts into SAM to maximize the probability of the foreground mask, which avoids missing objects caused by a single point/box prompt. Finally, we propose a pixel-level similarity metric to realize the mask matching between the mask prompt and SAM's predicted masks, where the mask with the highest score is viewed as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03% and 66.85% Top-1 Loc, respectively.


# 136
Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation

Zhenliang Ni · Xinghao Chen · Yingjie Zhai · Yehui Tang · Yunhe Wang

Semantic segmentation is an important task for many applications, but it is still quite challenging to achieve advanced performance with limited computational costs. In this paper, we present CGRSeg, an efficient yet competitive segmentation framework based on context-guided spatial feature reconstruction. In it, a Rectangular Self-Calibration Module is carefully designed for spatial feature reconstruction and pyramid context extraction. It captures the global context in both horizontal and vertical directions and aggregates this axial global context to explicitly model rectangular key areas. A shape self-calibration function is designed to make the key areas closer to foreground objects. Besides, a lightweight Dynamic Prototype Guided head is proposed to improve the classification of foreground objects by explicit class embedding. Our CGRSeg is extensively evaluated on ADE20K, COCO-Stuff, and Pascal Context benchmarks, and achieves state-of-the-art performance. Specifically, it achieves 43.6% mIoU on ADE20K with only 4.0 GFLOPs, which is 0.9% and 2.5% mIoU better than SeaFormer and SegNeXt, respectively, with about 38.0% fewer GFLOPs.
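
A generic sketch of axial (horizontal plus vertical) global-context gating, the kind of operation the Rectangular Self-Calibration Module builds on; the specific pooling, 1x1 convolution, and sigmoid gate here are assumptions rather than the module's actual layers.

```python
import torch
import torch.nn as nn

class RectangularContext(nn.Module):
    """Toy axial-context gate: pool along each spatial axis, broadcast the two
    strips back together into a rectangular response, and use it to re-weight
    the input feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        h_ctx = x.mean(dim=3, keepdim=True)    # (B, C, H, 1): global context along the width
        w_ctx = x.mean(dim=2, keepdim=True)    # (B, C, 1, W): global context along the height
        axial = h_ctx + w_ctx                  # broadcast sum highlights a rectangular key area
        gate = self.act(self.conv(axial))      # (B, C, H, W) attention gate
        return x * gate


if __name__ == "__main__":
    m = RectangularContext(32)
    print(m(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```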


# 142
Strong Double Blind
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

Junyi Li · Junfeng Wu · Weizhi Zhao · Song Bai · Xiang Bai

We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. Through a unified framework, PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in open-world scenarios. Specifically, we propose a Q-Former to construct the hierarchical relationship between objects and parts, parsing every object into corresponding semantic parts. By incorporating a large amount of object-level data, the hierarchical relationships can be extended, enabling PartGLEE to recognize a rich variety of parts. We conduct comprehensive empirical studies to validate the effectiveness of our method: PartGLEE achieves state-of-the-art performance across various part-level tasks and maintains comparable results on object-level tasks. Our further analysis indicates that the hierarchical cognitive ability of PartGLEE facilitates detailed image comprehension for MLLMs. Code will be released.


# 134
Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels

Rui Huang · Songyou Peng · Ayca Takmaz · Federico Tombari · Marc Pollefeys · Shiji Song · Gao Huang · Francis Engelmann

Current 3D scene segmentation methods are heavily dependent on manually annotated 3D training datasets. Such manual annotations are labor-intensive, and often lack fine-grained details. Furthermore, models trained on this data typically struggle to recognize object classes beyond the annotated training classes, i.e., they do not generalize well to unseen domains and require additional domain-specific annotations. In contrast, recent 2D foundation models have demonstrated strong generalization and impressive zero-shot abilities, inspiring us to incorporate these characteristics from 2D models into 3D models. Therefore, we explore the use of image segmentation foundation models to automatically generate high-quality training labels for 3D segmentation models. The resulting model, Segment3D, generalizes significantly better than the models trained on costly manual 3D labels and enables easily adding new training data to further boost the segmentation performance.


# 75
Strong Double Blind
Continual Learning and Unknown Object Discovery in 3D Scenes via Self-Distillation

Mohamed El Amine Boudjoghra · Jean Lahoud · Salman Khan · Hisham Cholakkal · Rao M Anwer · Fahad Shahbaz Khan

Open-world 3D instance segmentation is a recently introduced problem with diverse applications, notably in continually learning embodied agents. This task involves segmenting unknown instances, and learning new instances when their labels are introduced. However, prior research in the open-world domain has traditionally addressed the two sub-problems, namely continual learning and unknown object identification, separately. This approach has resulted in limited performance on unknown instances and cannot effectively mitigate catastrophic forgetting. Additionally, these methods bypass the utilization of the information stored in the previous version of the continual learning model, instead relying on a dedicated memory to store historical data samples, which inevitably leads to an expansion of the memory budget. In this paper, we argue that continual learning and unknown class identification should be tackled in conjunction. Therefore, we propose a new exemplar-free approach for 3D continual learning and the discovery of unknown classes through self-distillation. Our approach leverages the pseudo-labels generated by the model from the preceding task to improve the unknown predictions during training while simultaneously mitigating catastrophic forgetting. By integrating these pseudo-labels into the continual learning process, we achieve enhanced performance in handling unknown classes. We validate the efficacy of the proposed approach via comprehensive experiments on various splits of the ScanNet200 dataset, showcasing superior performance in continual learning and unknown class retrieval compared to the state-of-the-art. Our codes will be publicly released.


# 114
Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation

Xinyu Yang · Hossein Rahmani · Sue Black · Bryan M. Williams

Class activation maps (CAMs) are commonly employed in weakly supervised semantic segmentation (WSSS) to produce pseudo-labels. Due to incomplete or excessive class activation, existing studies often resort to offline CAM refinement, introducing additional stages or proposing offline modules. This can cause optimization difficulties for single-stage methods and limit generalizability. In this study, we aim to reduce the observed CAM inconsistency and error to mitigate reliance on refinement processes. We propose an end-to-end WSSS model incorporating guided CAMs, wherein our segmentation model is trained while concurrently optimizing CAMs online. Our method, Co-training with Swapping Assignments (CoSA), leverages a dual-stream framework, where one sub-network learns from the swapped assignments generated by the other. We introduce three techniques in this framework: i) soft perplexity-based regularization to penalize uncertain regions; ii) a threshold-searching approach to dynamically revise the confidence threshold; and iii) contrastive separation to address the coexistence problem. CoSA demonstrates exceptional performance, achieving mIoU of 76.2% and 51.0% on VOC and COCO validation datasets, respectively, surpassing existing baselines by a substantial margin. Notably, CoSA is the first single-stage approach to outperform all existing multi-stage methods including those with additional supervision. Source code will be publicly available.
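For readers who want a concrete picture of the swapping mechanism, the sketch below shows a generic dual-stream update in which each sub-network is supervised by the confident assignments produced by the other. It is a minimal illustration only: CoSA's online CAM guidance, perplexity-based regularization, threshold search, and contrastive separation are omitted, and all names and thresholds are placeholders.

```python
import torch
import torch.nn.functional as F

def swapped_assignment_step(net_a, net_b, images, opt_a, opt_b, conf_thresh=0.9):
    """One co-training step: each network learns from the other's assignments."""
    logits_a, logits_b = net_a(images), net_b(images)   # (B, C, H, W) segmentation logits

    with torch.no_grad():                               # assignments act as fixed targets
        conf_a, assign_a = logits_a.softmax(1).max(1)
        conf_b, assign_b = logits_b.softmax(1).max(1)
        keep_a = (conf_a > conf_thresh).float()
        keep_b = (conf_b > conf_thresh).float()

    # Swap: net_a is supervised by net_b's assignments, and vice versa.
    loss_a = (F.cross_entropy(logits_a, assign_b, reduction="none") * keep_b).mean()
    loss_b = (F.cross_entropy(logits_b, assign_a, reduction="none") * keep_a).mean()

    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()
```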


# 113
Strong Double Blind
Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier

Prantik Howlader · Srijan Das · Hieu Le · Dimitris Samaras

Incorporating pixel contextual information is critical for accurate segmentation. In this paper, we show that an effective way to incorporate contextual information is through a patch-based classifier. This patch classifier is trained to identify classes present within an image region, which facilitates the elimination of distractors and enhances the classification of small object segments. Specifically, we introduce \textbf{Multi-scale Patch-based Multi-label Classifier} (MPMC), a novel plug-in module designed for existing semi-supervised segmentation (SSS) frameworks. MPMC offers patch-level supervision, enabling the discrimination of pixel regions of different classes within a patch. Furthermore, MPMC learns an adaptive pseudo-label weight, using patch-level classification to alleviate the impact of the teacher’s noisy pseudo-label supervision on the student. This lightweight module can be integrated into any SSS framework, significantly enhancing its performance. We demonstrate the efficacy of our proposed MPMC by integrating it into four SSS methodologies and improving them on two natural image datasets and one medical segmentation dataset, notably improving the segmentation results of the baselines across all three datasets.


# 135
Strong Double Blind
Bayesian Self-Training for Semi-Supervised 3D Segmentation

Ozan Unal · Christos Sakaridis · Luc Van Gool

3D segmentation is a core problem in computer vision and, similarly to many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds to employ fully-supervised training remains too labor intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic n-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer.
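As a rough illustration of the uncertainty-filtered pseudo-labelling step, the snippet below uses Monte Carlo dropout as the stochastic inference mechanism and thresholds the predictive entropy per point. The paper's actual uncertainty estimate, n-partite matching algorithm, and instance/grounding extensions are not reproduced; `model`, the number of passes, and the threshold are placeholders.

```python
import torch

@torch.no_grad()
def uncertainty_filtered_pseudo_labels(model, points, n_passes=10, max_entropy=0.5):
    """Stochastic inference (dropout kept active) plus point-wise uncertainty filtering."""
    model.train()                                   # keep dropout layers stochastic
    probs = torch.stack([model(points).softmax(-1) for _ in range(n_passes)])
    mean_prob = probs.mean(0)                       # (N_points, n_classes)
    entropy = -(mean_prob * mean_prob.clamp_min(1e-8).log()).sum(-1)
    pseudo_labels = mean_prob.argmax(-1)
    keep = entropy < max_entropy                    # only confident points supervise training
    return pseudo_labels, keep
```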


# 133
Strong Double Blind
Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation

Zhaoyang Li · Yuan Wang · Wangkai Li · Rui Sun · Tianzhu Zhang

Point cloud few-shot semantic segmentation (PC-FSS) aims to segment targets of novel categories in a given query point cloud with only a few annotated support samples. The current top-performing prototypical learning methods employ prototypes originating from support samples to direct the classification of query points. However, the inherent fragility of point-level matching and the prevalent intra-class diversity pose great challenges to this cross-instance matching paradigm, leading to erroneous background activations or incomplete target excavation. In this work, we propose a simple yet effective framework in the spirit of Decoupled Localization and Expansion (DLE). The proposed DLE, including a structural localization module (SLM) and a self-expansion module (SEM), enjoys several merits. First, structural information is injected into the matching process through the agent-level correlation in SLM, and the confident target region can thus be precisely located. Second, more reliable intra-object similarity is harnessed in SEM to derive the complete target, and the conservative expansion strategy is introduced to reasonably constrain the expansion. Extensive experiments on two challenging benchmarks under different settings demonstrate that DLE outperforms previous state-of-the-art approaches by large margins.


# 131
Strong Double Blind
CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection

Jinglin Zhan · Tiejun Liu · Rengang Li · Zhaoxiang Zhang · Yuntao Chen

Large-scale 3D bounding box annotation is crucial for LiDAR object detection but comes at a high cost. Semi-supervised object detection (SSOD) offers promising solutions to leverage unannotated data, but the predominant pseudo-labeling approach requires careful hyperparameter tuning for training on noisy teacher labels. In this work, we propose a Cross-Scan Object Transfer (CSOT) paradigm for LiDAR SSOD. Central to our approach is the Hotspot Network, a transformer-based network that predicts possible placement locations for annotated objects in unannotated scans and assigns scores to each location. By leveraging these contextually consistent location predictions, CSOT successfully enables object copy-paste in LiDAR SSOD for the first time. To train object detectors on partially annotated scans generated by CSOT, we adopt a Spatial-Aware classification loss throughout our partial supervision to handle false negative issues caused by treating all unlabeled objects as background. We conduct extensive experiments to verify the efficacy and generality of our method. Compared to other state-of-the-art label-efficient methods used in LiDAR detection, our approach requires the least amount of annotation while achieving the best detection performance. Using only 1% of the labeled data on the Waymo dataset, our semi-supervised detector achieves performance on par with the fully supervised baseline. Similarly, on the nuScenes dataset, our semi-supervised CenterPoint reaches 99% of the fully supervised model's detection performance in terms of NDS score, while using just 5% of the labeled data.


# 204
Strong Double Blind
Interactive 3D Object Detection with Prompts

Ruifei Zhang · Xiangru Lin · Wei Zhang · Jincheng Lu · Xuekuan Wang · Xiao Tan · Yingying Li · Errui Ding · Jingdong Wang · Guanbin Li

The evolution of 3D object detection hinges not only on advanced models but also on effective and efficient annotation strategies. Despite this progress, the labor-intensive nature of 3D object annotation remains a bottleneck, hindering further development in the field. This paper introduces a novel approach to 3D object annotation, multi-modal interactive 3D object detection, built on ``prompt in 2D, detect in 3D'' and ``detect in 3D, refine in 3D'' strategies. Firstly, by allowing users to engage with simpler 2D interaction prompts (e.g., clicks or boxes on a camera image or a bird's eye view), we bridge the complexity gap between 2D and 3D spaces, reimagining the annotation workflow. Our framework also supports flexible iterative refinement of the initial 3D annotations, further assisting annotators in achieving satisfying results. Evaluation on the nuScenes dataset demonstrates the effectiveness of our method. Thanks to the prompt-driven and interactive designs, our approach also exhibits outstanding performance in open-set scenarios. This work not only offers a potential solution to the 3D object annotation problem but also paves the way for further innovations in the 3D object detection community.


# 123
Strong Double Blind
SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection

Huafeng Chen · Pengxu Wei · Guangqian Guo · shan gao

Most Camouflaged Object Detection (COD) methods heavily rely on mask annotations, which are time-consuming and labor-intensive to acquire. Existing weakly-supervised COD approaches exhibit significantly inferior performance compared to fully-supervised methods and struggle to simultaneously support all the existing types of camouflaged object labels, including scribbles, bounding boxes, and points. Even the Segment Anything Model (SAM) struggles to handle weakly-supervised COD, typically encountering challenges of prompt incompatibility with scribble labels, extreme responses, semantically erroneous responses, and unstable feature representations, producing unsatisfactory results in camouflaged scenes. To mitigate these issues, we propose a unified COD framework in this paper, termed SAM-COD, which is capable of supporting arbitrary weakly-supervised labels. Our SAM-COD employs a prompt adapter to handle scribbles as prompts based on SAM. Meanwhile, we introduce response filter and semantic matcher modules to improve the quality of the masks obtained by SAM under COD prompts. To alleviate the negative impacts of inaccurate mask predictions, a new strategy of prompt-adaptive knowledge distillation is utilized to ensure a reliable feature representation. To validate the effectiveness of our approach, we have conducted extensive empirical experiments on three mainstream COD benchmarks. The results demonstrate the superiority of our method against state-of-the-art weakly-supervised and even fully-supervised methods. Our source codes and trained models will be publicly released.


# 125
Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection

Gaurav Bhatt · Leonid Sigal · James Ross

Modern pre-trained architectures struggle to retain previous information while undergoing continuous fine-tuning on new tasks. Despite notable progress in continual classification, systems designed for complex vision tasks such as detection or segmentation still struggle to attain satisfactory performance. In this work, we introduce a memory-based detection transformer architecture to adapt a pre-trained DETR-style detector to new tasks while preserving knowledge from previous tasks. We propose a novel localized query function for efficient information retrieval from memory units, aiming to minimize forgetting. Furthermore, we identify a fundamental challenge in continual detection referred to as {\em background relegation}. This arises when object categories from earlier tasks reappear in future tasks, potentially without labels, leading them to be implicitly treated as background. This is an inevitable issue in continual detection or segmentation. The introduced continual optimization technique effectively tackles this challenge. Finally, we assess the performance of our proposed system on continual detection benchmarks and demonstrate that our approach surpasses the performance of existing state-of-the-art resulting in 5-7\% improvements on MSCOCO and PASCAL-VOC on the task of continual detection.


# 140
Benchmarking Object Detectors with COCO: A New Path Forward

Shweta Singh · Aayan Yadav · Jitesh Jain · Humphrey Shi · Justin Johnson · Karan Desai

The Common Objects in Context (COCO) dataset has been instrumental in benchmarking object detectors over the past decade. Like every dataset, COCO contains subtle errors and imperfections stemming from its annotation procedure. With the advent of high-performing models, we ask whether these errors of COCO are hindering its utility in reliably benchmarking further progress. In search for an answer, we inspect thousands of masks from COCO (2017 version) and uncover different types of errors such as imprecise mask boundaries, non-exhaustively annotated instances, and mislabeled masks. Due to the prevalence of COCO, we choose to correct these errors to maintain continuity with prior research. We develop COCO-ReM (Refined Masks), a cleaner set of annotations with visibly better mask quality than COCO-2017. We evaluate fifty object detectors and find that models that predict visually sharper masks score higher on COCO-ReM, affirming that they were being incorrectly penalized due to errors in COCO-2017. Moreover, our models trained using COCO-ReM converge faster and score higher than their larger variants trained using COCO-2017, highlighting the importance of data quality in improving object detectors. With these findings, we advocate using COCO-ReM for future object detection research. Our dataset is available at https://cocorem.xyz
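Assuming the refined annotations are distributed in the standard COCO JSON format (the file names below are placeholders), evaluating an existing detector against COCO-ReM should follow the usual pycocotools workflow:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("cocorem_instances_val2017.json")        # refined ground-truth masks (placeholder path)
dt = gt.loadRes("my_detector_segm_results.json")   # standard COCO-format detection results

evaluator = COCOeval(gt, dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                              # mask AP measured against the refined masks
```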


# 189
Strong Double Blind
Frequency-Spatial Entanglement Learning for Camouflaged Object Detection

Yanguang Sun · Chunyan Xu · Jian Yang · Hanyu Xuan · Lei Luo

Camouflaged object detection (COD) has attracted a lot of attention in computer vision. The main challenge lies in the high degree of similarity between camouflaged objects and their surroundings in the spatial domain, making identification difficult. Existing methods attempt to reduce the impact of pixel similarity by maximizing the distinguishing ability of spatial features with complicated design, but often ignore the sensitivity and locality of features in the spatial domain, leading to sub-optimal results. In this paper, we propose a new approach to address this issue by jointly exploring the representation in the frequency and spatial domains, introducing the Frequency-Spatial Entanglement Learning (FSEL) method. This method consists of a series of well-designed Entanglement Transformer Blocks (ETB) for representation learning, a Joint Domain Perception Module (JDPM) for semantic enhancement, and a Dual-domain Reverse Parser (DRP) for feature integration in the frequency and spatial domains. Specifically, the ETB utilizes frequency self-attention (FSA) to effectively characterize the relationship between different frequency bands, while the entanglement feed-forward network (EFFN) facilitates information interaction between features of different domains through entanglement learning. Our extensive experiments demonstrate the superiority of FSEL over 21 state-of-the-art (SOTA) methods, through comprehensive quantitative and qualitative comparisons on three widely-used COD datasets.


# 176
GRA: Detecting Oriented Objects through Group-wise Rotating and Attention

Jiangshan Wang · Yifan Pu · Yizeng Han · Jiayi Guo · Yiru Wang · Xiu Li · Gao Huang

Oriented object detection, an emerging task in recent years, aims to identify and locate objects across varied orientations. This requires the detector to accurately capture the orientation information, which varies significantly within and across images. Despite the existing substantial efforts, simultaneously ensuring model effectiveness and parameter efficiency remains challenging in this scenario. In this paper, we propose a lightweight yet effective Group-wise Rotating and Attention (GRA) module to replace the convolution operations in backbone networks for this task. GRA can adaptively capture fine-grained features of objects with diverse orientations, comprising two key components: Group-wise Rotating and Group-wise Attention. Group-wise Rotating firstly divides the convolution kernel into groups, where each group extracts different object features by rotating at a specific angle according to the object orientation. Subsequently, Group-wise Attention is employed to adaptively enhance the object-related regions in the feature. The collaborative effort of these components enables GRA to effectively capture the various orientation information while maintaining parameter efficiency. Extensive experimental results demonstrate the superiority of our method. For example, GRA achieves a new state-of-the-art (SOTA) on the DOTA-v2.0 benchmark, while reducing the number of parameters by nearly 50% compared to the previous SOTA method. Code will be released.


# 168
DQ-DETR: DETR with Dynamic Query for Tiny Object Detection

Yi-Xin Huang · Hou-I Liu · Hong-Han Shuai · Wen-Huang Cheng

Despite previous DETR-like methods having performed successfully in generic object detection, tiny object detection is still a challenging task for them since the positional information of object queries is not customized for detecting tiny objects, whose scale is far smaller than that of general objects. Also, the fixed number of queries used by DETR-like methods makes them unsuitable for aerial datasets, which mostly contain tiny objects and in which the number of instances is imbalanced across different images. Thus, we present a simple yet effective model, DQ-DETR, consisting of three components: categorical counting module, counting-guided feature enhancement, and dynamic query selection to solve the above-mentioned problems. DQ-DETR uses the prediction and density maps from the categorical counting module to dynamically adjust the number and positional information of object queries. Our model DQ-DETR outperforms previous CNN-based and DETR-like methods, achieving a state-of-the-art mAP of 30.2% on the AI-TOD-V2 dataset, which mostly consists of tiny objects.


# 178
Strong Double Blind
AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval

Pavel Suma · Giorgos Kordopatis-Zilos · Ahmet Iscen · Giorgos Tolias

This work investigates the problem of instance-level image retrieval with re-ranking with the constraint of memory efficiency, ultimately aiming to limit memory usage to 1KB per image. Departing from the prevalent focus on performance enhancements, this work prioritizes the crucial trade-off between performance and memory requirements. The proposed model employs a transformer-based architecture designed to estimate image-to-image similarity by capturing interactions within and across images based on their local descriptors. A distinctive property of the model is the capability for asymmetric similarity estimation. Database images are represented with a smaller number of descriptors compared to query images, enabling performance improvements without increasing memory consumption. To ensure adaptability across different applications, a universal model is introduced that adjusts to varying descriptor set cardinalities during the testing phase. Results on standard benchmarks demonstrate the superiority of our approach over both hand-crafted and learned models. In particular, compared with current state-of-the-art methods that overlook their memory footprint, our approach not only attains superior performance but does so with a significantly reduced memory footprint. We intend to make our code publicly available.


# 110
Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation

Zhen Zhao · Zicheng Wang · Dian Yu · Longyue Wang · Yixuan Yuan · Luping Zhou

Semi-supervised medical image segmentation has shown promise in training models with limited labeled data. However, current dominant teacher-student based approaches can suffer from the confirmation bias. To address this challenge, we propose AD-MT, an alternate diverse teaching approach in a teacher-student framework. It involves a single student model and two non-trainable teacher models that are momentum-updated periodically and randomly in an alternate fashion. To mitigate the confirmation bias via the diverse supervision, the core of AD-MT lies in two proposed modules: the Random Periodic Alternate (RPA) Updating Module and the Conflict-Combating Module (CCM). The RPA schedules an alternating diverse updating process with complementary data batches, distinct data augmentation, and random switching periods to encourage diverse reasoning from different teaching perspectives. The CCM employs an entropy-based ensembling strategy to encourage the model to learn from both the consistent and conflicting predictions between the teachers. Experimental results demonstrate the effectiveness and superiority of AD-MT on the 2D and 3D medical segmentation benchmarks across various semi-supervised settings.
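A minimal sketch of the two ingredients described above is given below: alternating EMA updates so that only one teacher follows the student in a given period, and an entropy-based combination of the two teachers' predictions. The random switching schedule, complementary data batches, and the full conflict-combating loss are simplified away, and all hyperparameters are placeholders.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def update_alternate_teacher(step, period, teachers, student):
    active = (step // period) % 2          # alternate which teacher is momentum-updated
    ema_update(teachers[active], student)

@torch.no_grad()
def entropy_weighted_target(logits_t1, logits_t2):
    """Per pixel/voxel: trust whichever teacher is less uncertain."""
    p1, p2 = logits_t1.softmax(1), logits_t2.softmax(1)
    h1 = -(p1 * p1.clamp_min(1e-8).log()).sum(1, keepdim=True)
    h2 = -(p2 * p2.clamp_min(1e-8).log()).sum(1, keepdim=True)
    return torch.where(h1 <= h2, p1, p2)
```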


# 112
Unleashing the Power of Prompt-driven Nucleus Instance Segmentation

Zhongyi Shui · Yunlong Zhang · Kai Yao · Chenglu Zhu · Sunyi Zheng · Jingxiong Li · Honglin Li · YUXUAN SUN · Ruizhe Guo · Lin Yang

Nucleus instance segmentation in histology images is crucial for a broad spectrum of clinical applications. Current dominant algorithms rely on regression of nuclear proxy maps. Distinguishing nucleus instances from the estimated maps requires carefully curated post-processing, which is error-prone and parameter-sensitive. Recently, the Segment Anything Model (SAM) has earned huge attention in medical image segmentation, owing to its impressive generalization ability and promptable property. Nevertheless, its potential on nucleus instance segmentation remains largely underexplored. In this paper, we present a novel prompt-driven framework that consists of a nucleus prompter and SAM for automatic nucleus instance segmentation. Specifically, the prompter is developed to generate a unique point prompt for each nucleus, while SAM is fine-tuned to produce its corresponding mask. Furthermore, we propose to integrate adjacent nuclei as negative prompts to enhance the model's capability to identify overlapping nuclei. Without complicated post-processing, our proposed method sets a new state-of-the-art performance on three challenging benchmarks. The source code is available in the supplementary materials.


# 107
Strong Double Blind
cDP-MIL: Robust Multiple Instance Learning via Cascaded Dirichlet Process

Yihang Chen · TSAI HOR CHAN · Guosheng Yin · Yuming Jiang · Lequan Yu

Multiple instance learning (MIL) has been extensively applied to whole slide histopathology image (WSI) analysis. The existing aggregation strategy in MIL, which primarily relies on the first-order distance (e.g., mean difference) between instances, fails to accurately approximate the true feature distribution of each instance, leading to biased slide-level representations. Moreover, the scarcity of WSI observations easily leads to model overfitting, resulting in unstable testing performance and limited generalizability. To tackle these challenges, we propose a new Bayesian nonparametric framework for multiple instance learning, which adopts a cascade of Dirichlet processes (cDP) to incorporate the instance-to-bag characteristic of the WSIs. We perform feature aggregation based on the latent clusters formed by the Dirichlet process, which incorporates the covariances of the patch features and forms more representative clusters. We then perform bag-level prediction with another Dirichlet process model on the bags, which imposes a natural regularization on learning to prevent overfitting and enhance generalizability. Moreover, as a Bayesian nonparametric method, the cDP model can accurately generate posterior uncertainty, which allows for the detection of outlier samples and tumor localization. Extensive experiments on five WSI benchmarks validate the superior performance of our method, as well as its generalizability and ability to estimate uncertainties. Codes are anonymously available at https://anonymous.4open.science/r/cDPMIL-F130.
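As a simplified illustration of the cluster-then-aggregate step for a single WSI bag, the snippet below uses scikit-learn's truncated Dirichlet-process mixture to cluster patch features and builds a bag embedding from the cluster statistics. The actual cDP-MIL model cascades a second Dirichlet process over bags and is fully Bayesian; all names, sizes, and hyperparameters here are stand-ins.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

patch_features = np.random.randn(500, 64)        # stand-in for patch embeddings of one slide

dpgmm = BayesianGaussianMixture(
    n_components=10,                             # truncation level of the DP
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
).fit(patch_features)

resp = dpgmm.predict_proba(patch_features)                      # (n_patches, n_components)
cluster_means = resp.T @ patch_features / resp.sum(0, keepdims=True).T
cluster_weights = resp.mean(0)                                  # empirical cluster occupancy
bag_embedding = (cluster_weights[:, None] * cluster_means).sum(0)
```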


# 109
Strong Double Blind
Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification

Linhao Qu · Dingkang Yang · Dan Huang · Qinhao Guo · rongkui luo · Shaoting Zhang · Xiaosong Wang

Current multi-instance learning algorithms for pathology image analysis often require a substantial number of Whole Slide Images for effective training but exhibit suboptimal performance in scenarios with limited learning data. In clinical settings, restricted access to pathology slides is inevitable due to patient privacy concerns and the prevalence of rare or emerging diseases. The emergence of the Few-shot Weakly Supervised WSI Classification accommodates the significant challenge of the limited slide data and sparse slide-level labels for diagnosis. Prompt learning based on the pre-trained models (e.g., CLIP) appears to be a promising scheme for this setting; however, current research in this area is limited, and existing algorithms often focus solely on patch-level prompts or confine themselves to language prompts. This paper proposes a multi-instance prompt learning framework enhanced with pathology knowledge, i.e., integrating visual and textual prior knowledge into prompts at both patch and slide levels. The training process employs a combination of static and learnable prompts, effectively guiding the activation of pre-trained models and further facilitating the diagnosis of key pathology patterns. Lightweight Messenger and Summary layers are introduced to model relationships between patches and slides within the same patient data. Additionally, alignment-wise contrastive losses ensure the feature-level alignment between visual and textual learnable prompts for both patches and slides. Our method demonstrates superior performance in three challenging clinical tasks, significantly outperforming comparative few-shot methods.


# 108
Strong Double Blind
Learning with Counterfactual Explanations for Radiology Report Generation

Mingjie Li · Haokun Lin · Liang Qiu · Xiaodan Liang · Ling Chen · Abdulmotaleb Elsaddik · Xiaojun Chang

Due to the common content of anatomy, radiology images with their corresponding reports exhibit high similarity. Such inherent data bias can predispose automatic report generation models to learn entangled and spurious representations resulting in misdiagnostic reports. To tackle these, we propose a novel \textbf{Co}unter\textbf{F}actual \textbf{E}xplanations-based framework (CoFE) for radiology report generation. Counterfactual explanations serve as a potent tool for understanding how decisions made by algorithms can be changed by asking ``what if'' scenarios. By leveraging this concept, CoFE can learn non-spurious visual representations by contrasting the representations between factual and counterfactual images. Specifically, we derive counterfactual images by swapping a patch between positive and negative samples until a predicted diagnosis shift occurs. Here, positive and negative samples are the most semantically similar but have different diagnosis labels. Additionally, CoFE employs a learnable prompt to efficiently fine-tune the pre-trained large language model, encapsulating both factual and counterfactual content to provide a more generalizable prompt representation. Extensive experiments on two benchmarks demonstrate that leveraging the counterfactual explanations enables CoFE to generate semantically coherent and factually complete reports and outperform in terms of language generation and clinical efficacy metrics.


# 115
Improving Medical Multi-modal Contrastive Learning with Expert Annotations

Yogesh Kumar · Pekka Marttinen

We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the ``modality gap'' -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases considerable improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
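The mixup component mentioned above is a standard technique; the sketch below shows how a scarce batch of expert-annotated embeddings can be stretched by convexly combining samples. This is a generic illustration of mixup, not eCLIP's exact heatmap-processing pipeline, and `alpha` is a placeholder.

```python
import torch

def mixup_embeddings(emb_a, emb_b, alpha=0.4):
    """Convexly combine two batches of embeddings with Beta-distributed weights."""
    lam = torch.distributions.Beta(alpha, alpha).sample((emb_a.size(0), 1)).to(emb_a.device)
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    return mixed, lam      # lam is also used to mix the corresponding targets/pairings
```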


# 88
Strong Double Blind
Few-shot Defect Image Generation based on Consistency Modeling

Qingfeng Shi · Jing Wei · Fei Shen · Zhengtao Zhang

Image generation can solve insufficient labeled data issues in defect detection. Most defect generation methods are only trained on a single product without considering the consistencies among multiple products, leading to poor quality and diversity of generated results. To address these issues, we propose DefectDiffu, a novel text-guided diffusion method to model both intra-product background consistency and inter-product defect consistency across multiple products and modulate the consistency perturbation directions to control product type and defect strength, achieving diversified defect image generation. Firstly, we leverage a text encoder to separately provide consistency prompts for background, defect, and fusion parts of the disentangled integrated architecture, thereby disentangling defects and normal backgrounds. Secondly, we propose the double-free strategy to generate defect images through two-stage perturbation of consistency direction, thereby controlling product type and defect strength by adjusting the perturbation scale. Besides, DefectDiffu can generate defect mask annotations utilizing cross-attention maps from the defect part. Finally, to improve the generation quality of small defects and masks, we propose the adaptive attention-enhance loss to increase the attention to defects. Experimental results demonstrate that DefectDiffu surpasses state-of-the-art methods in terms of generation quality and diversity, thus effectively improving downstream detection performance. Moreover, defect perturbation directions can be transferred among various products to achieve zero-shot defect generation, which is highly beneficial for addressing insufficient data issues. The code will be released upon paper acceptance.


# 121
Placing Objects in Context via Inpainting for Out-of-distribution Segmentation

Pau de Jorge Aranda · Riccardo Volpi · Puneet Dokania · Philip Torr · Grégory Rogez

When deploying a semantic segmentation model into the real world, it will inevitably encounter semantic classes that were not seen during training. Therefore, to ensure a safe deployment of such systems, it is crucial to accurately evaluate and improve their anomaly segmentation capabilities. However, acquiring and labelling semantic segmentation data is expensive and unanticipated conditions are long-tail and potentially hazardous. Indeed, existing anomaly segmentation datasets capture a limited number of anomalies, lack realism or have strong domain shifts. In this paper, we propose the Placing Objects in Context (POC) pipeline to realistically add any object into any image via diffusion models. POC can be used to easily extend any dataset with an arbitrary number of objects. In our experiments, we present different anomaly segmentation datasets based on POC-generated data and show that POC can improve the performance of recent state-of-the-art anomaly fine-tuning methods across several standardized benchmarks. POC is also effective for learning new classes. For example, we utilize it to augment Cityscapes samples by incorporating a subset of Pascal classes and demonstrate that models trained on such data achieve comparable performance to the Pascal-trained baseline. This corroborates the low synth2real gap of models trained on POC-generated images.
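The core mechanism, placing an object into an existing image via diffusion inpainting, can be reproduced with off-the-shelf tools. The sketch below uses the diffusers library with a placeholder checkpoint, prompt, and mask, and omits POC's object/location selection and filtering steps.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

scene = Image.open("street_scene.png").convert("RGB").resize((512, 512))    # placeholder files
mask = Image.open("placement_region.png").convert("L").resize((512, 512))   # white = area to fill

result = pipe(prompt="a brown cow standing on the road",
              image=scene, mask_image=mask).images[0]
result.save("scene_with_anomalous_object.png")
```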


# 44
Strong Double Blind
Learning Diffusion Models for Multi-View Anomaly Detection

Chieh Liu · Yu-Min Chu · Ting-I Hsieh · Hwann-Tzong Chen · Tyng-Luh Liu

We are exploring an emerging formulation in anomaly detection (AD) where multiple instances of the same object are produced simultaneously and distinctly to address the limitation that using only a single instance may not effectively capture any underlying defects. More specifically, we concentrate on a specific scenario where each object of interest is linked to seven distinct data views/representations. The first six views involve capturing images with a stationary camera under six different lighting conditions, while the seventh view pertains to the 3D normal information. We refer to our intended task as multi-view anomaly detection. To tackle this problem, our approach involves training a view-invariant ControlNet that can produce consistent feature maps regardless of the data views. This training strategy enables us to mitigate the impact of varying lighting conditions and to fuse information from both the RGB color appearance and the 3D normal geometry effectively. Moreover, as the diffusion process is not deterministic, we utilize the DDIM scheme to improve the applicability of our established memory banks of diffusion-based features for anomaly detection inference. To demonstrate the efficacy of our approach, we present extensive ablation studies and state-of-the-art experimental results on the Eyecandies dataset.


# 41
Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection

Liren He · Zhengkai Jiang · Jinlong Peng · Wenbing Zhu · Liang Liu · Qiangang Du · Xiaobin Hu · Mingmin Chi · Yabiao Wang · Chengjie Wang

In the field of multi-class anomaly detection, reconstruction-based methods derived from single-class anomaly detection face the well-known challenge of ``learning shortcuts'', wherein the model fails to learn the patterns of normal samples as it should, opting instead for shortcuts such as identity mapping or artificial noise elimination. Consequently, the model becomes unable to reconstruct genuine anomalies as normal instances, resulting in a failure of anomaly detection. To counter this issue, we present a novel unified feature reconstruction-based anomaly detection framework termed RLR (Reconstruct features from a Learnable Reference representation). Unlike previous methods, RLR utilizes learnable reference representations to compel the model to learn normal feature patterns explicitly, thereby preventing the model from succumbing to the ``learning shortcuts'' issue. Additionally, RLR incorporates locality constraints into the learnable reference to facilitate more effective normal pattern capture and utilizes a masked learnable key attention mechanism to enhance robustness. Evaluation of RLR on the 15-category MVTec-AD dataset and the 12-category VisA dataset shows superior performance compared to state-of-the-art methods under the unified setting. The code of RLR will be publicly available.


# 198
Strong Double Blind
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models

Yuchen Yang · Kwonjoon Lee · Behzad Dariush · Yinzhi Cao · Shao-Yuan Lo

Video Anomaly Detection (VAD) is crucial for applications such as security surveillance and autonomous driving. However, existing VAD methods provide little rationale behind detection, hindering public trust in real-world deployments. In this paper, we approach VAD with a reasoning framework. Although Large Language Models (LLMs) have shown revolutionary reasoning ability, we find that their direct use falls short of VAD. Specifically, the implicit knowledge pre-trained in LLMs focuses on general context and thus may not apply to every specific real-world VAD scenario, leading to inflexibility and inaccuracy. To address this, we propose AnomalyRuler, a novel rule-based reasoning framework for VAD with LLMs. AnomalyRuler comprises two main stages: induction and deduction. In the induction stage, the LLM is fed with few-shot normal reference samples and then summarizes these normal patterns to induce a set of rules for detecting anomalies. The deduction stage follows the induced rules to spot anomalous frames in test videos. Additionally, we design rule aggregation, perception smoothing, and robust reasoning strategies to further enhance AnomalyRuler's robustness. AnomalyRuler is the first reasoning approach for the one-class VAD task, which requires only few-normal-shot prompting without the need for full-shot training, thereby enabling fast adaption to various VAD scenarios. Comprehensive experiments across four VAD benchmarks demonstrate AnomalyRuler's state-of-the-art detection performance and reasoning ability. AnomalyRuler is open-source and available at: \url{https://anonymous.4open.science/r/anomaly_detection-BB71}.


# 122
Strong Double Blind
Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent

NianHui Guo · Hong Guo · Christoph Meinel · Haojin Yang

Binary Neural Networks (BNNs) offer a promising avenue toward achieving efficient deep-learning models but are hindered by the inherent challenge of aligning noisy floating-point gradients with binary parameters. To address this, we introduce Diode, a groundbreaking optimizer designed explicitly for BNNs that bridges this gap by utilizing the gradient's sign information in a unique, latent-weight-free approach. By focusing on the gradient sign's lower-order moment estimate for parameter updates, Diode uniformly fine-tunes binary parameters, significantly enhancing model convergence without the dependency on 32-bit latent weights or embedding buffers. This paper showcases Diode's superior performance through comprehensive evaluations on a variety of vision and Natural Language Processing (NLP) tasks. Remarkably, Diode advances the state-of-the-art by increasing BNext-18 Top-1 accuracy on ImageNet ILSVRC2012 by 0.96\% with eightfold fewer training iterations. In the case of ReActNet, Diode not only matches but slightly exceeds previous benchmarks without resorting to complex multi-stage optimization strategies, effectively halving the training duration. Additionally, Diode proves its robust generalization capability on the binary BERT architecture within the GLUE benchmark, outperforming the existing BiT design by 3.3\% without data augmentation and establishing a new SOTA accuracy of 78.8\% with augmentation. The implementation of Diode will be made available to the public at: (insert the link in the final version).
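To make the "sign moment, latent-weight-free" idea concrete, here is a simplified sign-flip optimizer that tracks only an EMA of the gradient sign and flips a binary weight when that estimate pushes against it. This is an illustrative sketch in the spirit of the description above, not the published Diode update rule, and the momentum and threshold values are placeholders.

```python
import torch

class SignFlipOptimizer:
    """Keeps an EMA of sign(grad) per parameter; no 32-bit latent weights are stored."""
    def __init__(self, params, beta=0.99, flip_threshold=0.1):
        self.params = [p for p in params if p.requires_grad]
        self.moments = [torch.zeros_like(p) for p in self.params]
        self.beta, self.flip_threshold = beta, flip_threshold

    @torch.no_grad()
    def step(self):
        for p, m in zip(self.params, self.moments):
            if p.grad is None:
                continue
            m.mul_(self.beta).add_(p.grad.sign(), alpha=1.0 - self.beta)
            # Flipping p -> -p lowers the loss (to first order) when grad and p share a sign.
            flip = m * p.sign() > self.flip_threshold
            p[flip] = -p[flip]

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```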


# 139
Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training

Hyesong Choi · Hyejin Park · Kwang Moo Yi · Sungmin Cha · Dongbo Min

In this paper, we introduce Saliency-Based Adaptive Masking (SBAM), a novel and cost-effective approach that significantly enhances the pre-training performance of Masked Image Modeling (MIM) approaches by prioritizing token salience. Our method provides robustness against variations in masking ratios, effectively mitigating the performance instability issues common in existing methods. This relaxes the sensitivity of MIM-based pre-training to masking ratios, which in turn allows us to propose an adaptive strategy for `tailored' masking ratios for each data sample, which no existing method can provide. Toward this goal, we propose an Adaptive Masking Ratio (AMR) strategy that dynamically adjusts the proportion of masking for the unique content of each image based on token salience. We show that our method significantly improves over the state-of-the-art in mask-based pre-training on the ImageNet-1K dataset.
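A rough sketch of salience-prioritized masking with a per-image ratio is shown below. The salience scores, the preference for masking high-salience tokens, and the entropy-based ratio rule are all illustrative assumptions, not the paper's exact SBAM/AMR formulation.

```python
import math
import torch

def salience_adaptive_mask(salience, base_ratio=0.75, spread=0.15):
    """salience: (B, N) non-negative per-token scores. Returns a bool mask (True = masked)."""
    B, N = salience.shape
    probs = salience / salience.sum(dim=1, keepdim=True).clamp_min(1e-8)
    norm_entropy = -(probs * probs.clamp_min(1e-8).log()).sum(1) / math.log(N)
    # One plausible adaptive rule: images with concentrated salience get a lower masking ratio.
    ratio = (base_ratio - spread * (1.0 - norm_entropy)).clamp(0.4, 0.9)
    order = probs.argsort(dim=1, descending=True)        # most salient tokens first
    mask = torch.zeros(B, N, dtype=torch.bool, device=salience.device)
    for b in range(B):
        k = int(ratio[b].item() * N)
        mask[b, order[b, :k]] = True                     # preferentially hide salient tokens
    return mask
```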


# 175
SNP: Structured Neuron-level Pruning to Preserve Attention Scores

Kyunghwan Shim · Jaewoong Yun · Shinkook Choi

Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs), which have achieved great success in various vision tasks. However, their high computational cost and memory footprint hinder their deployment on resource-constrained devices. Conventional pruning approaches can only compress and accelerate the MSA module using head pruning, although the head is not an atomic unit. To address this issue, we propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP). SNP prunes neurons with less informative attention scores and eliminates redundancy among heads. Specifically, it prunes graphically connected query and key layers having the least informative attention scores while preserving the overall attention scores. Value layers, which can be pruned independently, are pruned to eliminate inter-head redundancy. Our proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors. For instance, DeiT-Small with SNP runs 3.1 times faster than the original model, while also running 21.94\% faster and scoring 1.12\% higher than DeiT-Tiny. Additionally, SNP accelerates the efficiently designed Transformer model, EfficientFormer, by 1.74 times on the Jetson Nano with acceptable performance degradation.


# 173
Tiny Models are the Computational Saver for Large Models

Qingyuan Wang · Barry Cardiff · Antoine Frappé · Benoit Larras · Deepu John

This paper introduces TinySaver, an early-exit-like dynamic model compression approach which employs tiny models to adaptively substitute for large models. Distinct from traditional compression techniques, dynamic methods like TinySaver can leverage differences in input difficulty to allow certain inputs to complete their inference processes early, thereby conserving computational resources. Most existing early exit designs are implemented by attaching additional network branches to the model's backbone. Our study, however, reveals that completely independent tiny models can replace a substantial portion of the larger models' job with minimal impact on performance. Employing them as the first exit can remarkably enhance computational efficiency. By searching for and employing the most appropriate tiny model as the computational saver for a given large model, the proposed approach works as a novel and generic method for model compression. This finding will help the research community in exploring new compression methods to address the escalating computational demands posed by rapidly evolving AI models. Our evaluation of this approach in ImageNet-1k classification demonstrates its potential to reduce the number of compute operations by up to 90\%, with only negligible losses in performance, across various modern vision models. The code of this work will be available.
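At inference time, the behaviour amounts to a confidence gate in front of the large model. A generic sketch (the models, the softmax-confidence criterion, and the threshold are placeholders, and the paper's saver-search procedure is not shown) could look like this:

```python
import torch

@torch.no_grad()
def saver_inference(tiny_model, large_model, images, threshold=0.8):
    """Route easy inputs through the tiny model; only hard inputs reach the large model."""
    tiny_logits = tiny_model(images)
    conf, _ = tiny_logits.softmax(-1).max(-1)
    easy = conf >= threshold
    out = tiny_logits.clone()
    if (~easy).any():
        out[~easy] = large_model(images[~easy])
    return out, easy.float().mean()        # predictions + fraction of compute saved
```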


# 171
Strong Double Blind
Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Shibo Jie · Yehui Tang · Jianyuan Guo · Zhi-Hong Deng · Kai Han · Yunhe Wang

Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks, these approaches suffer from a significant performance drop when the compression degrees are mismatched between training and inference stages, which limits the application of token compression on off-the-shelf trained models. In this paper, we propose a model arithmetic framework to decouple the compression degrees between the two stages. In advance, we additionally perform a fast parameter-efficient self-distillation stage on the pre-trained model to obtain a small plugin, called Token Compensator (ToCom), which describes the gap between models across different compression degrees. During inference, ToCom can be directly inserted into any downstream off-the-shelf models with any mismatched training and inference compression degrees to acquire universal performance improvements without further training. Experiments on over 20 downstream tasks demonstrate the effectiveness of our framework. On CIFAR100, fine-grained visual classification, and the VTAB-1k benchmark, ToCom can yield maximum improvements of 2.3%, 1.5%, and 2.0% in the average performance of DeiT-B, respectively.


# 29
Strong Double Blind
Trainable Highly-expressive Activation Functions

Irit Chelly · Shahaf Finder · Shira Ifergane · Oren Freifeld

Nonlinear activation functions are pivotal to the success of deep neural nets, and choosing the appropriate activation function can significantly affect their performance. Most networks use fixed activation functions (e.g., ReLU, GELU, etc.), which can be sub-optimal as their expressiveness is limited. Furthermore, distinct layers may benefit from diverse activation functions. Consequently, there has been a growing interest in trainable activation functions. In this paper, we introduce DiTAC, a trainable highly-expressive activation function based on an efficient diffeomorphic transformation. Despite introducing only a negligible number of trainable parameters, DiTAC enhances model expressiveness and performance, often yielding substantial improvements. It also outperforms existing activation functions (regardless of whether the latter are fixed or trainable) in tasks such as semantic segmentation, image generation, regression problems, and image classification. Our code will be made publicly available upon acceptance.
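For context on what "trainable activation" means in code, here is a deliberately minimal example: a learnable convex combination of fixed basis nonlinearities with three parameters per layer. It illustrates the general concept only and is not DiTAC's diffeomorphic (CPAB-style) transformation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableActivation(nn.Module):
    """A tiny trainable activation: softmax-weighted mix of ReLU, tanh, and identity."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))   # only three trainable parameters

    def forward(self, x):
        w = torch.softmax(self.logits, dim=0)
        return w[0] * F.relu(x) + w[1] * torch.tanh(x) + w[2] * x
```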


# 33
Strong Double Blind
HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion

Junhao Su · Chenghao He · Feiyu Zhu · Xiaojie Xu · Dongzhi Guan · Chenyang Si

Traditional deep learning relies on end-to-end backpropagation for training, but it suffers from drawbacks such as high memory consumption and not aligning with biological neural networks. Recent advancements have introduced locally supervised learning, which divides networks into modules with isolated gradients and trains them locally. However, this approach can lead to performance lag due to limited interaction between these modules, and the design of auxiliary networks occupies a certain amount of GPU memory. To overcome these limitations, we propose a novel model called HPFF that performs hierarchical locally supervised learning and patch-level feature computation on the auxiliary networks. Hierarchical Locally Supervised Learning (HiLo) enables the network to learn features at different granularity levels along their respective local paths. Specifically, the network is divided into two-level local modules: independent local modules and cascade local modules. The cascade local modules combine two adjacent independent local modules, incorporating both updates within the modules themselves and information exchange between adjacent modules. Patch Feature Fusion (PFF) reduces GPU memory usage by splitting the input features of the auxiliary networks into patches for computation. By averaging these patch-level features, it enhances the network's ability to focus more on those patterns that are prevalent across multiple patches. Furthermore, our method exhibits strong generalization capabilities and can be seamlessly integrated with existing techniques. We conduct experiments on CIFAR-10, STL-10, SVHN, and ImageNet datasets, and the results demonstrate that our proposed HPFF significantly outperforms previous approaches, consistently achieving state-of-the-art performance across different datasets.


# 35
To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of Point Cloud Transfer Learning

Souhail Hadgi · Lei Li · Maks Ovsjanikov

Transfer learning has long been a key factor in the advancement of many fields including 2D image analysis. Unfortunately, its applicability in 3D data processing has been relatively limited. While several approaches for 3D transfer learning have been proposed in recent literature, with contrastive learning gaining particular prominence, most existing methods in this domain have only been studied and evaluated in limited scenarios. Most importantly, there is currently a lack of principled understanding of both when and why 3D transfer learning methods are applicable. Remarkably, even the applicability of standard supervised pre-training is poorly understood. In this work, we conduct the first in-depth quantitative and qualitative investigation of supervised and contrastive pre-training strategies and their utility in downstream 3D tasks. We demonstrate that layer-wise analysis of learned features provides significant insight into the downstream utility of trained networks. Informed by this analysis, we propose a simple geometric regularization strategy, which improves the transferability of supervised pre-training. Our work thus sheds light onto both the specific challenges of 3D transfer learning, as well as strategies to overcome them.


# 37
Strong Double Blind
SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning

Qi Qian · Yuanhong Xu · JUHUA HU

With the success of deep learning, deep features that are extracted as outputs from the last few layers of a pre-trained deep model have attracted much attention. Unlike hand-crafted features, deep features are data/task-dependent, while still performing well on different downstream tasks. With the recent advancement in unsupervised representation learning, in this work, we revisit the performance of the last layer features extracted from self-supervised pre-trained models. Compared with fine-tuning that can explore diverse augmentations, e.g., random crop/flipping, in the original input space, obtaining appropriate semantic augmentation in the feature space of extracted deep features is challenging. To unleash the potential of deep features, we propose a novel semantic adversarial augmentation (SeA) in the feature space for learning with fixed deep features. Experiments are conducted on $11$ benchmark downstream classification tasks with $4$ popular pre-trained models. Our method is $2\%$ better than the baseline without SeA on average. Moreover, compared to the expensive fine-tuning that is expected to give better performance, SeA shows a comparable performance on $6$ out of $11$ tasks, demonstrating the effectiveness of our proposal in addition to its efficiency.
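As a rough analogue of augmenting in the feature space, the sketch below applies an FGSM-style perturbation to fixed deep features before training a linear classifier on them. SeA's actual semantic-direction construction is more involved, and the step size `eps` is a placeholder.

```python
import torch
import torch.nn.functional as F

def adversarial_feature_step(classifier, feats, labels, optimizer, eps=0.05):
    """Perturb fixed features in the loss-increasing direction, then train on them."""
    feats = feats.detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(feats), labels)
    grad, = torch.autograd.grad(loss, feats)
    feats_adv = (feats + eps * grad.sign()).detach()   # harder version of the same features

    optimizer.zero_grad()
    loss_adv = F.cross_entropy(classifier(feats_adv), labels)
    loss_adv.backward()
    optimizer.step()
    return loss_adv.item()
```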


# 36
Strong Double Blind
Linearly Controllable GAN: Unsupervised Feature Categorization and Decomposition for Image Generation and Manipulation

Sehyung Lee · Mijung Kim · Yeongnam Chae · Bjorn Stenger

This paper introduces a pioneering approach to linearly controllable generative adversarial network (LC-GAN) driven by unsupervised learning. Departing from traditional methods relying on supervision signals or post-processing for latent feature disentanglement, our proposed technique enables unsupervised learning using only image data through contrastive feature categorization and spectral regularization. In our framework, the discriminator constructs geometry- and appearance-related feature spaces using a combination of image augmentation and contrastive representation learning. Leveraging these feature spaces, the generator autonomously categorizes input latent codes into geometry- and appearance-related features. Subsequently, the categorized features undergo projection into a subspace via our proposed spectral regularization, with each component adeptly controlling a distinct aspect of the generated image. Beyond providing fine-grained control over the generative model, our approach achieves state-of-the-art image generation quality on benchmark datasets, including FFHQ, CelebA-HQ, and AFHQ-V2.


# 71
Strong Double Blind
Diagnosing and Re-learning for Balanced Multimodal Learning

Yake Wei · Siwei Li · Ruoxuan Feng · Di Hu

To overcome the imbalanced multi-modal learning problem, where models prefer the training of specific modalities, existing methods propose to control the training of uni-modal encoders from different perspectives, taking the inter-modal performance discrepancy as the basis. However, the intrinsic limitation of modality capacity is ignored. The scarcely informative modalities are always recognized as ``worse-learnt'' ones in existing methods, which could force the model to memorize more noise, counterproductively affecting the multi-modal model ability. Moreover, the current modality modulation methods narrowly concentrate on selected worse-learnt modalities, even suppressing the training of others. Hence, it is essential to reasonably assess the learning state of each modality and take all modalities into account during balancing. To this end, we propose the Diagnosing & Re-learning method. The learning state of each modality is firstly estimated based on the separability of its uni-modal representation space, and then used to softly re-initialize the corresponding uni-modal encoder. In this way, encoders of worse learnt modalities are enhanced, simultaneously avoiding the over-training of other modalities. Accordingly, the multi-modal learning is effectively balanced and enhanced. Experiments covering multiple types of modalities and multi-modal frameworks demonstrate the superior performance of our simple-yet-effective method for balanced multi-modal learning.
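The re-learning step can be pictured as a soft re-initialization that interpolates each uni-modal encoder between its current weights and a fresh initialization, with a coefficient derived from the diagnosed learning state. The sketch below shows only that interpolation; how the separability score is computed and mapped to the coefficient follows the paper and is not reproduced here.

```python
import torch

@torch.no_grad()
def soft_reinit(encoder, fresh_encoder, retain):
    """retain in [0, 1]: 1 keeps the trained weights, 0 resets to the fresh initialization."""
    for p, p0 in zip(encoder.parameters(), fresh_encoder.parameters()):
        p.copy_(retain * p + (1.0 - retain) * p0)
```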


# 141
Strong Double Blind
Visual Prompting via Partial Optimal Transport

MENGYU ZHENG · Zhiwei Hao · Yehui Tang · Chang Xu

Visual prompts represent a lightweight approach that adapts pre-trained models to downstream tasks without modifying the model parameters. They strategically transform the input and output through prompt engineering and label mapping, respectively. Yet, existing methodologies often overlook the synergy between these components, leaving the intricate relationship between them underexplored. To address this, we propose an Optimal Transport-based Label Mapping strategy (OTLM) that effectively reduces distribution migration and lessens the modifications required by the visual prompts. Specifically, we reconceptualize label mapping as a partial optimal transport problem, and introduce a novel transport cost matrix. Through the optimal transport framework, we establish a connection between output-side label mapping and input-side visual prompting. Additionally, we analyze frequency-based label mapping methods within this framework and demonstrate the superiority of our OTLM method. Our experiments across multiple datasets and various model architectures demonstrate significant performance improvements, which prove the effectiveness of the proposed method.
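To illustrate what an optimal-transport label mapping looks like (in a simplified, full rather than partial form), the snippet below computes an entropic transport plan between downstream and source classes with a small Sinkhorn loop and reads a mapping off the plan. The cost construction and all sizes are hypothetical.

```python
import torch

def sinkhorn(cost, eps=0.05, n_iters=200):
    """Entropic OT with uniform marginals; returns the transport plan."""
    K = torch.exp(-cost / eps)
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# cost[i, j]: e.g. 1 - frequency of pre-trained class j among predictions on
# downstream class i (a hypothetical construction).
cost = torch.rand(10, 1000)                 # 10 downstream classes vs. 1000 source classes
plan = sinkhorn(cost)
label_map = plan.argmax(dim=1)              # source class assigned to each downstream class
```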


# 117
Strong Double Blind
Pseudo-Labelling Should Be Aware of Disguising Channel Activations

Changrui Chen · Kurt Debattista · Jungong Han

The pseudo-labelling algorithm is highly effective across various tasks, particularly in semi-supervised learning, yet its vulnerabilities are not always apparent on benchmark datasets, leading to suboptimal real-world performance. In this paper, we identified some channel activations in pseudo-labelling methods, termed disguising channel activations (abbreviated as disguising activations in the following sections), which exacerbate the confirmation bias issue when the training data distribution is inconsistent. Even state-of-the-art semi-supervised learning models exhibit significantly different levels of activation on some channels for data in different distributions, impeding the full potential of pseudo labelling. We take a novel perspective to address this issue by analysing the components of each channel's activation. Specifically, we model the activation of each channel as the mixture of two independent components. The mixture proportion enables us to identify the disguising activations, making it possible to employ our straightforward yet effective regularisation to attenuate the correlation between pseudo labels and disguising activations. This mitigation reduces the error risk of pseudo-label inference, leading to more robust optimization. The regularisation introduces no additional computing costs during the inference phase and can be seamlessly integrated as a plug-in into pseudo-labelling algorithms in various downstream tasks. Our experiments demonstrate that the proposed method achieves state-of-the-art results across 6 benchmark datasets in diverse vision tasks, including image classification, semantic segmentation, and object detection.


# 143
Strong Double Blind
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

Sungyeon Kim · Boseung Jeong · Donghyun Kim · Suha Kwak

Large-scale image-text pre-trained models enable zero-shot classification and provide consistent accuracy across various data distributions. Nonetheless, optimizing these models in downstream tasks typically requires fine-tuning, which reduces generalization to out-of-distribution (OOD) data and demands extensive computational resources. We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks while simultaneously addressing both these issues. Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and reduce storage expenses substantially. Furthermore, we propose MPM-NCE loss designed for fine-tuning on vision-language downstream tasks. It ensures precise alignment of multiple image-text pairs and discriminative feature learning. By extending the benchmark for robust fine-tuning beyond classification to include diverse tasks such as cross-modal retrieval and open vocabulary segmentation, we demonstrate the broad applicability of R-Adapter. Our extensive experiments demonstrate that R-Adapter achieves state-of-the-art performance across a diverse set of tasks, tuning only 13% of the parameters of the CLIP encoders.
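
The abstract only names the MPM-NCE loss, so the following is one plausible shape for a multi-positive, margin-penalized contrastive objective rather than the paper's exact formulation; the margin placement, temperature, and averaging over positives are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_positive_margin_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                              pos_mask: torch.Tensor,
                              margin: float = 0.1, tau: float = 0.05) -> torch.Tensor:
    """Contrastive loss where each image may match several texts
    (pos_mask[i, j] = True) and positive logits carry an additive margin."""
    logits = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).t() / tau
    logits = logits - (margin / tau) * pos_mask.float()      # margin penalty on positives
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_log_prob = (log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -pos_log_prob.mean()

# usage with hypothetical adapter outputs:
# loss = multi_positive_margin_nce(image_features, text_features, positive_pair_mask)
```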


# 43
Strong Double Blind
Unsupervised Representation Learning by Balanced Self Attention Matching

Daniel Shalam · Simon Korman

Many leading self-supervised methods for unsupervised representation learning, in particular those for embedding image features, are built on variants of the instance discrimination task, whose optimization is known to be prone to instabilities that can lead to feature collapse. Different techniques have been devised to circumvent this issue, including the use of negative pairs with different contrastive losses, the use of external memory banks, and breaking of symmetry by using separate encoding networks with possibly different structures. Our method, termed BAM, rather than directly matching features of different views (augmentations) of input images, is based on matching their self-attention vectors, which are the distributions of similarities to the entire set of augmented images of a batch. We obtain rich representations and avoid feature collapse by minimizing a loss that matches these distributions to their globally balanced and entropy regularized version, which is obtained through a simple self-optimal-transport computation. We ablate and verify our method through a wide set of experiments that show competitive performance with leading methods on both semi-supervised and transfer-learning benchmarks. Our implementation and pre-trained models will be made publicly available.
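
A compact sketch of the matching idea described above: each view's self-attention distribution over the batch is pulled towards a balanced, entropy-regularized target obtained by a few Sinkhorn normalizations of the same similarity matrix. The temperature, number of iterations, and the choice to balance the raw similarity matrix are assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def sinkhorn_balance(sim: torch.Tensor, eps: float = 0.1, n_iters: int = 3) -> torch.Tensor:
    """Turn a batch similarity matrix into a (nearly) doubly-stochastic,
    entropy-regularized assignment via Sinkhorn row/column normalizations."""
    Q = torch.exp(sim / eps)
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True)   # rows sum to 1
        Q = Q / Q.sum(dim=0, keepdim=True)   # columns sum to 1
    return Q / Q.sum(dim=1, keepdim=True)    # rows as distributions over the batch

def bam_style_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: L2-normalized embeddings of two views, shape (B, D).
    Match each sample's self-attention distribution to its balanced version."""
    sim = z1 @ z2.t()                                    # (B, B) cross-view similarities
    attn = F.softmax(sim / tau, dim=1)                   # self-attention vectors
    with torch.no_grad():
        target = sinkhorn_balance(sim)                   # balanced, entropy-regularized target
    return -(target * torch.log(attn + 1e-8)).sum(dim=1).mean()

# usage with a hypothetical encoder and two augmentations:
# z1 = F.normalize(encoder(aug1(x)), dim=1); z2 = F.normalize(encoder(aug2(x)), dim=1)
# loss = bam_style_loss(z1, z2)
```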


# 66
Strong Double Blind
Optimal Transport of Diverse Unsupervised Tasks for Robust Learning from Noisy Few-Shot Data

Xiaofan Que · Qi Yu

Noisy few-shot learning (NFSL) presents novel challenges primarily due to the interplay between noisy labels and limited training data. While data cleansing offers a viable solution to address noisy labels in general learning settings, it exacerbates information loss in FSL due to limited training data, resulting in inadequate model training. To best recover the underlying task manifold corrupted by the noisy labels, we resort to learning from uniquely designed unsupervised auxiliary tasks to compensate for information loss. Using unsupervised tasks can effectively avoid additional annotation costs and minimize the risk of introducing additional label noise. However, a randomly constructed unsupervised task may misguide the model to learn sample-specific features that are likely to compromise the primary few-shot learning task due to the noisy, weak learning signals. We propose to conduct novel auxiliary task selection to ensure intra-diversity among the unlabeled samples within a task. Domain-invariant features are then learned from carefully constructed auxiliary tasks to best recover the original data manifold. We conduct theoretical analysis to derive novel generalization bounds for learning with auxiliary tasks. Extensive experiments are conducted to demonstrate that our method outperforms existing noisy few-shot learning methods under various in-domain and cross-domain few-shot classification benchmarks.


# 49
Gradient-based Out-of-Distribution Detection

Taha Entesari · Sina Sharifi · Bardia Safaei · Vishal Patel · Mahyar Fazlyab

One of the challenges for neural networks in real-life applications is the overconfident errors these models make when the data is not from the original training distribution. Addressing this issue is known as Out-of-Distribution (OOD) detection. Many state-of-the-art OOD methods employ an auxiliary dataset as a surrogate for OOD data during training to achieve improved performance. However, these methods fail to fully exploit the local information embedded in the auxiliary dataset. In this work, we propose the idea of leveraging the information embedded in the gradient of the loss function during training to enable the network to not only learn a desired OOD score for each sample but also to exhibit similar behavior in a local neighborhood around each sample data point. We also develop a novel energy-based sampling method to allow the network to be exposed to more informative OOD samples during the training phase. This is especially important when the auxiliary dataset is large. We demonstrate the effectiveness of our method through extensive experiments on several OOD benchmarks, improving the existing state-of-the-art FPR95 by 4% on our ImageNet experiment. We further provide a theoretical analysis through the lens of certified robustness and Lipschitz analysis to showcase the theoretical foundation of our work. We will publicly release our code after the review process.


# 45
Strong Double Blind
SLIM: Spuriousness Mitigation with Minimal Human Annotations

Xiwei Xuan · Ziquan Deng · Hsuan-Tien Lin · Kwan-Liu Ma

Recent studies highlight that deep learning models often learn spurious features mistakenly linked to labels, compromising their reliability in real-world scenarios where such correlations do not hold. Despite the increasing research effort, existing solutions often face two main challenges: they either demand substantial annotations of spurious attributes, or they yield less competitive outcomes with expensive training when additional annotations are absent. In this paper, we introduce SLIM, a cost-effective and performance-targeted approach to reducing spurious correlations in deep learning. Our method leverages a human-in-the-loop protocol featuring a novel attention labeling mechanism with a constructed attention representation space. SLIM significantly reduces the need for exhaustive additional labeling, requiring human input for fewer than 3% of instances. By prioritizing data quality over complicated training strategies, SLIM curates a smaller yet more feature-balanced data subset, fostering the development of spuriousness-robust models. Experimental validations across key benchmarks demonstrate that SLIM competes with or exceeds the performance of leading methods while significantly reducing costs. The SLIM framework thus presents a promising path for developing reliable models more efficiently.


# 118
Strong Double Blind
Modeling Label Correlations with Latent Context for Multi-Label Recognition

Zhao-Min Chen · Quan Cui · Ruoxi Deng · Jie Hu · Guodao Zhang

Label dependencies have been widely studied in multi-label image recognition for improving performance. Previous methods mainly considered label co-occurrences as label correlations. In this paper, we show that label co-occurrences may be insufficient to represent label correlations, and that modeling label correlations relies on latent context information. To this end, we propose a latent context embedding information network for multi-label image recognition. Our proposal is straightforward and contains three key modules that correspondingly tackle three questions, i.e., where to locate the latent context information, how to utilize the latent context information, and how to model label correlations with context-aware features. First, the multi-level context feature fusion module fuses the multi-level feature pyramids to obtain sufficient latent context information. Second, the latent context information embedding module aggregates the latent context information into categorical features, so that label correlations can be directly established. Moreover, we use the label correlation capturing module to model label correlations in both full and partial manners. Comprehensive experiments validate the correctness of our arguments and the effectiveness of our method. In both generic multi-label classification and partial-label multi-label classification, our proposed method consistently achieves promising results.


# 70
Strong Double Blind
Rebalancing Using Estimated Class Distribution for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch

Taemin Park · Hyuck Lee · Heeyoung Kim

Despite significant advancements in class-imbalanced semi-supervised learning (CISSL), many existing algorithms explicitly or implicitly assume that the class distribution of unlabeled data matches that of labeled data. However, when this assumption fails in practice, the classification performance of such algorithms may degrade due to incorrectly assigned weight to each class during training. We propose a novel CISSL algorithm called Rebalancing Using Estimated Class Distribution (RECD). RECD estimates the unknown class distribution of unlabeled data through Monte Carlo approximation, leveraging predicted class probabilities for unlabeled samples, and subsequently rebalances the classifier based on the estimated class distribution. Additionally, we propose an extension of feature clusters compression in the context of CISSL to mitigate feature map imbalance by densifying minority class clusters. Experimental results on four benchmark datasets demonstrate that RECD achieves state-of-the-art classification performance in CISSL.
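
As a hedged illustration of the estimation step, the snippet below approximates the unlabeled class distribution by averaging predicted class probabilities over the unlabeled set and then applies a logit-adjustment-style rebalancing; the adjustment rule is a common stand-in and may differ from the paper's exact classifier rebalancing.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_class_distribution(model, unlabeled_loader, num_classes: int,
                                device: str = "cuda") -> torch.Tensor:
    """Monte Carlo estimate of the unlabeled class distribution:
    average the predicted class probabilities over the unlabeled set."""
    totals = torch.zeros(num_classes, device=device)
    n = 0
    for batch in unlabeled_loader:           # assumes batches of (images, ...) or plain images
        x = batch[0] if isinstance(batch, (list, tuple)) else batch
        probs = F.softmax(model(x.to(device)), dim=1)
        totals += probs.sum(dim=0)
        n += x.size(0)
    return totals / n

def rebalanced_logits(logits: torch.Tensor, est_dist: torch.Tensor,
                      tau: float = 1.0) -> torch.Tensor:
    """One simple way to rebalance a classifier with the estimated distribution
    (logit adjustment); the paper's exact rebalancing rule may differ."""
    return logits - tau * torch.log(est_dist + 1e-12)

# usage with a hypothetical model and loader:
# est = estimate_class_distribution(model, unlabeled_loader, num_classes=10)
# adjusted = rebalanced_logits(model(x), est)
```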


# 64
Strong Double Blind
Foster Adaptivity and Balance in Learning with Noisy Labels

Mengmeng Sheng · Zeren Sun · Tao Chen · Shuchao Pang · yucheng wang · Yazhou Yao

Label noise is ubiquitous in real-world scenarios, posing a practical challenge to supervised models due to its detrimental effect on the generalization performance of deep neural networks. Existing methods primarily employ the sample selection paradigm and usually rely on dataset-dependent prior knowledge (e.g., a pre-defined threshold) to cope with label noise, inevitably degrading adaptivity. Moreover, existing methods tend to neglect class balance when selecting samples, leading to biased model performance. To this end, we propose a simple yet effective approach named SED to deal with label noise in a Self-adaptivE and class-balanceD manner. Specifically, we first design a novel sample selection strategy to empower self-adaptivity and class balance when identifying clean and noisy data. A mean-teacher model is then employed to correct labels of noisy samples. Subsequently, we propose a self-adaptive and class-balanced sample re-weighting mechanism to assign different weights to detected noisy samples. Finally, we additionally employ consistency regularization on selected clean samples to improve model generalization performance. Extensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed method.


# 53
Strong Double Blind
Self-Guided Generation of Minority Samples Using Diffusion Models

Soobin Um · Jong Chul Ye

We present a novel approach for generating minority samples that lie in low-density regions of a data manifold. Our framework is built upon diffusion models, leveraging the principle of guided sampling that incorporates an arbitrary energy-based guidance during inference time. The key defining feature of our sampler lies in its self-contained nature, i.e., it is implementable solely with a pretrained model. This distinguishes our sampler from existing techniques that require expensive additional components (like external classifiers) for minority generation. Specifically, we first estimate the likelihood of features within an intermediate latent sample by evaluating a reconstruction loss w.r.t. its posterior mean. The generation then proceeds with the minimization of the estimated likelihood, thereby encouraging the emergence of minority features in the latent samples of subsequent timesteps. To further improve the performance of our sampler, we provide several time-scheduling techniques that properly manage the influence of guidance over inference steps. Experiments on benchmark real datasets demonstrate that our approach can greatly improve the capability of creating realistic low-likelihood minority instances over existing techniques without relying on costly additional elements.
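
To show where such guidance plugs into sampling, here is a generic loss-guided reverse-diffusion nudge for an epsilon-prediction model: a differentiable proxy loss is evaluated on the posterior-mean estimate and its gradient steers the latent. The paper's specific reconstruction-based likelihood proxy and time scheduling are not reproduced; `eps_model`, `guidance_loss`, and the scale are placeholders.

```python
import torch

def guided_denoise_step(eps_model, x_t, t, alpha_bar_t, guidance_loss, scale=1.0):
    """Apply one guidance nudge before the ordinary DDPM/DDIM update.
    `guidance_loss` maps the posterior-mean estimate x0_hat to a scalar."""
    alpha_bar_t = torch.as_tensor(alpha_bar_t, dtype=x_t.dtype, device=x_t.device)
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    # Tweedie / posterior-mean estimate of the clean sample x0
    x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
    loss = guidance_loss(x0_hat)                 # e.g., a reconstruction-based likelihood proxy
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - scale * grad).detach()         # move the latent against the guidance loss
```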


# 62
Strong Double Blind
Self-Cooperation Knowledge Distillation for Novel Class Discovery

Yuzheng Wang · Zhaoyu Chen · Dingkang Yang · Yunquan Sun · Lizhe Qi

Novel Class Discovery (NCD) aims to discover unknown and novel classes in an unlabeled set by leveraging knowledge already learned about known classes. Existing works focus on instance-level or class-level knowledge representation and build a shared representation space to achieve performance improvements. However, a long-neglected issue is the potentially imbalanced number of samples from known and novel classes, which pushes the model towards dominant classes. Therefore, these methods suffer from a challenging trade-off between reviewing known classes and discovering novel classes. Based on this observation, we propose a Self-Cooperation Knowledge Distillation (SCKD) method that utilizes each training sample (whether known or novel, labeled or unlabeled) for both review and discovery. Specifically, the model's feature representations of known and novel classes are used to construct two disjoint representation spaces. Through spatial mutual information, we design a self-cooperation learning scheme that encourages the model to learn from both of its own feature representation spaces. Extensive experiments on six datasets demonstrate that our method can greatly enhance baseline performance, achieving competitive performance.


# 72
Strong Double Blind
Non-Exemplar Domain Incremental Learning via Cross-Domain Concept Integration

Qiang Wang · Yuhang He · Songlin Dong · XINYUAN GAO · Shaokun Wang · Yihong Gong

Existing approaches to Domain Incremental Learning (DIL) address catastrophic forgetting by storing and rehearsing exemplars from old domains. However, exemplar-based solutions are not always viable due to data privacy concerns or storage limitations. Therefore, Non-Exemplar Domain Incremental Learning (NEDIL) has emerged as a significant paradigm for resolving DIL challenges. Current NEDIL solutions extend the classifier incrementally for new domains to learn new knowledge, but unrestricted extension within the same feature space leads to inter-class confusion. To tackle this issue, we propose a simple yet effective method through cross-domain concePt INtegrAtion (PINA). We train a Unified Classifier (UC) as a concept container across all domains. Then, a Domain Specific Alignment (DSA) module is trained for each incremental domain, aligning the feature distribution to the base domain. During inference, we introduce a Patch Shuffle Selector (PSS) to select appropriate parameters of DSA for test images. Our developed patch shuffling technique disrupts class-dependent information, outperforming the domain selectors based on K-Nearest Neighbors or Nearest Mean Classifier. Extensive experiments demonstrate that our method achieves state-of-the-art performance while reducing the number of additional parameters. The source code will be released in http://XXX.XXX.XX.


# 58
Strong Double Blind
Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams

Ziqiang Wang · Zhixiang Chi · Yanan Wu · Li Gu · Zhi Liu · Konstantinos Plataniotis · Yang Wang

Given a model trained on source data, Test-Time Adaptation (TTA) enables adaptation and inference in test data streams with domain shifts from the source. Current methods predominantly optimize the model for each incoming test data batch using a self-training loss. While these methods yield commendable results in ideal test data streams, where batches are independently and identically sampled from the target distribution, they falter under more practical test data streams that are not independent and identically distributed (non-i.i.d.). The data batches in a non-i.i.d. stream display prominent label shifts relative to each other, which leads to conflicting optimization objectives among batches during the TTA process. Given the inherent risks of adapting the source model to unpredictable test-time distributions, we reverse the adaptation process and propose a novel Distribution Alignment loss for TTA. This loss guides the distributions of test-time features back towards the source distributions, which ensures compatibility with the well-trained source model and eliminates the pitfalls associated with conflicting optimization objectives. Moreover, we devise a domain shift detection mechanism to extend the success of our proposed TTA method to continual domain shift scenarios. Our extensive experiments validate the logic and efficacy of our method. On six benchmark datasets, we surpass existing methods in non-i.i.d. scenarios and maintain competitive performance under the ideal i.i.d. assumption.
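
A minimal sketch of the alignment idea, assuming pre-computed source feature statistics: the test batch's feature moments are pulled back towards the stored source moments. The simple squared-moment form below is an illustrative choice; the paper's exact distribution-alignment loss may differ.

```python
import torch

def distribution_alignment_loss(test_feats: torch.Tensor,
                                src_mean: torch.Tensor,
                                src_var: torch.Tensor) -> torch.Tensor:
    """Moment matching between test-batch features and stored source statistics."""
    mu = test_feats.mean(dim=0)
    var = test_feats.var(dim=0, unbiased=False)
    return ((mu - src_mean) ** 2).mean() + ((var - src_var) ** 2).mean()

# test-time adaptation step with a hypothetical feature extractor and optimizer:
# feats = feature_extractor(x_test)
# loss = distribution_alignment_loss(feats, src_mean, src_var)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```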


# 69
Few-shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt

Chenxi Liu · Zhenyi Wang · Tianyi Xiong · Ruibo Chen · Yihan Wu · junfeng guo · Heng Huang

Few-Shot Class-Incremental Learning (FSCIL) models aim to incrementally learn new classes with scarce samples while preserving knowledge of old ones. Existing FSCIL methods usually fine-tune the entire backbone, leading to overfitting and hindering the potential to learn new classes. On the other hand, recent prompt-based CIL approaches alleviate forgetting by training prompts with sufficient data in each task. In this work, we propose a novel framework named Attention-aware Self-adaptive Prompt (ASP). ASP encourages task-invariant prompts to capture shared knowledge by reducing specific information from the attention aspect. Additionally, self-adaptive task-specific prompts in ASP provide specific information and transfer knowledge from old classes to new classes with an Information Bottleneck learning objective. In summary, ASP prevents overfitting on base task and does not require enormous data in few-shot incremental tasks. Extensive experiments on three benchmark datasets validate that ASP consistently outperforms state-of-the-art FSCIL and prompt-based CIL methods in term of both learning new classes and mitigating forgetting.


# 68
Strong Double Blind
Exemplar-free Continual Representation Learning via Learnable Drift Compensation

Alex Gomez-Villa · Dipam Goswami · Kai Wang · Andrew Bagdanov · Bartlomiej Twardowski · Joost Van de Weijer

Exemplar-free class-incremental learning using a backbone trained from scratch and starting from a small first task presents a significant challenge for continual representation learning. Prototype-based approaches, when continually updated, face the critical issue of semantic drift, whereby old class prototypes drift to different positions in the new feature space. Through an analysis of forgetting in prototype-based continual learning, we show that forgetting is not due to diminished discriminative power of the feature extractor, and can potentially be corrected by drift compensation. To address this, we propose a novel feature drift correction method called Learnable Drift Compensation (LDC). LDC can effectively mitigate drift in any moving backbone, whether supervised or unsupervised. LDC is fast and straightforward to integrate on top of existing continual learning approaches. Finally, we showcase how LDC can be applied in combination with self-supervised CL methods, resulting in the first exemplar-free semi-supervised continual learning approach. We achieve state-of-the-art performance in both supervised and semi-supervised settings across multiple datasets.
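
One way to picture learnable drift compensation, sketched under assumptions (a single linear projector, an MSE fit on current-task data): fit a small mapping from old-backbone features to new-backbone features, then apply it to the stored prototypes.

```python
import torch
import torch.nn as nn

def learn_drift_compensation(old_backbone, new_backbone, loader,
                             feat_dim: int, epochs: int = 5, device: str = "cuda"):
    """Fit a projector mapping old-backbone features to new-backbone features
    on current-task data; the fitted projector can then move stored class
    prototypes into the new feature space."""
    projector = nn.Linear(feat_dim, feat_dim).to(device)
    opt = torch.optim.Adam(projector.parameters(), lr=1e-3)
    old_backbone.eval()
    new_backbone.eval()
    for _ in range(epochs):
        for batch in loader:                 # assumes batches of (images, ...) or plain images
            x = (batch[0] if isinstance(batch, (list, tuple)) else batch).to(device)
            with torch.no_grad():
                f_old = old_backbone(x)
                f_new = new_backbone(x)
            loss = ((projector(f_old) - f_new) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return projector

# compensating stored prototypes afterwards:
# with torch.no_grad():
#     prototypes = projector(prototypes)
```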


# 74
Strong Double Blind
Open-World Dynamic Prompt and Continual Visual Representation Learning

Youngeun Kim · Jun Fang · Qin Zhang · Zhaowei Cai · Yantao Shen · Rahul Duggal · Dripta S. Raychaudhuri · Zhuowen Tu · Yifan Xing · Onkar Dabeer

The open world is inherently dynamic, characterized by its ever-evolving concepts and distributions. Continual learning (CL) in this dynamic open-world environment, where knowledge must be continuously acquired from data streams without forgetting, presents a significant challenge. Existing CL methods, whether rehearsal-free or rehearsal-based, often struggle to effectively generalize to unseen test-time classes in this open-world context. To address this challenge, we introduce a new practical CL setting tailored for open-world visual representation learning. In this setting, subsequent data streams systematically introduce novel classes that are disjoint from the classes seen in previous training phases, all the while remaining distinct from the unseen test classes. In response, we introduce Dynamic Prompt and Representation Learner (DPaRL), a simple yet effective Prompt-based CL (PCL) method. Our DPaRL learns to generate dynamic prompts for inference, as opposed to relying on a static prompt pool in previous PCL methods. In addition, DPaRL jointly learns the dynamic prompt generation and discriminative representation at each training stage whereas prior PCL methods only refine the prompt learning throughout the process. Our experimental results demonstrate the superiority of our approach, surpassing state-of-the-art methods on well-established open-world image retrieval benchmarks by an average of 4.7% improvement in Recall@1 performance.


# 42
Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks

MohammadReza Davari · Eugene Belilovsky

The rapid development of AI systems has been greatly influenced by the emergence of foundation models. A common approach for targeted problems involves fine-tuning these pre-trained foundation models for specific target tasks, resulting in a rapid spread of models fine-tuned across a diverse array of tasks. This work focuses on the problem of merging multiple fine-tunings of the same foundation model derived from a spectrum of auxiliary tasks. We introduce a new, simple method, Model Breadcrumbs, which consists of a sparsely defined set of weights that carve out a trajectory within the weight space of a pre-trained model, enhancing task performance when traversed. These breadcrumbs are constructed by taking the difference between the model's weights before and after fine-tuning, followed by a sparsification process that eliminates weight outliers and negligible perturbations. Our experiments demonstrate the effectiveness of Model Breadcrumbs in simultaneously improving performance across multiple tasks. This contribution aligns with the evolving paradigm of updatable machine learning, reminiscent of the collaborative principles underlying open-source software development, fostering a community-driven effort to reliably update machine learning models. Our method is shown to be more efficient and, unlike previous proposals, does not require hyperparameter tuning for each new task added. Through extensive experimentation involving various models, tasks, and modalities, we establish that integrating Model Breadcrumbs offers a simple, efficient, and highly effective approach for constructing multi-task models and facilitating updates to foundation models.
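
A small sketch of the construct-and-merge recipe over state dicts, with illustrative (not the paper's) percentile thresholds and merge coefficient: compute the weight differences, mask out both the largest-magnitude outliers and the tiniest perturbations, and add the surviving sparse deltas to the pre-trained weights.

```python
import torch

def breadcrumb(pre: dict, finetuned: dict,
               top_pct: float = 0.01, bottom_pct: float = 0.85) -> dict:
    """Per-tensor task difference with magnitude outliers and negligible
    entries masked out (percentiles are illustrative)."""
    crumbs = {}
    for k in pre:
        delta = finetuned[k] - pre[k]
        mag = delta.abs().flatten()
        lo = torch.quantile(mag, bottom_pct)           # drop tiny perturbations
        hi = torch.quantile(mag, 1.0 - top_pct)        # drop outliers
        mask = (delta.abs() >= lo) & (delta.abs() <= hi)
        crumbs[k] = delta * mask
    return crumbs

def merge_breadcrumbs(pre: dict, crumb_list, alpha: float = 0.3) -> dict:
    """Traverse the pre-trained weights along the summed sparse trajectories."""
    merged = {k: v.clone() for k, v in pre.items()}
    for crumbs in crumb_list:
        for k in merged:
            merged[k] += alpha * crumbs[k]
    return merged

# usage with hypothetical state dicts:
# crumbs_a = breadcrumb(pretrained_sd, finetuned_a_sd)
# crumbs_b = breadcrumb(pretrained_sd, finetuned_b_sd)
# multitask_sd = merge_breadcrumbs(pretrained_sd, [crumbs_a, crumbs_b])
```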


# 48
Strong Double Blind
Simple Unsupervised Knowledge Distillation With Space Similarity

Aditya Singh · Haohan Wang

According to recent studies, self-supervised learning (SSL) does not readily extend to smaller architectures. One direction to mitigate this shortcoming, while simultaneously training a smaller network without labels, is to adopt unsupervised knowledge distillation (UKD). Existing UKD approaches handcraft preservation-worthy inter-/intra-sample relationships between the teacher and its student. However, this may overlook other key relationships present in the mapping of a teacher. In this paper, instead of heuristically constructing preservation-worthy relationships between samples, we directly motivate the student to model the teacher's embedding manifold. If the mapped manifold is similar, all inter-/intra-sample relationships are indirectly conserved. We first demonstrate that prior methods cannot preserve the teacher's latent manifold due to their sole reliance on $L_2$-normalised embedding features. Subsequently, we propose a simple objective to capture the information lost due to normalisation. Our proposed loss component, termed space similarity, motivates each dimension of a student's feature space to be similar to the corresponding dimension of its teacher. We perform extensive experiments demonstrating the strong performance of our proposed approach on various benchmarks. We will be releasing the source code to replicate our results along with weights for the trained models.
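
A minimal sketch of a dimension-wise space-similarity term, assuming the student and teacher share a feature dimension (otherwise a projector would be needed): each feature dimension is normalized across the batch and compared with its teacher counterpart. This is an interpretation of the stated idea, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def space_similarity_loss(student_feats: torch.Tensor,
                          teacher_feats: torch.Tensor) -> torch.Tensor:
    """Compare the embedding spaces dimension-wise: each column (one feature
    dimension over the batch) of the student should align with the
    corresponding teacher column."""
    s = F.normalize(student_feats, dim=0)     # normalize each dimension across the batch
    t = F.normalize(teacher_feats, dim=0)
    return 1.0 - (s * t).sum(dim=0).mean()    # 1 - mean per-dimension cosine similarity

# typically combined with an ordinary feature-distillation term, e.g.:
# loss = feature_cosine_loss(student_feats, teacher_feats) + lam * space_similarity_loss(student_feats, teacher_feats)
```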


# 34
Strong Double Blind
AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition

Fadi Boutros · Vitomir Struc · Naser Damer

Knowledge distillation (KD) aims at improving the performance of a compact student model by distilling the knowledge from a high-performing teacher model. In this paper, we present an adaptive KD approach, namely AdaDistill, for deep face recognition. The proposed AdaDistill embeds the KD concept into softmax loss by training the student using a margin penalty softmax loss with distilled class centers from the teacher. Being aware of the relatively low capacity of the compact student model, we propose to distill relatively less complex knowledge at an early stage of training and more complex ones at a later stage of training. This relative adjustment of the distilled knowledge is controlled by the progression of the learning capability of the student over the training iterations without the need to tune any hyper-parameters. Extensive experiments and ablation studies prove that AdaDistill can enhance the discriminative learning capability of the student and demonstrate superiority over various state-of-the-art competitors on several challenging benchmarks such as IJB-B, IJB-C, and ICCV2021-MFR.


# 61
Dataset Growth

Ziheng Qin · zhaopan xu · YuKun Zhou · Kai Wang · Zangwei Zheng · Zebang Cheng · Hao Tang · Lei Shang · Baigui Sun · Radu Timofte · Xiaojiang Peng · Hongxun Yao · Yang You

Deep learning benefits from the growing abundance of available data. Meanwhile, efficiently dealing with the growing data scale has become a challenge. Publicly available data come from different sources with varying quality, and manual cleaning against noise and redundancy is impractical at today's data scale. There are existing techniques for cleaning and selecting the collected data; however, they are mainly designed for offline settings and target either the cleanliness or the redundancy problem. In practice, data grow exponentially and exhibit both problems, which leads to repeated data curation with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, resulting in a growing dataset that keeps up to date with awareness of cleanliness and diversity. InfoGrowth can improve data quality/efficiency on both single-modal and multi-modal tasks, with an efficient and scalable design. Its framework makes it practical for real-world data engines.


# 51
Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation

Haizhong Zheng · Jiachen Sun · Shutong Wu · Bhavya Kailkhura · Zhuoqing Morley Mao · Chaowei Xiao · Prakash Atul

Given a real-world dataset, data condensation (DC) aims to synthesize a small synthetic dataset that captures the knowledge of a natural dataset while being usable for training models with comparable accuracy. Recent works propose to enhance DC with data parameterization, which condenses data into very compact parameterized data containers instead of images. By optimizing with an appropriate loss function, data parameterization methods can generate high-quality synthetic datasets and achieve improved model performance. Top-performing data parameterization methods use GPU memory intensive trajectory-based losses in their optimization. In this paper, we propose a novel data parameterization architecture, Hierarchical Memory Network (HMN), that achieves comparable or better performance to SOTA methods, even with a GPU memory friendly batch-based loss function. HMN's key insight is to directly capture sharing of features at both within-class level and across-class level by proposing a hierarchical parameterized architecture. We evaluate HMN on five public datasets and show that HMN outperforms current baselines (including those using trajectory-based losses), even when HMNs are trained with a GPU-friendly batch-based loss.


# 39
Strong Double Blind
MO-EMT-NAS: Multi-Objective Continuous Transfer of Architectural Knowledge Between Tasks from Different Datasets

Peng Liao · Xilu Wang · Yaochu Jin · Wenli Du

Deploying models across diverse devices demands tradeoffs among multiple objectives due to different resource constraints. Arguably, due to the small model trap problem in multi-objective neural architecture search (MO-NAS) based on a supernet, existing approaches may fail to maintain large models. Moreover, multi-tasking neural architecture search (MT-NAS) excels in handling multiple tasks simultaneously, but most existing efforts focus on tasks from the same dataset, limiting their practicality in real-world scenarios where multiple tasks may come from distinct datasets. To tackle the above challenges, we propose a Multi-Objective Evolutionary Multi-Tasking framework for NAS (MO-EMT-NAS) to achieve architectural knowledge transfer across tasks from different datasets while finding Pareto optimal architectures for multi-objectives, model accuracy and computational efficiency. To alleviate the small model trap issue, we introduce an auxiliary objective that helps maintain multiple larger models of similar accuracy. Moreover, the computational efficiency is further enhanced by parallelizing the training and validation of the weight-sharing-based supernet. Experimental results on seven datasets with two, three, and four task combinations show that MO-EMT-NAS achieves a better minimum classification error while being able to offer flexible trade-offs between model performance and complexity, compared to the state-of-the-art single-objective MT-NAS algorithms. In addition, the runtime of MO-EMT-NAS is reduced by 59.7% to 77.7%, compared to the corresponding multi-objective single-task approaches.


# 28
BAFFLE: A Baseline of Backpropagation-Free Federated Learning

Haozhe Feng · Tianyu Pang · Chao Du · Wei Chen · Shuicheng Yan · Min Lin

Federated learning (FL) is a general principle for decentralized clients to train a server model collectively without sharing local data. FL is a promising framework with practical applications, but its standard training paradigm requires the clients to backpropagate through the model to compute gradients. Since these clients are typically edge devices and not fully trusted, executing backpropagation on them incurs computational and storage overhead as well as white-box vulnerability. In light of this, we develop backpropagation-free federated learning, dubbed BAFFLE, in which backpropagation is replaced by multiple forward processes to estimate gradients. BAFFLE is 1) memory-efficient and easily fits uploading bandwidth; 2) compatible with inference-only hardware optimization and model quantization or pruning; and 3) well-suited to trusted execution environments, because the clients in BAFFLE only execute forward propagation and return a set of scalars to the server. Empirically we use BAFFLE to train deep models from scratch or to finetune pretrained models, achieving acceptable results.
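
To make "backpropagation replaced by multiple forward processes" concrete, here is a generic zeroth-order (finite-difference) gradient estimator built only from forward passes; the number of probes, perturbation scale, and central-difference form are illustrative choices rather than BAFFLE's exact protocol.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def forward_only_gradient(model: nn.Module, loss_fn, x, y,
                          num_probes: int = 20, sigma: float = 1e-3):
    """Estimate parameter gradients with forward passes only, using central
    finite differences along random Gaussian directions."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = [torch.zeros_like(p) for p in params]
    for _ in range(num_probes):
        probes = [torch.randn_like(p) for p in params]
        for p, u in zip(params, probes): p.add_(sigma * u)          # theta + sigma*u
        loss_plus = loss_fn(model(x), y)
        for p, u in zip(params, probes): p.add_(-2.0 * sigma * u)   # theta - sigma*u
        loss_minus = loss_fn(model(x), y)
        for p, u in zip(params, probes): p.add_(sigma * u)          # restore theta
        coeff = (loss_plus - loss_minus) / (2.0 * sigma * num_probes)
        for g, u in zip(grads, probes): g.add_(coeff * u)
    return grads

# SGD-style update with the estimates (learning rate is illustrative):
# grads = forward_only_gradient(model, criterion, x, y)
# with torch.no_grad():
#     for p, g in zip((p for p in model.parameters() if p.requires_grad), grads):
#         p -= 0.01 * g
```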


# 38
Strong Double Blind
On the Evaluation Consistency of Attribution-based Explanations

Jiarui Duan · Haoling Li · Haofei Zhang · Hao Jiang · Mengqi Xue · Li Sun · Mingli Song · Jie Song

Attribution-based explanations are garnering increasing attention and have emerged as the predominant approach towards eXplainable Artificial Intelligence (XAI). However, the absence of consistent configurations and systematic investigations in prior literature impedes comprehensive evaluations of existing methodologies. In this work, we introduce Meta-Rank, an open platform for benchmarking attribution methods in the image domain. Presently, Meta-Rank assesses eight exemplary attribution methods using six renowned model architectures on four diverse datasets, employing both the Most Relevant First (MoRF) and Least Relevant First (LeRF) evaluation protocols. Through extensive experimentation, our benchmark reveals three insights into attribution evaluation endeavors: 1) evaluating attribution methods under disparate settings can yield divergent performance rankings; 2) although inconsistent across numerous cases, the performance rankings exhibit remarkable consistency across distinct checkpoints along the same training trajectory; 3) prior attempts at consistent evaluation fare no better than baselines when extended to more heterogeneous models and datasets. Our findings underscore the necessity for future research in this domain to conduct rigorous evaluations encompassing a broader range of models and datasets, and to reassess the assumptions underlying the empirical success of different attribution methods. Code, models, and datasets will be made publicly available in the near future.


# 24
Debiasing surgeon: fantastic weights and how to find them

Remi Nahon · Ivan Luiz De Moura Matos · Van-Tam Nguyen · Enzo Tartaglione

The emergence of algorithmic biases that can lead to unfair models is an ever-growing concern. Several debiasing approaches have been proposed in the realm of deep learning, employing more or less sophisticated techniques to discourage models from massively exploiting these biases. However, a question arises: is this extra complexity really necessary? Does a vanilla-trained model already embody some "unbiased sub-networks" that can be used in isolation and provide a solution without relying on the algorithmic biases? In this work, we show that such a sub-network typically exists and can be extracted from a vanilla-trained model without requiring additional training. We further validate that such an extracted architecture is incapable of learning a specific bias, suggesting that there are possible architectural countermeasures to the problem of biases in deep neural networks.


# 47
Strong Double Blind
Auto-GAS: Automated Proxy Discovery for Training-free Generative Architecture Search

Lujun Li · Haosen SUN · Shiwen Li · Peijie Dong · Wenhan Luo · Wei Xue · Qifeng Liu · Yike Guo

In this paper, we introduce Auto-GAS, the first training-free Generative Architecture Search (GAS) framework enabled by an auto-discovered proxy. Generative models like Generative Adversarial Networks (GANs) are now widely used in many real-time applications. Previous GAS methods use differentiable or evolutionary search to find optimal GAN generators for fast inference and memory efficiency. However, the high computational overhead of these training-based GAS techniques limits their adoption. To improve search efficiency, we explore training-free GAS but find existing zero-cost proxies designed for classification tasks underperform on generation benchmarks. To address this challenge, we develop a custom proxy search framework tailored for GAS tasks to enhance predictive power. Specifically, we construct an information-aware proxy that takes feature statistics as inputs and utilizes advanced transform, encoding, reduction, and augment operations to represent candidate proxies. Then, we employ an evolutionary algorithm to perform crossover and mutation on superior candidates within the population based on correlation evaluation. Finally, we perform generator search without training using the optimized proxy. Thus, Auto-GAS enables automated proxy discovery for GAS while significantly accelerating the search before training stage. Extensive experiments on image generation and image-to-image translation tasks demonstrate that Auto-GAS strikes superior accuracy-speed tradeoffs over state-of-the-art methods. Remarkably, Auto-GAS achieves competitive scores with 110$\times$ faster search than GAN Compression. Codes are available in the Appendix.


# 31
Improving Adversarial Transferability via Model Alignment

Avery Ma · Amir-massoud Farahmand · Yangchen Pan · Philip Torr · Jindong Gu

Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability in generating transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence in the predictions between the source model and another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric analysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
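
A compact sketch of the alignment objective described above, with KL divergence as one natural choice of divergence (the paper does not fix the exact form here); model names and the temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def alignment_loss(source_logits: torch.Tensor,
                   witness_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    """Divergence between source and witness predictions; minimizing it
    aligns the source model with the independently trained witness."""
    log_p_src = F.log_softmax(source_logits / temperature, dim=1)
    p_wit = F.softmax(witness_logits / temperature, dim=1)
    return F.kl_div(log_p_src, p_wit, reduction="batchmean")

# alignment fine-tuning step with hypothetical models, optimizer, and data:
# with torch.no_grad():
#     w_logits = witness_model(x)
# loss = alignment_loss(source_model(x), w_logits)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```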


# 50
Strong Double Blind
Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation

Bochao Liu · Pengju Wang · Shiming Ge

While the success of deep learning relies on large amounts of training data, data is often limited in privacy-sensitive domains. To address this challenge, generative model learning with differential privacy has emerged as a solution for training private generative models for desensitized data generation. However, the quality of the images generated by existing methods is limited due to the complexity of modeling the data distribution. We build on the success of diffusion models and introduce DP-SAD, which trains a private diffusion model by a stochastic adversarial distillation method. Specifically, we first train a diffusion model as a teacher and then train a student by distillation, in which we achieve differential privacy by adding noise to the gradients passed from the other models to the student. For better generation quality, we introduce a discriminator to distinguish whether an image is from the teacher or the student, which forms the adversarial training. Extensive experiments and analysis clearly demonstrate the effectiveness of our proposed method.
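
The gradient-noising step can be pictured with a DP-SGD-style update, shown below as a simplified sketch: clip the student's gradients and add calibrated Gaussian noise before stepping. Proper per-sample clipping and privacy accounting, which real DP training requires, are deliberately omitted, and the clip norm and noise multiplier are placeholders.

```python
import torch

def dp_noisy_step(student, loss, optimizer,
                  clip_norm: float = 1.0, noise_multiplier: float = 1.0):
    """One simplified DP-SGD-style update for the student model:
    clip the gradient norm, add Gaussian noise, then step."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(student.parameters(), clip_norm)
    with torch.no_grad():
        for p in student.parameters():
            if p.grad is not None:
                p.grad += noise_multiplier * clip_norm * torch.randn_like(p.grad)
    optimizer.step()
```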


# 26
Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures

Sayanton Vhaduri Dibbo · Adam Breuer · Juston Moore · Michael Teti

Recent model inversion attack algorithms permit adversaries to reconstruct a neural network's private training data just by repeatedly querying the network and inspecting its outputs. In this work, we develop a novel network architecture that leverages sparse-coding layers to obtain superior robustness to this class of attacks. Three decades of computer science research has studied sparse coding in the context of image denoising, object recognition, and adversarial misclassification settings, but to the best of our knowledge, its connection to state-of-the-art privacy vulnerabilities remains unstudied. However, sparse coding architectures suggest an advantageous means to defend against model inversion attacks because they allow us to control the amount of irrelevant private information encoded in a network's intermediate representations in a manner that can be computed efficiently during training and that is known to have little effect on classification accuracy. Specifically, compared to networks trained with a variety of state-of-the-art defenses, our sparse-coding architectures maintain comparable or higher classification accuracy while degrading state-of-the-art training data reconstructions by factors of 1.1 to 18.3 across a variety of reconstruction quality metrics (PSNR, SSIM, FID). This performance advantage holds across 5 datasets ranging from CelebA faces to medical images and MNIST, and across various state-of-the-art SGD-based and GAN-based inversion attacks, including Plug-and-Play attacks. We provide a comprehensive performance evaluation codebase to encourage further research in this regard.


# 46
Strong Double Blind
CipherDM: Secure Three-Party Inference for Diffusion Model Sampling

Xin Zhao · Xiaojun Chen · Xudong Chen · He Li · Tingyu Fan · Zhendong Zhao

Diffusion Models (DMs) achieve state-of-the-art synthesis results in image generation and have been applied to various fields. However, DMs sometimes seriously violate user privacy during usage, making the protection of privacy an urgent issue. Using traditional privacy computing schemes like Secure Multi-Party Computation (MPC) directly in DMs faces significant computation and communication challenges. To address these issues, we propose CipherDM, the first novel, versatile and universal framework applying MPC technology to DMs for secure sampling, which can be widely implemented on multiple DM based tasks. We thoroughly analyze sampling latency breakdown, find the time-consuming part and design corresponding secure MPC protocols for computing nonlinear activations including SoftMax, SiLU and Mish. CipherDM is evaluated on popular architectures (DDPM, DDIM) using the MNIST dataset and on SD deployed by diffusers. Compared to direct implementation on SPU, our approach improves running time by approximately 1.084× ∼ 2.328×, and reduces communication costs by approximately 1.212× ∼ 1.791×. Code will be available upon paper acceptance.


# 27
Strong Double Blind
UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Siyuan Cheng · Guangyu Shen · Kaiyuan Zhang · Guanhong Tao · Shengwei An · Hanxi Guo · Shiqing Ma · Xiangyu Zhang

Deep neural networks (DNNs) have demonstrated effectiveness in various fields. However, DNNs are vulnerable to backdoor attacks, which inject a unique pattern, called a trigger, into the input to cause misclassification to an attack-chosen target label. While existing works have proposed various methods to mitigate backdoor effects in poisoned models, they tend to be less effective against recent advanced attacks. In this paper, we introduce UNIT, a novel post-training defense technique that can effectively remove backdoors for a variety of attacks. Specifically, UNIT approximates a unique and tight activation distribution for each neuron in the model. It then proactively dispels substantially large activation values that exceed the approximated boundaries. Our experimental results demonstrate that UNIT outperforms 9 popular defense methods against 14 existing backdoor attacks, including 2 advanced attacks, using only 5% of clean training data.
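
As a hedged illustration of per-neuron tightening (not UNIT's actual bound-fitting procedure), the sketch below calibrates a per-neuron upper bound from clean activations of a (batch, features) layer using a simple quantile, then clamps anything above it at inference.

```python
import torch
import torch.nn as nn

class ActivationTightener:
    """Estimate a per-neuron upper bound from clean data and clamp activations
    that exceed it. Assumes the hooked layer outputs (batch, features)."""

    def __init__(self, layer: nn.Module, quantile: float = 0.999):
        self.quantile = quantile
        self.samples, self.bounds = [], None
        layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        if self.bounds is None:                      # calibration phase: record activations
            self.samples.append(output.detach())
            return output
        return torch.minimum(output, self.bounds)    # defense phase: clamp per neuron

    def fit(self):
        acts = torch.cat(self.samples, dim=0)
        self.bounds = torch.quantile(acts, self.quantile, dim=0)
        self.samples = []

# usage: attach to a layer, run clean data through the model, then call fit();
# subsequent forward passes clamp abnormally large (potentially trigger-driven) activations.
```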


# 284
StereoGlue: Joint Feature Matching and Robust Estimation

Daniel Barath · Dmytro Mishkin · Luca Cavalli · Paul-Edouard Sarlin · Petr Hrubý · Marc Pollefeys

We propose StereoGlue, a method designed for joint feature matching and robust estimation that effectively reduces the combinatorial complexity of these tasks using single-point minimal solvers. StereoGlue is applicable to a range of problems, including but not limited to relative pose and homography estimation, determining absolute pose with 2D-3D correspondences, and estimating 3D rigid transformations between point clouds. StereoGlue starts with a set of one-to-many tentative correspondences, iteratively forms tentative matches, and estimates the minimal sample model. This model then facilitates guided matching, leading to consistent one-to-one matches, whose number serves as the model score. StereoGlue is superior to the state-of-the-art robust estimators on real-world datasets on multiple problems, improving upon a number of recent feature detectors and matchers. Additionally, it shows improvements in point cloud matching and absolute camera pose estimation. The code will be made publicly available.


# 270
Strong Double Blind
ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency

Shaocheng Yan · Pengcheng Shi · Jiayuan Li

Recent advances in point cloud registration mostly leverage geometric information. Although these methods have yielded promising results, they still struggle with low-overlap cases, limiting their practical usage. In this paper, we propose ML-SemReg, a plug-and-play point cloud registration framework that fully exploits semantic information. Our key insight is that mismatches can be categorized into two types, i.e., inter- and intra-class, after rendering semantic clues, and can be well addressed by utilizing multi-level semantic consistency. We first propose a Semantic Cluster Matching module to address inter-class mismatching, outputting multiple matching groups that inherently satisfy Neighborhood Semantic Consistency. For each group, a Semantic Mask Matching module based on Scene Semantic Consistency is then introduced to suppress intra-class mismatching. Benefiting from these two modules, ML-SemReg generates correspondences with a high inlier ratio. Extensive experiments demonstrate the excellent performance and robustness of ML-SemReg; e.g., on hard cases of the KITTI dataset, the Registration Recall of MAC increases by almost 34 percentage points when our ML-SemReg is equipped. Code is available at https://github.com/qwAyu/ML-SemReg.