Poster Session
Poster Session 7
Exhibition Area
Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
Vitali Petsiuk · Kate Saenko
Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backdoors in them. We use the compositional property of diffusion models, which allows multiple prompts to be leveraged in a single image generation. This property allows us to combine other concepts, which should not have been affected by the inhibition, to reconstruct the vector responsible for generating the target concept, even though direct computation of this vector is no longer accessible. We provide theoretical and empirical evidence for why the proposed attacks are possible and discuss the implications of these findings for safe model deployment. We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary. Our work opens up the discussion about the implications of concept arithmetics and compositional inference for safety mechanisms in diffusion models.
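The compositional inference the abstract relies on combines several prompt-conditioned noise predictions into a single denoising step. A minimal sketch of such composed guidance is shown below; the `unet` callable, the prompt embeddings, and the weights are hypothetical placeholders, and this is not the authors' exact attack.

```python
import torch

def composed_noise_prediction(unet, x_t, t, concept_embs, weights, uncond_emb):
    """Compositional guidance: combine several concept directions into one
    denoising step, e.g. to approximate a concept whose direct prompt is blocked.
    `unet`, the embeddings, and the weights are hypothetical placeholders."""
    eps_uncond = unet(x_t, t, uncond_emb)            # unconditional prediction
    eps = eps_uncond.clone()
    for emb, w in zip(concept_embs, weights):
        eps_cond = unet(x_t, t, emb)                 # prediction for one concept prompt
        eps = eps + w * (eps_cond - eps_uncond)      # add weighted concept direction
    return eps
```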
Flatness-aware Sequential Learning Generates Resilient Backdoors
Hoang Pham · The-Anh Ta · Anh Tran · Khoa Doan
Recently, backdoor attacks have become an emerging threat to the security of machine learning models. From the adversary's perspective, the implanted backdoors should be resistant to defensive algorithms, but some recently proposed fine-tuning defenses can remove these backdoors with notable efficacy. This is mainly due to the catastrophic forgetting (CF) property of deep neural networks. This paper counters CF of backdoors by leveraging continual learning (CL) techniques. We begin by investigating the connectivity between a backdoored and fine-tuned model in the loss landscape. Our analysis confirms that fine-tuning defenses, especially the more advanced ones, can easily push a poisoned model out of the backdoor regions, making it forget all about the backdoors. Based on this finding, we re-formulate backdoor training through the lens of CL and propose a novel framework, named \textbf{S}equential \textbf{B}ackdoor \textbf{L}earning (\textbf{SBL}), that can generate resilient backdoors. This framework separates the backdoor poisoning process into two tasks: the first task learns a backdoored model, while the second task, based on the CL principles, moves it to a backdoored region resistant to fine-tuning. We additionally propose to seek flatter backdoor regions via a sharpness-aware minimizer in the framework, further strengthening the durability of the implanted backdoor. Finally, we demonstrate the effectiveness of our method through extensive empirical experiments on several benchmark datasets in the backdoor domain.
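The flatness-seeking step mentioned in the abstract builds on sharpness-aware minimization (SAM). Below is a rough, generic sketch of one SAM update (not the authors' SBL framework); `model`, `loss_fn`, `batch`, and `base_optimizer` are assumed to be supplied by the caller.

```python
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One sharpness-aware minimization (SAM) step: perturb the weights toward
    the locally worst direction, then descend from the perturbed point so the
    solution settles in a flat region of the loss landscape."""
    inputs, targets = batch

    # First pass: gradient at the current weights.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Climb to the approximate worst-case point within an L2 ball of radius rho.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12
    eps_list = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps_list.append(None)
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)                 # w <- w + e
            eps_list.append(e)
    model.zero_grad()

    # Second pass: gradient at the perturbed weights drives the actual update.
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps_list):
            if e is not None:
                p.sub_(e)             # restore the original weights
    base_optimizer.step()             # descend using the sharpness-aware gradient
    model.zero_grad()
    return loss.item()
```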
Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks
Jiacheng Cheng · Xiang Dai · Jia Wan · Nick Antipa · Nuno Vasconcelos
The problem of designing a privacy-preserving camera (PPC) is considered. Previous designs rely on a static point spread function (PSF), optimized to prevent detection of private visual information, such as recognizable facial features. However, the PSF can be easily recovered by measuring the camera response to a point light source, making these cameras vulnerable to PSF inversion attacks. A new dynamic privacy preserving (DyPP) camera design is proposed to prevent such attacks. DyPPcameras rely on dynamic optical elements, such spatial light modulators, to implement a time-varying PSF, which changes from picture to picture. PSFs are drawn randomly with a learned manifold embedding, trained adversarially to simultaneously meet user-specified targets for privacy, such as face recognition accuracy, and task utility. Empirical evaluations on multiple privacy-preserving vision tasks demonstrate that the DyPP design is significantly more robust to PSF inversion attacks than previous PPCs. Furthermore, the hardware feasibility of the approach is validated by a proof-of-concept camera model.
Adversarial Robustification via Text-to-Image Diffusion Models
Daewon Choi · Jongheon Jeong · Huiwon Jang · Jinwoo Shin
Adversarial robustness has conventionally been believed to be a challenging property to encode into neural networks, requiring plenty of training data. In the recent paradigm of adopting off-the-shelf models, however, access to their training data is often infeasible or impractical, while most such models were not originally trained with adversarial robustness in mind. In this paper, we develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data. Our intuition is to view recent text-to-image diffusion models as ``adaptable'' denoisers that can be optimized for specific target tasks. Based on this, we propose: (a) to initiate a denoise-and-classify pipeline that offers provable guarantees against adversarial attacks, and (b) to leverage a few synthetic reference images generated from the text-to-image model to enable novel adaptation schemes. Our experiments show that our data-free scheme, applied to the pre-trained CLIP, can improve the (provable) adversarial robustness of its diverse zero-shot classification derivatives (while maintaining their accuracy), significantly surpassing prior approaches that utilize the full training data. Beyond CLIP, we also demonstrate that our framework is easily applicable for efficiently robustifying other visual classifiers.
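A denoise-and-classify pipeline with provable guarantees follows the randomized-smoothing recipe: perturb with Gaussian noise, denoise, classify, and take a majority vote. The following is a prediction-only sketch of that general idea (certification omitted); `denoiser` and `classifier` are placeholders, not a specific API.

```python
import torch

def smoothed_predict(denoiser, classifier, x, sigma=0.25, n_samples=100):
    """Denoise-and-classify prediction in the spirit of randomized smoothing:
    add Gaussian noise, map it back to a clean image with a denoiser (e.g. a
    diffusion model used as a one-step denoiser), classify, and majority-vote."""
    votes = None
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)   # perturb the input
            clean = denoiser(noisy, sigma)            # project back toward the image manifold
            logits = classifier(clean)
            pred = logits.argmax(dim=-1)
            counts = torch.nn.functional.one_hot(pred, logits.shape[-1])
            votes = counts if votes is None else votes + counts
    return votes.argmax(dim=-1)                       # majority vote per input
```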
Privacy-Preserving Adaptive Re-Identification without Image Transfer
Hamza Rami · Jhony H. Giraldo · Nicolas Winckler · Stéphane Lathuiliere
Re-Identification systems (Re-ID) are crucial for public safety but face the challenge of having to adapt to environments that differ from their training distribution. Furthermore, rigorous privacy protocols in public places are being enforced as apprehensions regarding individual freedom rise, adding layers of complexity to the deployment of accurate Re-ID systems in new environments. For example, in the European Union, the principles of "Data Minimization" and "Purpose Limitation" restrict the retention and processing of images to what is strictly necessary. These regulations pose a challenge to the conventional Re-ID training schemes that rely on centralizing data on servers. In this work, we present a novel setting for privacy-preserving Distributed Unsupervised Domain Adaptation for person Re-ID (DUDA-Rid) to address the problem of domain shift without requiring any image transfer outside the camera devices. To address this setting, we introduce Fed-Protoid, a novel solution that adapts person Re-ID models directly within the edge devices. Our proposed solution employs prototypes derived from the source domain to align feature statistics within edge devices. Those source prototypes are distributed across the edge devices to minimize a distributed Maximum Mean Discrepancy (MMD) loss tailored for the DUDA-Rid setting. Our experiments provide compelling evidence that Fed-Protoid outperforms all evaluated methods in terms of both accuracy and communication efficiency, all while maintaining data privacy.
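The distributed objective centers on a Maximum Mean Discrepancy (MMD) between source prototypes and features computed on edge devices. A minimal, non-distributed sketch of a kernel MMD term (the standard biased estimator, not the paper's exact loss) looks like this:

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel between two feature
    sets, e.g. source prototypes (x) and target features on an edge device (y)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b, p=2) ** 2          # pairwise squared distances
        return torch.exp(-d2 / (2.0 * sigma ** 2))
    k_xx = kernel(x, x).mean()
    k_yy = kernel(y, y).mean()
    k_xy = kernel(x, y).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Usage (hypothetical tensors): loss = rbf_mmd2(source_prototypes, target_features)
```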
R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
Changhoon Kim · Kyle Min · Yezhou Yang
In the evolving landscape of text-to-image (T2I) diffusion models, the remarkable capability to generate high-quality images from textual descriptions faces challenges with the potential misuse of reproducing sensitive content. To address this critical issue, we introduce \textbf{R}obust \textbf{A}dversarial \textbf{C}oncept \textbf{E}rase (RACE), a novel approach designed to mitigate these risks by enhancing the robustness of concept erasure methods for T2I models. RACE utilizes a sophisticated adversarial training framework to identify and mitigate adversarial text embeddings, significantly reducing the Attack Success Rate (ASR). Impressively, RACE achieves a 30\% reduction in ASR for the ``nudity'' concept against the leading white-box attack method. Our extensive evaluations demonstrate RACE's effectiveness in defending against both white-box and black-box attacks, marking a significant advancement in protecting T2I diffusion models from generating inappropriate or misleading imagery. This work underlines the essential need for proactive defense measures in adapting to the rapidly advancing field of adversarial challenges.
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li · hangyu guo · Kun Zhou · Wayne Xin Zhao · Ji-Rong Wen
In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that the image input poses an alignment vulnerability for MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input using meticulously crafted images. Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data will be publicly released.
A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks
Yixiang Qiu · Hao Fang · Hongyao Yu · Bin Chen · Meikang Qiu · Shu-Tao Xia
Model Inversion (MI) attacks aim to reconstruct privacy-sensitive training data from released models by utilizing output information, raising extensive concerns about the security of Deep Neural Networks (DNNs). Recent advances in generative adversarial networks (GANs) have contributed significantly to the improved performance of MI attacks due to their powerful ability to generate realistic images with high fidelity and appropriate semantics. However, previous MI attacks have solely disclosed private information in the latent space of GAN priors, limiting their semantic extraction and transferability across multiple target models and datasets. To address this challenge, we propose a novel method, \textbf{I}ntermediate \textbf{F}eatures enhanced \textbf{G}enerative \textbf{M}odel \textbf{I}nversion (IF-GMI), which disassembles the GAN structure and exploits features between intermediate blocks. This allows us to extend the optimization space from latent code to intermediate features with enhanced expressive capabilities. To prevent GAN priors from generating unrealistic images, we apply a ${l}_1$ ball constraint to the optimization process. Experiments on multiple benchmarks demonstrate that our method significantly outperforms previous approaches and achieves state-of-the-art results under various settings, especially in the out-of-distribution (OOD) scenario.
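One common way to enforce an $l_1$ ball constraint during such an optimization is to project the optimized variable back onto the ball after each gradient step. The sketch below is the standard sorting-based projection (Duchi et al., 2008), shown only to illustrate the constraint, not the paper's exact procedure.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a flat vector onto the l1 ball of the given
    radius. In an intermediate-feature optimization, such a projection keeps
    the optimized features inside an l1 ball around their initial values."""
    if np.abs(v).sum() <= radius:
        return v                                   # already feasible
    u = np.sort(np.abs(v))[::-1]                   # magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    cond = u - (css - radius) / idx > 0
    rho = idx[cond][-1]                            # largest index satisfying the condition
    theta = (css[cond][-1] - radius) / rho         # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Usage (hypothetical): feature = init + project_l1_ball(feature - init, radius=10.0)
```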
Spline-based Transformers
Prashanth Chandran · Agon Serifi · Markus Gross · Moritz Bächer
We introduce Spline-based Transformers, a novel class of Transformer models that eliminate the need for positional encoding. Inspired by workflows using splines in computer animation, our Spline-based Transformer embeds an input sequence of elements as a smooth trajectory in latent space. Overcoming drawbacks of positional encoding such as sequence length extrapolation, Spline-based Transformers also provide a novel way for users to interact with transformer latent spaces by directly manipulating the latent control points to create new latent trajectories and sequences. We demonstrate the superior performance of our approach in comparison to conventional positional encoding on a variety of datasets, ranging from synthetic 2D to large-scale real-world datasets of images, 3D shapes, and animations.
Anytime Continual Learning for Open Vocabulary Classification
Zhen Zhu · Yiming Gong · Derek Hoiem
We propose an approach for anytime continual learning (AnytimeCL) for open vocabulary image classification. The AnytimeCL problem aims to break away from batch training and rigid models by requiring that a system can predict any set of labels at any time and efficiently update and improve when receiving one or more training samples at any time. Despite the challenging goal, we achieve substantial improvements over recent methods. We propose a dynamic weighting between predictions of a partially fine-tuned model and a fixed open vocabulary model that enables continual improvement when training samples are available for a subset of a task's labels. We also propose an attention-weighted PCA compression of training features that reduces storage and computation with little impact on model accuracy. Our methods are validated with experiments that test the flexibility of learning and inference.
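For orientation, here is a minimal sketch of the two ingredients described above, assuming the dynamic weight `w` and any attention weighting are supplied externally; both are placeholders for the learned components, and the PCA shown is the plain SVD variant rather than the paper's attention-weighted one.

```python
import numpy as np

def blended_prediction(p_tuned, p_openvocab, w):
    """Blend per-class probabilities from a partially fine-tuned model and a
    fixed open-vocabulary model. `w` in [0, 1] is a placeholder for the
    learned dynamic weighting."""
    return w * p_tuned + (1.0 - w) * p_openvocab

def pca_compress(features, k):
    """Compress (N, D) training features to k dimensions with plain PCA via SVD."""
    mean = features.mean(axis=0, keepdims=True)
    centered = features - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                      # top-k principal directions
    return centered @ basis.T, basis, mean
```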
Weighted Ensemble Models Are Strong Continual Learners
Imad Eddine Marouf · Subhankar Roy · Enzo Tartaglione · Stéphane Lathuiliere
In this work, we study the problem of continual learning (CL) where the goal is to learn a model on a sequence of tasks, such that the data from the previous tasks becomes unavailable while learning on the current task data. CL is essentially a balancing act between being able to learn on the new task (i.e., plasticity) and maintaining the performance on the previously learned concepts (i.e., stability). Intending to address the stability-plasticity trade-off, we propose to perform weight-ensembling of the model parameters of the previous and current tasks. This weighted-ensembled model, which we call Continual Model Averaging (or CoMA), attains high accuracy on the current task by leveraging plasticity, while not deviating too far from the previous weight configuration, ensuring stability. We also propose an improved variant of CoMA, named Continual Fisher-weighted Model Averaging (or CoFiMA), that selectively weighs each parameter in the weights ensemble by leveraging the Fisher information of the weights of the model. Both variants are conceptually simple, easy to implement, and effective in attaining state-of-the-art performance on several standard CL benchmarks. The code is attached to the paper submission.
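As a rough illustration of Fisher-weighted model averaging, the general technique CoFiMA builds on (not its exact recipe), two checkpoints can be merged parameter-wise using their diagonal Fisher estimates:

```python
import torch

def fisher_weighted_average(state_a, state_b, fisher_a, fisher_b, eps=1e-8):
    """Merge two model checkpoints parameter-wise, weighting each entry by its
    diagonal Fisher information. The state and Fisher dicts share keys."""
    merged = {}
    for name in state_a:
        fa, fb = fisher_a[name], fisher_b[name]
        merged[name] = (fa * state_a[name] + fb * state_b[name]) / (fa + fb + eps)
    return merged

def diagonal_fisher(model, loss_fn, data_loader):
    """Approximate the diagonal Fisher by averaging squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for inputs, targets in data_loader:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}
```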
COD: Learning Conditional Invariant Representation for Domain Adaptation Regression
Hao-Ran Yang · Chuan-Xian Ren · You-Wei Luo
Aiming to generalize the label knowledge from a source domain with continuous outputs to an unlabeled target domain, Domain Adaptation Regression (DAR) is developed for complex practical learning problems. However, due to the continuity problem in regression, existing conditional distribution alignment theory and methods with discrete prior, which are proven to be effective in classification settings, are no longer applicable. In this work, focusing on the feasibility problems in DAR, we establish the sufficiency theory for the regression model, which shows the generalization error can be sufficiently dominated by the cross-domain conditional discrepancy. Further, to characterize conditional discrepancy with continuous conditioning variable, a novel Conditional Operator Discrepancy (COD) is proposed, which admits the metric property on conditional distributions via the kernel embedding theory. Finally, to minimize the discrepancy, a COD-based conditional invariant representation learning model is proposed, and the reformulation is derived to show that reasonable modifications on moment statistics can further improve the discriminability of the adaptation model. Extensive experiments on standard DAR datasets verify the validity of theoretical results and the superiority over SOTA DAR methods.
On the Topology Awareness and Generalization Performance of Graph Neural Networks
Junwei Su · Chuan Wu
Many computer vision and machine learning problems are modelled as learning tasks on graphs, where graph neural networks (GNNs) have emerged as a dominant tool for learning representations of graph-structured data. A key feature of GNNs is their use of graph structures as input, enabling them to exploit the graphs' inherent topological properties—known as the topology awareness of GNNs. Despite the empirical successes of GNNs, the influence of topology awareness on generalization performance remains unexplored, particularly for node-level tasks that diverge from the assumption of data being independent and identically distributed (I.I.D.). The precise definition and characterization of the topology awareness of GNNs, especially concerning different topological features, are still unclear. This paper introduces a comprehensive framework to characterize the topology awareness of GNNs across any topological feature. Using this framework, we investigate the effects of topology awareness on GNN generalization performance. Contrary to the prevailing belief that enhancing the topology awareness of GNNs is always advantageous, our analysis reveals a critical insight: improving the topology awareness of GNNs may inadvertently lead to unfair generalization across structural groups, which might not be desired in some scenarios. Additionally, we conduct a case study using the intrinsic graph metric, the shortest-path distance, on various benchmark datasets. The empirical results of this case study confirm our theoretical insights. Moreover, we demonstrate the practical applicability of our framework by using it to tackle the cold start problem in graph active learning.
Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning
Qihao Zhao · YALUN DAI · Shen Lin · Wei Hu · Fan Zhang · Jun Liu
In real-world scenarios, knowledge distributions often exhibit a long tail. Humans manage to master knowledge uniformly across such imbalanced distributions, a feat attributed to their diligent practices of reviewing, summarizing, and correcting errors. Motivated by this learning process, we propose a novel learning paradigm, called reflective learning, for handling long-tail recognition. Our method integrates three processes: reviewing past predictions during training, summarizing and leveraging the feature relations across classes, and correcting gradient conflicts between loss functions. These designs are lightweight enough to plug and play with existing long-tail learning methods, achieving state-of-the-art performance on popular long-tail visual benchmarks. The experimental results highlight the great potential of reflective learning in dealing with long-tail recognition. Our code will be open-sourced upon acceptance.
Model Stock: All we need is just a few fine-tuned models
Dong-Hwan Jang · Sangdoo Yun · Dongyoon Han
This paper introduces a novel fine-tuning method for large pre-trained models, offering strong performance with greater efficiency. Breaking away from traditional practices that average a multitude of fine-tuned models for accuracy improvements, our approach uses significantly fewer models to optimize the final weights and yet achieves superior accuracy. Based on key observations of the dynamics in fine-tuned models' weight space, our novel layer-wise averaging technique can surpass state-of-the-art model averaging methods such as Model Soup with just two fine-tuned models. We aptly coin this strategy Model Stock, reflecting its reliance on selecting very few models to derive a better-optimized averaged model. We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures, achieving remarkable performance on both in-distribution (ID) and out-of-distribution (OOD) tasks on standard benchmarks, all while adding barely any extra computational demands. Our code and pre-trained models will be made publicly available.
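For orientation only, a generic layer-wise weight-merging sketch with two fine-tuned checkpoints and a pre-trained anchor is shown below; the fixed ratio `alpha` is a placeholder, whereas Model Stock derives its per-layer ratios from geometric properties of the weight space.

```python
def layerwise_interpolation(pretrained, finetuned_a, finetuned_b, alpha=0.5):
    """Average the two fine-tuned checkpoints layer by layer, then interpolate
    each layer between that average and the pre-trained anchor weights."""
    merged = {}
    for name, w_pre in pretrained.items():
        w_avg = 0.5 * (finetuned_a[name] + finetuned_b[name])
        merged[name] = alpha * w_avg + (1.0 - alpha) * w_pre
    return merged
```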
A Direct Approach to Viewing Graph Solvability
Federica Arrigoni · Andrea Fusiello · Tomas Pajdla
The viewing graph is a useful way to represent uncalibrated cameras and their geometric relationships: nodes correspond to cameras and edges represent fundamental matrices. By analyzing this graph, it is possible to establish if the problem is "solvable" in the sense that there exists a unique (up to a single projective transformation) set of cameras that are compliant with the given fundamental matrices. In this paper, we take several steps forward in the study of viewing graph solvability: we propose a new formulation of the problem that is more direct than previous literature, based on a formula that explicitly links pairs of cameras via their fundamental matrix; we introduce the new concept of "infinitesimal solvability", demonstrating its usefulness in understanding real structure from motion graphs; we propose an algorithm for testing infinitesimal solvability and extracting components of unsolvable cases, that is more efficient than previous work; we set up an open research question on the connection between infinitesimal solvability and solvability.
ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
Denis Zavadski · Johann-Friedrich Feiden · Carsten Rother
The field of image synthesis has made tremendous strides forward in recent years. Besides defining the desired output image with text prompts, an intuitive approach is to additionally use spatial guidance in the form of an image, such as a depth map. In state-of-the-art approaches, this guidance is realized by a separate controlling model that controls a pre-trained image generation network, such as a latent diffusion model. Understanding this process from a control-system perspective shows that it forms a feedback-control system, where the control module receives a feedback signal from the generation process and sends a corrective signal back. When analysing existing systems, we observe that the feedback signals are sparse in time and carry only a small number of bits. As a consequence, there can be long delays between newly generated features and the respective corrective signals for these features. Such delays are known to be the most unwanted aspect of any control system. In this work, we take an existing controlling network (ControlNet) and change the communication between the controlling network and the generation process to be of high frequency and large bandwidth. By doing so, we are able to considerably improve the quality of the generated images, as well as the fidelity of the control. Also, the controlling network needs noticeably fewer parameters and hence is about twice as fast during inference and training. Another benefit of small-sized models is that they help to democratise our field and are likely easier to understand. We call our proposed network ControlNet-XS. When comparing with state-of-the-art approaches, we outperform them for pixel-level guidance, such as depth, canny edges, and semantic segmentation, and are on a par for loose keypoint guidance of human poses. All code and pre-trained models will be made publicly available.
A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
Tahmina Khanam · Mohammed Bennamoun · Guan Wang · Guanjin Wang · Ferdous Sohel · Farid Boussaid · Anuj Srivastava · Hamid Laga
We propose the first comprehensive approach for modeling and analyzing the spatiotemporal shape variability in tree-like 4D objects, i.e., 3D objects whose shapes bend, stretch and change in their branching structure over time as they deform, grow, and interact with their environment. Our key contribution is the representation of tree-like 3D shapes using Square Root Velocity Function Trees (SRVFT). By solving the spatial registration in the SRVFT space, which is equipped with an $L^2$ metric, 4D tree-shaped structures become time-parameterized trajectories in this space. This reduces the problem of modeling and analyzing 4D tree-like shapes to that of modeling and analyzing elastic trajectories in the SRVFT space, where elasticity refers to time warping. In this paper, we propose a novel mathematical representation of the shape space of such trajectories, a Riemannian metric on that space, and computational tools for fast and accurate spatiotemporal registration and geodesics computation between 4D tree-shaped structures. Leveraging these building blocks, we develop a full framework for modelling the spatiotemporal variability using statistical models and generating novel 4D tree-like structures from a set of exemplars. We demonstrate and validate the proposed framework using real 4D plant data. The code will be available on Github.
Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering
Benjamin Attal · Dor Verbin · Ben Mildenhall · Peter Hedman · Jonathan T Barron · Matthew O'Toole · Pratul Srinivasan
State-of-the-art techniques for 3D reconstruction are largely based on volumetric scene representations, which require sampling multiple points to compute the color arriving along a ray. Using these representations for more general inverse rendering --- reconstructing geometry, materials, and lighting from observed images --- is challenging because recursively path-tracing such volumetric representations is prohibitively expensive. Recent works alleviate this issue through the use of radiance caches: data structures that store the steady-state, infinite-bounce radiance arriving at any point from any direction. However, these solutions rely on approximations that introduce bias into the renderings and, more importantly, into the gradients used for optimization. We present a method that avoids these approximations, while remaining computationally efficient. In particular, we leverage two techniques to reduce variance for unbiased estimators of the rendering equation: (1) a trainable, occlusion-aware importance sampler for incoming illumination and (2) a fast cache architecture that can be used as a control variate for the radiance from a high-quality, but more expensive, volumetric cache. We show that our approach, and removing these biases, improves the quality of recovered geometry and materials, especially in the presence of effects like specular reflections.
Shape from Heat Conduction
Sriram Narayanan · Mani Ramanagopal · Mark Sheinin · Aswin C. Sankaranarayanan · Srinivasa G. Narasimhan
Thermal cameras measure the temperature of objects based on radiation emitted in the infrared spectrum. In this work, we propose a novel shape recovery approach that exploits the properties of heat transport, specifically heat conduction, induced on objects when illuminated using simple light bulbs. While the resulting heat transport occurs in the entirety of an object's volume, we show a surface approximation that enables shape recovery and empirically analyze its validity for objects with varying thicknesses. We develop an algorithm that solves a linear system of equations to estimate, for the first time from thermal videos, the intrinsic shape Laplacian along with several properties including heat capacity, convection coefficient, and absorbed heat flux, under uncalibrated lighting and for arbitrary shapes. Further, we propose a novel shape-from-Laplacian objective that aims to resolve the inherent shape ambiguities by drawing insights from absorbed heat flux images under two unknown light sources. Finally, we devise a coarse-to-fine refinement strategy that faithfully recovers low- and high-frequency shape details. We validate our method by showing accurate reconstructions, to within an error of 1-2 mm, in both simulations and from noisy thermal videos of real-world objects with complex shapes and materials.
Rasterized Edge Gradients: Handling Discontinuities Differentially
Stanislav Pidhorskyi · Tomas Simon · Gabriel Schwartz · He Wen · Yaser Sheikh · Jason Saragih
Computing the gradients of a rendering process is paramount for diverse applications in computer vision and graphics. However, accurate computation of these gradients is challenging due to discontinuities and rendering approximations, particularly for surface-based representations and rasterization-based rendering. We present a novel method for computing gradients at visibility discontinuities for rasterization-based differentiable renderers. Our method elegantly simplifies the traditionally complex problem through a carefully designed approximation strategy, allowing for a straightforward, effective, and performant solution. We introduce a novel concept of micro-edges, which allows us to treat the rasterized images as outcomes of a differentiable, continuous process aligned with the inherently non-differentiable, discrete-pixel rasterization. This technique eliminates the necessity for rendering approximations or other modifications to the forward pass, preserving the integrity of the rendered image, which makes it applicable to rasterized masks, depth, and normals images where filtering is prohibitive. Utilizing micro-edges simplifies gradient interpretation at discontinuities and enables handling of geometry intersections, offering an advantage over the prior art. We showcase our method in dynamic human head scene reconstruction, demonstrating effective handling of camera images and segmentation masks.
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
Seung Hyun Lee · Yinxiao Li · Junjie Ke · Innfarn Yoo · Han Zhang · Jiahui Yu · Qifei Wang · Fei Deng · Glenn Entis · Junfeng He · Gang Li · Sangpil Kim · Irfan Essa · Feng Yang
Recent reinforcement learning (RL) works have demonstrated that using multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights poses challenges and may cause over-optimization of certain metrics. To solve this, we propose Parrot, which addresses the issue through multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto optimality. Utilizing batch-wise Pareto-optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, resulting in significant improvements in image quality and also allowing the trade-off between different rewards to be controlled with a reward-related prompt at inference time. Furthermore, we introduce original-prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
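Batch-wise Pareto-optimal selection amounts to keeping the non-dominated samples of a batch under multiple rewards. A plain illustration of that selection step (not the paper's full RL update) is:

```python
import numpy as np

def pareto_front(rewards):
    """Return indices of non-dominated samples in a (num_samples, num_rewards)
    array where larger is better."""
    n = rewards.shape[0]
    keep = []
    for i in range(n):
        others = np.delete(rewards, i, axis=0)
        # i is dominated if some other sample is >= on every reward and > on at least one
        dominated = np.any(
            np.all(others >= rewards[i], axis=1) & np.any(others > rewards[i], axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep

# Example rewards: aesthetics, human preference, alignment, sentiment
scores = np.array([[0.8, 0.6, 0.7, 0.5],
                   [0.7, 0.9, 0.6, 0.6],
                   [0.6, 0.5, 0.5, 0.4]])   # third sample is dominated by the first
print(pareto_front(scores))                  # -> [0, 1]
```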
HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
Xiang Zhang · Yulun Zhang · Fisher Yu
Transformers have exhibited promising performance in computer vision tasks including image super-resolution (SR). However, popular transformer-based SR methods often employ window self-attention with computational complexity quadratic in the window size, resulting in fixed small windows with limited receptive fields. In this paper, we present a general strategy to convert transformer-based SR networks to hierarchical transformers (HiT-SR), boosting SR performance with multi-scale features while maintaining an efficient design. Specifically, we first replace the commonly used fixed small windows with expanding hierarchical windows to aggregate features at different scales and establish long-range dependencies. Considering the intensive computation required for large windows, we further design a spatial-channel correlation method with complexity linear in the window size, efficiently gathering spatial and channel information from hierarchical windows. Extensive experiments verify the effectiveness and efficiency of our HiT-SR, and our improved versions of SwinIR-Light, SwinIR-NG, and SRFormer-Light yield state-of-the-art SR results with fewer parameters, fewer FLOPs, and faster speeds ($\sim7\times$).
S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis
Dongze Li · Kang Zhao · WEI WANG · Yifeng Ma · Bo Peng · Yingya Zhang · Jing Dong
Talking head synthesis is a practical technique with wide applications. Current Neural Radiance Field (NeRF) based approaches have shown their superiority in driving one-shot talking heads with videos or with signals regressed from audio. However, most of them fail to take audio directly as the driving information and are thus unable to enjoy the flexibility and availability of speech. Since mapping audio signals to face deformation is non-trivial, we design a Single-Shot Speech-Driven Neural Radiance Field (S^3D-NeRF) method in this paper to tackle three difficulties: learning a representative appearance feature for each identity, modeling the motion of different face regions with audio, and keeping the temporal consistency of the lip area. To this end, we introduce a Hierarchical Facial Appearance Encoder to learn multi-scale representations that capture the appearance of different speakers, and elaborate a Cross-modal Facial Deformation Field to perform speech animation according to the relationship between the audio signal and different face regions. Moreover, to enhance the temporal consistency of the important lip area, we introduce a lip-sync discriminator to penalize out-of-sync audio-visual sequences. Extensive experiments show that our S^3D-NeRF surpasses previous arts in both video fidelity and audio-lip synchronization.
Loc3Diff: Local Diffusion for 3D Human Head Synthesis and Editing
Yushi Lan · Feitong Tan · Qiangeng Xu · Di Qiu · Kyle Genova · Zeng Huang · Rohit Pandey · Sean Fanello · Thomas Funkhouser · Chen Change Loy · Yinda Zhang
We present a novel framework for generating photo-realistic 3D human heads and subsequently manipulating and reposing them with remarkable flexibility. The proposed approach constructs an implicit representation of 3D human heads, anchored on a parametric face model. To enhance representational capability and encode spatial information, we represent each semantically consistent head region by a local tri-plane, modulated by a 3D Gaussian. Additionally, we parameterize these tri-planes in a 2D UV space via a 3DMM, enabling effective utilization of the diffusion model for 3D head avatar generation. Our method facilitates the creation of diverse and realistic 3D human heads with flexible global and fine-grained region-based editing over facial structure, appearance, and expressions. Extensive experiments demonstrate the effectiveness of our method.
PAV: Personalized Head Avatar from Unstructured Video Collection
Akin Caliskan · Berkay Kicanaoglu · H K
We propose PAV, Personalized Head Avatar, for the synthesis of human faces under arbitrary viewpoints and facial expressions. PAV introduces a method that learns a dynamic deformable neural radiance field (NeRF), in particular from a collection of monocular talking face videos of the same character under various appearance and shape changes. Unlike existing head NeRF methods that are limited to modeling such input videos on a per-appearance basis, our method allows for learning multi-appearance NeRFs, introducing an appearance embedding for each input video via learnable latent neural features attached to the underlying geometry. Furthermore, the proposed appearance-conditioned density formulation facilitates the shape variation of the character, such as facial hair and soft tissues, in the radiance field prediction. To the best of our knowledge, our approach is the first dynamic deformable NeRF framework to model appearance and shape variations in a single unified network for multiple appearances of the same subject. We demonstrate experimentally that PAV outperforms the baseline method in terms of visual rendering quality in our extensive quantitative and qualitative studies on various subjects.
Expressive Whole-Body 3D Gaussian Avatar
Gyeongsik Moon · Takaaki Shiratori · Shunsuke Saito
Facial expressions and hand motions are necessary to express our emotions and interact with the world. Nevertheless, most 3D human avatars modeled from a casually captured video support only body motions, without facial expressions and hand motions. In this work, we present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video. We design ExAvatar as a combination of the whole-body parametric mesh model (SMPL-X) and 3D Gaussian Splatting (3DGS). The main challenges are 1) the limited diversity of facial expressions and poses in the video and 2) the absence of 3D observations, such as 3D scans and RGBD images. The limited diversity in the video makes animations with novel facial expressions and poses non-trivial. In addition, the absence of 3D observations can cause significant ambiguity in human parts that are not observed in the video, which can result in noticeable artifacts under novel motions. To address these issues, we introduce a hybrid representation of the mesh and 3D Gaussians. Our hybrid representation treats each 3D Gaussian as a vertex on the surface with pre-defined connectivity information (i.e., triangle faces) following the mesh topology of SMPL-X. This makes ExAvatar animatable with novel facial expressions, driven by the facial expression space of SMPL-X. In addition, by using connectivity-based regularizers, we significantly reduce artifacts in novel facial expressions and poses. Code and pre-trained weights will be publicly available.
High-Quality Mesh Blendshape Generation from Face Videos via Neural Inverse Rendering
Xin Ming · Jiawei Li · Jingwang Ling · Libo Zhang · Feng Xu
Readily editable mesh blendshapes have been widely used in animation pipelines, while recent advancements in neural geometry and appearance representations have enabled high-quality inverse rendering. Building upon these observations, we introduce a novel technique that reconstructs mesh-based blendshape rigs from single or sparse multi-view videos, leveraging state-of-the-art neural inverse rendering. We begin by constructing a deformation representation that parameterizes vertex displacements into differential coordinates with tetrahedral connections, allowing for high-quality vertex deformation on high-resolution meshes. By constructing a set of semantic regulations in this representation, we achieve joint optimization of blendshapes and expression coefficients. Furthermore, to enable a user-friendly multi-view setup with unsynchronized cameras, we use a neural regressor to model time-varying motion parameters. Experiments demonstrate that, with the flexible input of single or sparse multi-view videos, we reconstruct personalized high-fidelity blendshapes. These blendshapes are both geometrically and semantically accurate, and they are compatible with industrial animation pipelines. Code and data are available at https://github.com/grignarder/high-quality-blendshape-generation.
Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement
Lingyu Zhu · Wenhan Yang · Baoliang Chen · Hanwei Zhu · Zhangkai Ni · Qi Mao · Shiqi Wang
Obtaining pairs of low-/normal-light videos with motion is far more challenging than collecting still images, which raises technical issues and makes unpaired learning a critical technical route. This paper pursues low-light video enhancement without using paired ground truth. Compared to low-light image enhancement, enhancing low-light videos is more difficult due to the intertwined effects of noise, exposure, and contrast in the spatial domain, together with the need for temporal coherence. To address this challenge, we propose the Unrolled Decomposed Unpaired Network (UDU-Net), which enhances low-light videos by unrolling the optimization functions into a deep network that decomposes the signal into spatial and temporal-related factors, updated iteratively. Firstly, we formulate low-light video enhancement as a Maximum A Posteriori (MAP) estimation problem with carefully designed spatial and temporal visual regularization. Then, by unrolling the problem, the optimization of the spatial and temporal constraints can be decomposed into different steps and updated in a stage-wise manner. From the spatial perspective, we design the Intra subnet, which leverages unpaired prior information from expert photographic retouching to adjust the statistical distribution. Additionally, we introduce a novel mechanism that integrates human perception feedback to guide network optimization, suppressing over- and under-exposure. Meanwhile, to address the issue from the temporal perspective, the designed Inter subnet fully exploits temporal cues in progressive optimization, which helps achieve improved temporal consistency in the enhancement results. Consequently, the proposed method achieves superior performance to state-of-the-art methods in video illumination, noise suppression, and temporal consistency across outdoor and indoor scenes. Our code will be available at https://github.com/xxxx.
Image Demoireing in RAW and sRGB Domains
Shuning Xu · Binbin Song · Xiangyu Chen · Xina Liu · Jiantao Zhou
Moiré patterns frequently appear when capturing screens with smartphones or cameras, potentially compromising image quality. Previous studies suggest that moiré pattern elimination in the RAW domain offers greater effectiveness compared to demoiréing in the sRGB domain. Nevertheless, relying solely on RAW data for image demoiréing is insufficient to mitigate the color cast, due to the absence of essential information required for color correction by the image signal processor (ISP). In this paper, we propose to jointly utilize both RAW and sRGB data for image demoiréing (RRID), which are readily accessible in modern smartphones and DSLR cameras. We develop a Skip-Connection-based Demoiréing Module (SCDM) with a Gated Feedback Module (GFM) and a Frequency Selection Module (FSM) embedded in the skip-connections for the efficient and effective demoiréing of RAW and sRGB features, respectively. Subsequently, we design an RGB-Guided ISP (RGISP) to learn a device-dependent ISP, assisting the process of color recovery. Extensive experiments demonstrate that our RRID outperforms state-of-the-art approaches in moiré pattern removal and color cast correction, by 0.62 dB in PSNR and 0.003 in SSIM.
Multiscale Sliced Wasserstein Distances as Perceptual Color Difference Measures
Jiaqi He · Zhihua Wang · Leon Wang · Tsein-I Liu · Yuming Fang · Qilin Sun · Kede Ma
Many contemporary perceptual color difference (CD) metrics operate at the pixel level, entailing per-pixel difference calculation and a subsequent global average operation to ascertain the overall CD. Nonetheless, these metrics inadequately deliver precise CD assessments for misaligned photographic image pairs, particularly in cases involving disparities in image layout and object position. In this paper, we leverage the Sliced Wasserstein Distance to formulate a novel perceptual CD metric that holistically assesses images, specifically tailored for image pairs that are not perfectly aligned. To enhance adaptability to varying image resolutions and viewing conditions, such as display resolution and viewing distance, the proposed metric operates on multiple scales of images. Our method is conceptually straightforward and does not necessitate a training process. Quantitative and qualitative experiments demonstrate that our metric achieves a state-of-the-art performance in assessing CDs for non-aligned image pairs, displaying a high degree of agreement with human visual perception. We also conducted additional tests of our metric in image colorization and video color transfer tasks. The experimental results indicate that our metric effectively emphasizes color information while accommodating variations in content.
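At the core of the metric is a sliced Wasserstein distance between the color distributions of the two images, evaluated at several scales. A single-scale sketch for equally sized color sets (flatten each HxWx3 image to an (H*W, 3) array first) is:

```python
import numpy as np

def sliced_wasserstein(colors_a, colors_b, n_projections=64, seed=0):
    """Sliced Wasserstein-1 distance between two equally sized sets of pixel
    colors of shape (N, 3): project onto random unit directions, sort the 1D
    projections, and average the absolute differences. A multiscale metric
    would apply this at several image resolutions."""
    assert colors_a.shape == colors_b.shape
    rng = np.random.default_rng(seed)
    dims = colors_a.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=dims)
        theta /= np.linalg.norm(theta)          # random unit direction
        pa = np.sort(colors_a @ theta)          # 1D projected distributions
        pb = np.sort(colors_b @ theta)
        total += np.abs(pa - pb).mean()         # W1 between sorted projections
    return total / n_projections
```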
Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy
Fadlullah Raji · John Murray-Bruce
Conventional imaging requires a line of sight to create accurate visual representations of a scene. In certain circumstances, however, obtaining a suitable line of sight may be impractical, dangerous, or even impossible. Non-line-of-sight (NLOS) imaging addresses this challenge by reconstructing the scene from indirect measurements. Recently, passive NLOS methods that use an ordinary photograph of the subtle shadow cast onto a visible wall by the hidden scene have gained interest. These methods are currently limited to 1D or low-resolution 2D color imaging, or localization of a hidden object whose shape is approximately known. Here, we generalize this class of methods and demonstrate a 3D reconstruction of a hidden scene from an ordinary NLOS photograph. To achieve this, we propose a novel reformulation of the light transport model that conveniently decomposes the hidden scene into light-occluding and non-light-occluding components to yield a separable non-linear least squares (SNLLS) inverse problem for reconstructing the hidden scene. We develop two solutions: A gradient-based optimization method and a physics-inspired neural network approach, which we call Soft Shadow Diffusion (SSD). Despite the challenging ill-conditioned inverse problem encountered here, our approaches are effective on numerous 3D scenes in real experimental scenarios. Although SSD is trained in simulation only, it generalizes well to both unseen classes and real-world NLOS scenes.
Single-Mask Inpainting for Voxel-based Neural Radiance Fields
Jiafu Chen · Tianyi Chu · Jiakai Sun · Wei Xing · Lei Zhao
3D inpainting is a challenging task in computer vision and graphics that aims to remove objects and fill in missing regions with a visually coherent and complete representation of the background. A few methods have been proposed to address this problem, yielding notable inpainting results. However, these methods have not fully overcome the limitation of relying on a mask for each view. Obtaining masks for every view can be time-consuming and reduces quality, especially in scenarios with a large number of views or complex scenes. To address this limitation, we propose an innovative approach that eliminates the need for per-view masks and uses a single mask from a selected view. We focus on improving the quality of forward-facing scene inpainting. By unprojecting the single 2D mask into the NeRF space, we define the regions that require inpainting in three dimensions. We introduce a two-step optimization process. Firstly, we utilize 2D inpainters to generate color and depth priors for the selected view, providing rough supervision for the area to be inpainted. Secondly, we incorporate a 2D diffusion model to enhance the quality of the inpainted regions, reducing distortions and elevating overall visual fidelity. Through extensive experiments, we demonstrate the effectiveness of our single-mask inpainting framework. The results show that our approach successfully inpaints complex geometry and produces visually plausible and realistic outcomes. Our code will be released.
IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination
Xi Chen · Sida Peng · Dongchen Yang · Yuan Liu · Bowen Pan · Chengfei Lyu · Xiaowei Zhou
This paper aims to recover object materials from posed images captured under an unknown static lighting condition. Recent methods solve this task by representing materials using neural networks and optimizing model parameters through differentiable physically based rendering. However, due to the coupling between object geometry, materials, and environment lighting, there is inherent ambiguity during the inverse rendering process, preventing previous methods from obtaining accurate results. To overcome this ill-posed problem, our key idea is to learn the material prior with a generative model for regularizing the optimization process. We observe that the general rendering equation can be split into diffuse and specular shading terms, and thus formulate the material prior as diffusion models of albedo and specular. Thanks to this design, our model can be trained using the existing abundant 3D object data, and naturally acts as a versatile tool to resolve the ambiguity when recovering many material representations from RGB images. In addition, we develop a coarse-to-fine training strategy that leverages estimated materials to guide diffusion models to produce multi-view consistent constraints, leading to more stable and accurate results. Extensive experiments on real-world and synthetic datasets demonstrate that our approach achieves state-of-the-art performance on material recovery. The code will be released upon acceptance.
DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly
Fenggen Yu · Yiming Qian · Xu Zhang · Francisca Gil-Ureta · Brian Jackson · Eric Bennett · Hao Richard Zhang
We present a differentiable rendering framework to learn structured 3D abstractions in the form of primitive assemblies from sparse RGB images capturing a 3D object. By leveraging differentiable volume rendering, our method does not require 3D supervision. Architecturally, our network follows the general pipeline of an image-conditioned neural radiance field (NeRF) exemplified by pixelNeRF for color prediction. As our core contribution, we introduce differential primitive assembly (DPA) into NeRF to output a 3D occupancy field in place of density prediction, where the predicted occupancies serve as opacity values for volume rendering. Our network, coined DPA-Net, produces a union of convexes, each as an intersection of convex quadric primitives, to approximate the target 3D object, subject to an abstraction loss and a masking loss, both defined in the image space upon volume rendering. With test-time adaptation and additional sampling and loss designs aimed at improving the accuracy and compactness of the obtained assemblies, our method demonstrates superior performance over state-of-the-art alternatives for 3D primitive abstraction from sparse views.
NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis
Yubin Hu · Xiaoyang Guo · Yang Xiao · Jingwei Huang · Yong-Jin Liu
This paper presents NGP-RT, a novel approach for enhancing the rendering speed of Instant-NGP to achieve real-time novel view synthesis. As a classic NeRF-based method, Instant-NGP stores implicit features in multi-level grids or hash tables and applies a shallow MLP to convert the implicit features into explicit colors and densities. Although it achieves fast training speed, there is still a lot of room for improvement in its rendering speed due to the per-point MLP executions for implicit multi-level feature aggregation, especially for real-time applications. To address this challenge, our proposed NGP-RT explicitly stores colors and densities as hash features, and leverages a lightweight attention mechanism to disambiguate the hash collisions instead of using computationally intensive MLP. At the rendering stage, NGP-RT incorporates a pre-computed occupancy distance grid into the ray marching strategy to inform the distance to the nearest occupied voxel, thereby reducing the number of marching points and global memory access. Experimental results show that on the challenging Mip-NeRF360 dataset, NGP-RT achieves better rendering quality than previous NeRF-based methods, achieving \textbf{108 fps} at \textbf{1080p} resolution on a single Nvidia RTX 3090 GPU. Our approach is promising for NeRF-based real-time applications that require efficient and high-quality rendering.
CaesarNeRF: Calibrated Semantic Representation for Few-Shot Generalizable Neural Rendering
Haidong Zhu · Tianyu Ding · Tianyi Chen · Ilya Zharkov · Ram Nevatia · Luming Liang
Generalizability and few-shot learning are key challenges in Neural Radiance Fields (NeRF), often due to the lack of a holistic understanding in pixel-level rendering. We introduce CaesarNeRF, an end-to-end approach that leverages scene-level CAlibratEd SemAntic Representation along with pixel-level representations to advance few-shot, generalizable neural rendering, facilitating a holistic understanding without compromising high-quality details. CaesarNeRF explicitly models pose differences of reference views to combine scene-level semantic representations, providing a calibrated holistic understanding. This calibration process aligns various viewpoints with precise location and is further enhanced by sequential refinement to capture varying details. Extensive experiments on public datasets, including LLFF, Shiny, mip-NeRF 360, and MVImgNet, show that CaesarNeRF delivers state-of-the-art performance across varying numbers of reference views, proving effective even with a single reference image.
2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction
Atsuya Nakata · Takao Yamanaka
Omni-directional images have been increasingly used in various applications, including virtual reality and SNS (Social Networking Services). However, their availability is comparatively limited in contrast to normal field-of-view (NFoV) images, since specialized cameras are required to take omni-directional images. Consequently, several methods based on generative adversarial networks (GANs) have been proposed to synthesize omni-directional images, but these approaches have shown difficulties in training, due to instability and/or significant time consumption. To address these problems, this paper proposes a novel omni-directional image synthesis method, 2S-ODIS (Two-Stage Omni-Directional Image Synthesis), which generates high-quality omni-directional images while drastically reducing the training time. This is realized by utilizing a VQGAN (Vector Quantized GAN) model pre-trained on a large-scale NFoV image database such as ImageNet, without fine-tuning. Since this pre-trained model does not represent the distortions of omni-directional images in the equi-rectangular projection (ERP), it cannot be applied directly to omni-directional image synthesis in ERP. Therefore, a two-stage structure is adopted: a global coarse image is first created in ERP, and the image is then refined by integrating multiple local NFoV images at higher resolution to compensate for the distortions in ERP, both stages being based on the pre-trained VQGAN model. As a result, the proposed method, 2S-ODIS, reduces the training time from 14 days for OmniDreamer to four days while achieving higher image quality.
Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction
Xinhang Liu · Jiaben Chen · Shiu-Hong Kao · Yu-Wing Tai · Chi-Keung Tang
Novel view synthesis via Neural Radiance Fields (NeRFs) or 3D Gaussian Splatting (3DGS) typically necessitates dense observations with hundreds of input images to circumvent artifacts. We introduce Deceptive-NeRF/3DGS to enhance sparse-view reconstruction with only a limited set of input images, by leveraging a diffusion model pre-trained from multiview datasets. Different from using diffusion priors to regularize representation optimization, our method directly uses diffusion-generated images to train NeRF/3DGS as if they were real input views. Specifically, we propose a deceptive diffusion model turning noisy images rendered from few-view reconstructions into high-quality photorealistic pseudo-observations. To resolve consistency among pseudo-observations and real input views, we develop an uncertainty measure to guide the diffusion model's generation. Our system progressively incorporates diffusion-generated pseudo-observations into the training image sets, ultimately densifying the sparse input observations by 5 to 10 times. Extensive experiments across diverse and challenging datasets validate that our approach outperforms existing state-of-the-art methods and is capable of synthesizing novel views with super-resolution in the few-view setting, a feature not typically achievable by other competing approaches. Codes will be released upon acceptance.
Deep Polarization Cues for Single-shot Shape and Subsurface Scattering Estimation
chenhao li · Trung Thanh Ngo · Hajime Nagahara
In this work, we propose a novel learning-based method to jointly estimate the shape and subsurface scattering (SSS) parameters of translucent objects by utilizing polarization cues. Although polarization cues have been used in various applications, such as shape from polarization (SfP), BRDF estimation, and reflection removal, their application in SSS estimation has not yet been explored. Our observations indicate that the SSS affects not only the light intensity but also the polarization signal. Consequently, the polarization signal can provide additional cues for SSS estimation. We also introduce the first large-scale synthetic dataset of polarized translucent objects for training our model. Our method outperforms several baselines from the SfP and inverse rendering realms on both synthetic and real data, as demonstrated by qualitative and quantitative results.
High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs
Ruikang Xu · Mingde Yao · Yue Li · Yueyi Zhang · Zhiwei Xiong
Novel view synthesis has achieved remarkable quality and efficiency with the paradigm of 3D Gaussian Splatting (3D-GS), but still faces two challenges: 1) significant performance degradation when trained with only few-shot samples due to a lack of geometry constraints, and 2) the inability to render at a resolution beyond that of the training samples. In this paper, we propose Dual-Lens 3D-GS (DL-GS) to achieve high-resolution (HR) and few-shot view synthesis, by leveraging the characteristics of the asymmetric dual-lens system commonly equipped on mobile devices. This kind of system captures the same scene with different focal lengths (\textit{i.e.}, wide-angle and telephoto) under an asymmetric stereo configuration, which naturally provides geometric hints for few-shot training and HR guidance for resolution improvement. Nevertheless, there remain two major technical problems in achieving this goal. First, how to effectively exploit the geometry information from the asymmetric stereo configuration? To this end, we propose a consistency-aware training strategy, which integrates a dual-lens-consistent loss to regularize the 3D-GS optimization. Second, how to make the best use of the dual-lens training samples to effectively improve the resolution of newly synthesized views? To this end, we design a multi-reference-guided refinement module to select proper telephoto and wide-angle guided images from training samples based on camera pose distances, and then exploit their information for high-frequency detail enhancement. Extensive experiments on simulated and real-captured datasets validate the distinct superiority of our DL-GS over various competitors on the task of HR and few-shot view synthesis.
Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction
Rui Peng · Shihe Shen · Kaiqiang Xiong · Huachen Gao · Jianbo Jiao · Xiaodong Gu · Ronggang Wang
Generalizable neural surface reconstruction methods have attracted widespread attention due to their superiority in both reconstruction speed and quality, especially in sparse settings. However, existing methods are impeded by memory constraints or the requirement of ground-truth depths, and cannot recover satisfactory geometric details. To this end, we propose \textit{SuRF}, a new Surface-centric framework that incorporates a new Region sparsification based on a matching Field, achieving good trade-offs between performance, efficiency and scalability. To our knowledge, this is the first unsupervised method achieving end-to-end sparsification powered by the introduced matching field, which leverages the weight distribution to efficiently locate the boundary regions containing the surface. Instead of predicting an SDF value for each voxel, we present a new region sparsification that sparsifies the volume by judging whether a voxel lies inside the surface region. In this way, our model can exploit higher-frequency features around the surface with lower memory and computational consumption. Extensive experiments on popular datasets demonstrate that our reconstructions exhibit high-quality details and achieve new state-of-the-art performance. We promise to release our code once the paper is accepted.
MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views
Wangze Xu · Huachen Gao · Shihe Shen · Rui Peng · Jianbo Jiao · Ronggang Wang
Recently, the advancement of the Neural Radiance Field (NeRF) has facilitated few-shot Novel View Synthesis (NVS), which presents a significant challenge in 3D vision applications. Despite numerous attempts to reduce the dense input requirement in NeRF, it still suffers from time-consuming training and rendering processes. More recently, 3D Gaussian Splatting (3DGS) achieves real-time high-quality rendering with an explicit point-based representation. However, similar to NeRF, it tends to overfit the training views due to a lack of constraints. In this paper, we propose a few-shot NVS method that excavates the multi-view priors based on 3D Gaussian Splatting. We leverage the recent learning-based Multi-view Stereo (MVS) to enhance the quality of geometric initialization for 3DGS. To mitigate overfitting, we propose a forward-warping method for additional appearance constraints conforming to scenes based on the computed geometry. Furthermore, to facilitate proper convergence of optimization, we introduce a view-consistent geometry constraint for Gaussian parameters and utilize a monocular depth regularization as compensation. Experiments show that the proposed method achieves state-of-the-art performance with real-time rendering speed.
When zooming between the dual cameras on a mobile device, noticeable jumps in geometric content and image color occur in the preview, inevitably affecting the user’s zoom experience. In this work, we introduce a new task, \textit{i.e.}, dual-camera smooth zoom (DCSZ), to achieve a smooth zoom preview. The frame interpolation (FI) technique is a potential solution but struggles with ground-truth collection. To address this issue, we suggest a data-factory solution in which continuous virtual cameras are assembled to generate DCSZ data by rendering reconstructed 3D models of the scene. In particular, we propose a novel dual-camera smooth zoom Gaussian Splatting (ZoomGS), where a camera-specific encoding is introduced to construct a specific 3D model for each virtual camera. With the proposed data factory, we construct a synthetic dataset for DCSZ and utilize it to fine-tune FI models. In addition, we collect real-world dual-zoom images without ground truth for evaluation. Extensive experiments are conducted with multiple FI methods. The results show that the fine-tuned FI models achieve a significant performance improvement over the original ones on the DCSZ task. The datasets, codes, and pre-trained models will be publicly available.
6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model
Matteo Bortolon · Theodoros Tsesmelis · Stuart James · Fabio Poiesi · Alessio Del Bue
We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. 6DGS avoids the iterative process typical of analysis-by-synthesis methods (e.g. iNeRF), which also require an initialization of the camera pose to converge. Instead, our method estimates a 6DoF pose by inverting the 3DGS rendering process. Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each ellipsoid that parameterizes the 3DGS model. Each Ellicell ray is associated with the rendering parameters of its ellipsoid, which are in turn used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best-scoring bundle of rays, whose intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the necessity of an a priori pose for initialization, and it solves 6DoF pose estimation in closed form, without the need for iterations. Moreover, compared to the existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS can improve the overall average rotational accuracy by 12% and translation accuracy by 22% on real scenes, despite operating without pose initialization. At the same time, our method operates near real-time, reaching 15 fps on consumer hardware.
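For reference, the final closed-form step of recovering a camera center from a bundle of rays admits a standard least-squares construction. The NumPy sketch below is a generic illustration of that step only (the ray origins and directions are toy placeholders), not the authors' Ellicell casting or ranking pipeline.

import numpy as np

def least_squares_ray_intersection(origins, dirs):
    """Point minimizing the sum of squared distances to a set of 3D rays.

    origins: (N, 3) ray origins; dirs: (N, 3) direction vectors.
    Solves sum_i (I - d_i d_i^T) c = sum_i (I - d_i d_i^T) o_i for c.
    """
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, dirs):
        P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

# Toy usage: rays leaving surface points, all passing through one center.
center = np.array([0.5, -0.2, 2.0])
origins = np.random.randn(50, 3)
print(least_squares_ray_intersection(origins, center - origins))  # ~center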
SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
Mingrui Li · Shuhong Liu · Heng Zhou · Guohao Zhu · Na Cheng · Tianchen Deng · Hongyu Wang
We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities.
Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing
Jian Gao · Chun Gu · Youtian Lin · Zhihao Li · Hao Zhu · Xun Cao · Li Zhang · Yao Yao
In this paper, we present a novel differentiable point-based rendering framework to achieve photo-realistic relighting. To make the reconstructed scene relightable, we enhance vanilla 3D Gaussians by associating extra properties, including normal vectors, BRDF parameters, and incident lighting from various directions. From a collection of multi-view images, the 3D scene is optimized through 3D Gaussian Splatting while BRDF and lighting are decomposed by physically based differentiable rendering. To produce plausible shadow effects in real-time relighting, we introduce an innovative point-based ray tracing approach with bounding volume hierarchies for efficient visibility pre-computation. Extensive experiments demonstrate our improved BRDF estimation, novel view synthesis and relighting results compared to state-of-the-art approaches. The proposed framework showcases the potential to revolutionize the mesh-based graphics pipeline with a point-based pipeline enabling editing, tracing, and relighting.
Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians
Guangchi Fang · Bing Wang
In this study, we explore the challenge of efficiently representing scenes with a constrained number of Gaussians. Our analysis shifts from traditional graphics and 2D computer vision to the perspective of point clouds, highlighting the inefficient spatial distribution of Gaussian representation as a key limitation in model performance. To address this, we introduce strategies for densification including blur split and depth reinitialization, and simplification through Gaussian binarization and sampling. These techniques reorganize the spatial positions of the Gaussians, resulting in significant improvements across various datasets and benchmarks in terms of rendering quality, resource consumption, and storage compression. Our proposed Mini-Splatting method integrates seamlessly with the original rasterization pipeline, providing a strong baseline for future research in Gaussian-Splatting-based works.
CompGS: Smaller and Faster Gaussian Splatting with Vector Quantization
K L Navaneet · Kossar Pourahmadi · Soroush Abbasi Koohpayegani · Hamed Pirsiavash
3D Gaussian Splatting (3DGS) is a new method for modeling and rendering 3D radiance fields that achieves much faster learning and rendering time compared to SOTA NeRF methods. However, it comes with the drawback of a much larger storage demand compared to NeRF methods, since it needs to store the parameters of a large number of 3D Gaussians. We notice that many Gaussians may share similar parameters, so we introduce a simple vector quantization method based on the $k$-means algorithm to quantize the Gaussian parameters. Then, we store the small codebook along with the index of the code for each Gaussian. We compress the indices further by sorting them and using a method similar to run-length encoding. Moreover, we use a simple regularizer to encourage zero opacity (invisible Gaussians) to reduce the storage and rendering time by a large factor through reducing the number of Gaussians. We conduct extensive experiments on standard benchmarks as well as an existing 3D dataset that is an order of magnitude larger than the standard benchmarks used in this field. We show that our simple yet effective method can reduce the storage cost for 3DGS by 40x to 50x and rendering time by 2x to 3x with a very small drop in the quality of rendered images.
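As a rough illustration of the storage scheme described above (a $k$-means codebook, one code index per Gaussian, and run-length encoding of the sorted indices), the sketch below uses synthetic parameter vectors; the array sizes, feature layout, codebook size, and use of scikit-learn are assumptions for the example, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical attributes of N Gaussians: a D-dim vector each (e.g. covariance,
# color and opacity features concatenated); all sizes are illustrative only.
N, D, K = 5000, 16, 64
params = np.random.randn(N, D).astype(np.float32)

# 1) Small shared codebook via k-means; keep one code index per Gaussian.
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(params)
codebook = km.cluster_centers_            # (K, D), stored once
indices = km.labels_.astype(np.int32)     # (N,), one id per Gaussian

# 2) Compress the indices further: sort them (keeping the permutation so the
#    order can be undone) and run-length encode the repetitive sorted sequence.
order = np.argsort(indices)
sorted_idx = indices[order]
starts = np.concatenate(([0], np.flatnonzero(np.diff(sorted_idx)) + 1))
values = sorted_idx[starts]
run_lengths = np.diff(np.concatenate((starts, [len(sorted_idx)])))

# Decode: expand the runs, undo the sort, look up the codebook.
decoded = codebook[np.repeat(values, run_lengths)[np.argsort(order)]]
print("mean quantization error:", np.abs(decoded - params).mean())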
Segmentation-guided Layer-wise Image Vectorization with Gradient Fills
Hengyu Zhou · Hui Zhang · Bin Wang
The widespread use of vector graphics creates a significant demand for vectorization methods. While recent learning-based techniques have shown their capability to create vector images of clear topology, filling these primitives with gradients remains a challenge. In this paper, we propose a segmentation-guided vectorization framework to convert raster images into concise vector graphics with radial gradient fills. With the guidance of an embedded gradient-aware segmentation subroutine, our approach progressively appends gradient-filled Bézier paths to the output, where primitive parameters are initialized with our newly designed initialization technique and optimized to minimize our novel loss function. We build our method on a differentiable renderer with traditional segmentation algorithms to develop it as a model-free tool for raster-to-vector conversion. It is tested on various inputs to demonstrate its feasibility, independent of datasets, to synthesize vector graphics with improved visual quality and layer-wise topology compared to prior work.
EpipolarGAN: Omnidirectional Image Synthesis with Explicit Camera Control
Christopher May · Daniel Aliaga
In recent years, generative networks have achieved high quality results in 3D-aware image synthesis. However, most prior approaches focus on outside-in generation of a single object or face, as opposed to full inside-looking-out scenes. Those that do generate scenes typically require depth/pose information, or do not provide camera positioning control. We introduce EpipolarGAN, an omnidirectional Generative Adversarial Network for interior scene synthesis that does not need depth information, yet allows for direct control over the camera viewpoint. Rather than conditioning on an input position, we directly resample the input features to simulate a change of perspective. To reinforce consistency between viewpoints, we introduce an epipolar loss term that employs feature matching along epipolar arcs in the feature-rich intermediate layers of the network. We validate our results with comparisons to recent methods, and we formulate a generative reconstruction metric to evaluate multi-view consistency.
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
Chao Xu · Ang Li · Linghao Chen · Yulin Liu · Ruoxi Shi · Hao Su · Minghua Liu
Open-world 3D generation has recently attracted considerable attention. While many single-image-to-3D methods have yielded visually appealing outcomes, they often lack sufficient controllability and tend to produce hallucinated regions that may not align with users' expectations. In this paper, we explore an important scenario in which the input consists of one or a few unposed 2D images of a single object, with little or no overlap. We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for these sparse-view images. SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views. The diffusion model is trained to jointly predict surrogate representations for camera poses and multi-view images of the object under known poses, integrating all information from the input sparse views. These predictions are then leveraged to accomplish 3D reconstruction and pose estimation, and the reconstructed 3D model can be used to further refine the camera poses of input views. Through extensive experiments on three datasets, we demonstrate that our method not only significantly outperforms baseline methods in terms of 3D reconstruction quality and pose prediction accuracy but also exhibits strong efficiency. It requires only about 20 seconds to produce a textured mesh and camera poses for the input views.
GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation
Yinghao Xu · Zifan Shi · Wang Yifan · Hansheng Chen · Ceyuan Yang · Sida Peng · Yujun Shen · Gordon Wetzstein
We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s. GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information to translate the input pixels into pixel-aligned Gaussians, which are unprojected to create a set of densely distributed 3D Gaussians representing a scene. Together, our transformer architecture and the use of 3D Gaussians unlock a scalable and efficient reconstruction framework. Extensive experimental results demonstrate the superiority of our method over alternatives regarding both reconstruction quality and efficiency. We also showcase the potential of GRM in generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models. We will release the code and models to facilitate future research.
GenRC: Generative 3D Room Completion from Sparse Image Collections
Ming-Feng Li · Yueh-Feng Ku · Hong-Xuan Yen · Chi Liu · Yu-Lun Liu · Albert Y Chen · Cheng-Hao Kuo · Min Sun
Sparse RGBD scene completion is a challenging task, especially when considering consistent textures and geometries throughout the entire scene. Different from existing solutions that rely on human-designed text prompts or predefined camera trajectories, we propose GenRC, an automated training-free pipeline to complete a room-scale 3D mesh with high-fidelity textures. To achieve this, we first project the sparse RGBD images to a highly incomplete 3D mesh. Instead of iteratively generating novel views to fill in the void, we utilize our proposed E-Diffusion to generate a view-consistent panoramic RGBD image, which ensures global geometry and appearance consistency. Furthermore, we maintain the input-output scene stylistic consistency through textual inversion to replace human-designed text prompts. To bridge the domain gap among datasets, E-Diffusion leverages models trained on large-scale datasets to generate diverse appearances. GenRC outperforms state-of-the-art methods under most appearance and geometric metrics on the ScanNet and ARKitScenes datasets, even though GenRC is not trained on these datasets nor does it use predefined camera trajectories.
Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval
Aneeshan Sain · Pinaki Nath Chowdhury · Subhadeep Koley · Ayan Kumar Bhunia · Yi-Zhe Song
In this paper, we delve into the intricate dynamics of Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) by addressing a critical yet overlooked aspect -- the choice of viewpoint during sketch creation. Unlike photo systems that seamlessly handle diverse views through extensive datasets, sketch systems, with limited data collected from fixed perspectives, face challenges. Our pilot study, employing a pre-trained FG-SBIR model, highlights the system's struggle when query-sketches differ in viewpoint from target instances. Interestingly, a questionnaire shows users desire autonomy, with a significant percentage favouring view-specific retrieval. To reconcile this, we advocate for a view-aware system, seamlessly accommodating both view-agnostic and view-specific tasks. Overcoming dataset limitations, our first contribution leverages multi-view 2D projections of 3D objects, instilling cross-modal view awareness. The second contribution introduces a customisable cross-modal feature through disentanglement, allowing effortless mode switching.
Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees
Robin Kenis · Emanuel Laude · Panagiotis Patrinos
While neural network models have garnered significant attention in the imaging community, their application remains limited in important settings where optimality certificates are required or in the absence of extensive datasets. In such cases, classical models like (continuous) Markov Random Fields (MRFs) remain preferable. However, the associated optimization problem is nonconvex, and therefore very challenging to solve globally. This difficulty is further exacerbated in the case of nonconvex state spaces, such as the unit sphere. To address this, we propose a convex Semidefinite Programming (SDP) relaxation to provide lower bounds for these optimization challenges. Our relaxation provably approximates a certain infinite-dimensional convex lifting in measure spaces. Notably, our approach furnishes a certificate of (near) optimality when the relaxation (closely) approximates the unlifted problem. Our experiments show that our relaxation outperforms popular linear relaxations for many interesting problems.
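To give a concrete flavour of this kind of lifting, the sketch below solves a classic finite-dimensional analogue in cvxpy: a binary-label quadratic (MAX-CUT-style) energy relaxed to an SDP whose optimum certifies a lower bound. This is only an analogy under simplified assumptions; the paper addresses manifold-valued state spaces and an infinite-dimensional lifting in measure spaces, which this toy problem does not capture.

import numpy as np
import cvxpy as cp

# Nonconvex problem: minimize x^T Q x over x_i in {-1, +1} (tiny binary MRF).
# Lifting X = x x^T and relaxing to any PSD matrix with unit diagonal gives an
# SDP whose optimal value is a certified lower bound on the nonconvex optimum.
np.random.seed(0)
n = 8
Q = np.random.randn(n, n)
Q = 0.5 * (Q + Q.T)

X = cp.Variable((n, n), PSD=True)
prob = cp.Problem(cp.Minimize(cp.trace(Q @ X)), [cp.diag(X) == 1])
lower_bound = prob.solve()

# Brute force the true optimum at this toy size to check the bound's tightness.
signs = (np.array([(i >> k) & 1 for k in range(n)]) * 2 - 1 for i in range(2 ** n))
best = min(x @ Q @ x for x in signs)
print(f"SDP lower bound {lower_bound:.3f} <= true optimum {best:.3f}")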
DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing
Minghao Chen · Iro Laina · Andrea Vedaldi
We consider the problem of editing 3D objects and scenes based on open-ended language instructions. The established paradigm to solve this problem is to use a 2D image generator or editor to guide the 3D editing process. However, this is often slow, as it requires updating a computationally expensive 3D representation, such as a neural radiance field, and doing so with contradictory guidance from a 2D model which is inherently not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two ways. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. We do so by utilizing a training-free approach which integrates cues from the underlying 3D geometry of the scene. Second, given a multi-view consistent edited sequence of images of the object, we directly and efficiently optimize the 3D object representation, which is based on 3D Gaussian Splatting. Because it does not need to apply edits incrementally and iteratively, DGE is significantly more efficient than existing approaches, and comes with other perks such as allowing selective editing of parts.
Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting
Ri-Zhao Qiu · Ge Yang · Weijia Zeng · Xiaolong Wang
Scene representations using 3D Gaussian primitives have produced excellent results in modeling the appearance of static and dynamic 3D scenes. Many graphics applications, however, demand the ability to manipulate both the appearance and the physical properties of objects. We introduce Feature Splatting, an approach that unifies physics-based dynamic scene synthesis with rich semantics from vision language foundation models that are grounded by natural language. Our first contribution is a way to distill high-quality, object-centric vision-language features into 3D Gaussians, which enables semi-automatic scene decomposition using text queries. Our second contribution is a way to synthesize physics-based dynamics from an otherwise static scene using a particle-based simulator, in which material properties are assigned automatically via text queries. We ablate key techniques used in this pipeline to illustrate the challenges and opportunities in using feature-carrying 3D Gaussians as a unified format for appearance, geometry, material properties and semantics grounded in natural language.
GVGEN: Text-to-3D Generation with Volumetric Representation
Xianglong He · Junyi Chen · Sida Peng · Di Huang · Yangguang Li · Xiaoshui Huang · Chun Yuan · Wanli Ouyang · Tong He
In recent years, 3D Gaussian splatting has emerged as a powerful technique for 3D reconstruction and generation, known for its fast and high-quality rendering capabilities. However, directly generating 3D Gaussians from text remains challenging because the points are disorganized and lack structure. To address these shortcomings, this paper introduces a novel diffusion-based framework, GVGEN, designed to efficiently generate 3D Gaussian representations from text input. We propose two innovative techniques: (1) \textit{Structured Volumetric Representation}. We first arrange disorganized 3D Gaussian points into a structured form, GaussianVolume. This transformation allows the capture of intricate texture details within a volume composed of a fixed number of Gaussians. To better optimize the representation of these details, we propose a unique pruning and densifying method named the Candidate Pool Strategy, enhancing detail fidelity through selective optimization. (2) \textit{Coarse-to-fine Generation Pipeline}. To simplify the generation of GaussianVolume and empower the model to generate instances with detailed 3D geometry, we propose a coarse-to-fine pipeline. It initially constructs a basic geometric structure, followed by the prediction of complete Gaussian attributes. Our framework, GVGEN, demonstrates superior performance in qualitative and quantitative assessments compared to existing 3D generation methods. Simultaneously, it maintains a fast generation speed ($\sim$7 seconds), effectively striking a balance between quality and efficiency.
VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation
Wenjie Zhuo · Fan Ma · Hehe Fan · Yi Yang
This paper presents Invariant Score Distillation (ISD), a novel method for high-fidelity text-to-3D generation. ISD aims to tackle the over-saturation and over-smoothing problems in Score Distillation Sampling (SDS). In this paper, SDS is decoupled into a weighted sum of two components: the reconstruction term and the classifier-free guidance term. We experimentally find that over-saturation stems from the large classifier-free guidance scale and over-smoothing comes from the reconstruction term. To overcome these problems, ISD utilizes an interval sampling term derived from DDIM sampling to replace the reconstruction term in SDS. This operation allows the use of a medium classifier-free guidance scale and mitigates reconstruction-related errors, thus preventing the over-smoothing and over-saturation of results. Extensive experiments demonstrate that our method greatly enhances SDS and produces realistic 3D objects through single-stage optimization.
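For readers unfamiliar with the decomposition mentioned above, one common way to write the SDS gradient with classifier-free guidance is given below; the exact grouping and weighting used in the paper may differ, so treat this as standard background notation rather than the authors' derivation.
\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
= \mathbb{E}_{t,\epsilon}\Big[\, w(t)\,\big(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\big)\,\tfrac{\partial x_t}{\partial \theta} \Big],
\qquad
\hat{\epsilon}_\phi(x_t; y, t) = \epsilon_\phi(x_t; \varnothing, t) + s\,\big(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\big),
\]
so that
\[
\hat{\epsilon}_\phi(x_t; y, t) - \epsilon
= \underbrace{\big(\epsilon_\phi(x_t; y, t) - \epsilon\big)}_{\text{reconstruction term}}
+ \underbrace{(s-1)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\big)}_{\text{classifier-free guidance term}} .
\]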
DreamReward: Aligning Human Preference in Text-to-3D Generation
Junliang Ye · Fangfu Liu · Qixiu Li · Zhengyi Wang · Yikai Wang · Xinzhou Wang · Yueqi Duan · Jun Zhu
3D content creation from text prompts has shown remarkable success recently. However, current text-to-3D methods often generate 3D results that do not align well with human preferences. In this paper, we present a comprehensive framework, coined DreamReward, to learn and improve text-to-3D models from human preference feedback. To begin with, we collect 25k expert comparisons based on a systematic annotation pipeline including rating and ranking. Then, we build Reward3D, the first general-purpose text-to-3D human preference reward model, to effectively encode human preferences. Building upon the 3D reward model, we finally perform theoretical analysis and present Reward3D Feedback Learning (DreamFL), a direct tuning algorithm to optimize multi-view diffusion models with a redefined scorer. Grounded in theoretical proof and extensive experimental comparisons, our DreamReward successfully generates high-fidelity and 3D-consistent results with significant boosts in prompt alignment with human intention. Our results demonstrate the great potential of learning from human feedback to improve text-to-3D models.
SemanticHuman-HD: High Resolution Semantic disentangled 3D Human Generation
Peng Zheng · Tao Liu · Zili Yi · Rui Ma
With the development of neural radiance fields and generative models, numerous methods have been proposed for learning 3D human generation from 2D images. These methods allow control over the pose of the generated 3D human and enable rendering from different viewpoints. However, none of these methods explore semantic disentanglement in human image synthesis, i.e., they cannot disentangle the generation of different semantic parts, such as the body, tops, and bottoms. Furthermore, existing methods are limited to synthesizing images at $512^2$ resolution due to the high computational cost of neural radiance fields. To address these limitations, we introduce SemanticHuman-HD, the first method to achieve semantic disentangled human image synthesis. Notably, SemanticHuman-HD is also the first method to achieve 3D-aware image synthesis at $1024^2$ resolution, benefiting from our proposed 3D-aware super-resolution module. By leveraging the depth maps and semantic masks as guidance for the 3D-aware super-resolution, we significantly reduce the number of sampling points during volume rendering, thereby reducing the computational cost. Our comparative experiments demonstrate the superiority of our method. The effectiveness of each proposed component is also verified through ablation studies. Moreover, our method opens up exciting possibilities for various applications, including 3D garment generation, semantic-aware image synthesis, controllable image synthesis, and out-of-domain image synthesis. We will release code and models.
Disentangled Clothed Avatar Generation from Text Descriptions
Jionghao Wang · Yuan Liu · Zhiyang Dou · Zhengming Yu · Yongqing Liang · Cheng Lin · Rong Xie · Li Song · Xin Li · Wenping Wang
In this paper, we introduce a novel text-to-avatar generation method that separately generates the human body and the clothes and allows high-quality animation on the generated avatar. While recent advancements in text-to-avatar generation have yielded diverse human avatars from text prompts, these methods typically combine all elements—clothes, hair, and body—into a single 3D representation. Such an entangled approach poses challenges for downstream tasks like editing or animation. To overcome these limitations, we propose a novel disentangled 3D avatar representation named Sequentially Offset-SMPL (SO-SMPL), building upon the SMPL model. SO-SMPL represents the human body and clothes with two separate meshes but associates them with offsets to ensure the physical alignment between the body and the clothes. Then, we design a Score Distillation Sampling (SDS)-based distillation framework to generate the proposed SO-SMPL representation from text prompts. In comparison with existing text-to-avatar methods, our approach not only achieves higher texture and geometry quality and better semantic alignment with text prompts, but also significantly improves the visual quality of character animation, virtual try-on, and avatar editing. We further show that our method also allows for generating complex garments, such as multi-layered clothes and skirts.
StructLDM: Structured Latent Diffusion for 3D Human Generation
Tao Hu · Fangzhou Hong · Ziwei Liu
Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore more expressive and higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model, which is learned from 2D images. StructLDM solves the challenges imposed due to the high-dimensional growth of latent space with three key designs: 1) A semantic structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts parameterized by a set of conditional structured local NeRFs anchored to the body template, which embeds the properties learned from the 2D training data and can be decoded to render view-consistent humans under different poses and clothing styles. 3) A structured latent diffusion model for generative human appearance sampling. Extensive experiments validate StructLDM's state-of-the-art generation performance and illustrate the expressiveness of the structured latent space over the well-adopted 1D latent space. Notably, StructLDM enables different levels of controllable 3D human generation and editing, including pose/view/shape control, and high-level tasks including compositional generations, part-aware clothing editing, 3D virtual try-on, etc.
High-Fidelity Modeling of Generalizable Wrinkle Deformation
Jingfan Guo · Jae Shin Yoon · Shunsuke Saito · Takaaki Shiratori · Hyun Soo Park
This paper proposes a generalizable model to synthesize high-fidelity clothing wrinkle deformation in 3D by learning from real data. Given the complex deformation behaviors of real-world clothing, this task presents significant challenges, primarily due to the lack of accurate ground-truth data. Obtaining high-fidelity 3D deformations requires special equipment like a multi-camera system, which is not easily scalable. To address this challenge, we decompose the clothing into a base surface and fine wrinkles, and introduce a new method that can generate wrinkles as high-frequency 3D displacement from coarse clothing deformation. Our method is conditioned on the Green-Lagrange strain field, a local rotation-invariant measurement that is independent of body and clothing topology, enhancing its generalizability. Using limited real data (e.g., 3K) of a garment, we train a diffusion model that can generate high-fidelity wrinkles from a coarse clothing mesh, conditioned on its strain field. Practically, we obtain the coarse clothing mesh using a body-conditioned VAE, ensuring compatibility of the deformation with the body pose. In our experiments, we demonstrate that our generative wrinkle model outperforms existing methods by synthesizing high-fidelity wrinkle deformation from novel body poses and clothing while preserving quality comparable to that of the training data.
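The Green-Lagrange strain mentioned above is a standard rotation-invariant deformation measure, E = 0.5 (F^T F - I). The sketch below computes it for a single deformed surface triangle; the per-triangle setup and frame construction are illustrative assumptions, not the paper's conditioning pipeline.

import numpy as np

def triangle_green_lagrange_strain(rest_tri, def_tri):
    """2x2 Green-Lagrange strain E = 0.5 (F^T F - I) of one surface triangle.

    rest_tri, def_tri: (3, 3) vertex positions in rest and deformed states.
    F is the 3x2 deformation gradient from a 2D material frame spanned by the
    rest triangle to the deformed triangle; E is invariant to rigid rotations.
    """
    r1, r2 = rest_tri[1] - rest_tri[0], rest_tri[2] - rest_tri[0]
    d1, d2 = def_tri[1] - def_tri[0], def_tri[2] - def_tri[0]

    e1 = r1 / np.linalg.norm(r1)                       # in-plane material frame
    n = np.cross(r1, r2)
    e2 = np.cross(n / np.linalg.norm(n), e1)

    Dm = np.array([[r1 @ e1, r2 @ e1],
                   [r1 @ e2, r2 @ e2]])                # rest edges, 2D coords
    Ds = np.stack([d1, d2], axis=1)                    # deformed edges, 3x2
    F = Ds @ np.linalg.inv(Dm)
    return 0.5 * (F.T @ F - np.eye(2))

rest = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
stretched = rest.copy()
stretched[:, 0] *= 1.2                                 # 20% stretch along x
print(triangle_green_lagrange_strain(rest, stretched)) # E_xx ~ 0.22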
ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild
Chen Guo · Tianjian Jiang · Manuel Kaufmann · Chengwei Zheng · Julien Valentin · Jie Song · Otmar Hilliges
While previous years have seen great progress in the 3D reconstruction of humans from monocular videos, few of the state-of-the-art methods are able to handle loose garments that exhibit large non-rigid surface deformations during articulation. This limits the application of such methods to humans that are dressed in standard pants or T-shirts. We present ReLoo, a novel method that overcomes this limitation and reconstructs high-quality 3D models of humans dressed in loose garments from monocular in-the-wild videos. To tackle this problem, we first establish a layered neural human representation that decomposes clothed humans into a neural inner body and outer clothing. On top of the layered neural representation, we further introduce a non-hierarchical virtual bone deformation module for the clothing layer that can freely move, which allows the accurate recovery of non-rigidly deforming loose clothing. A global optimization is formulated that jointly optimizes the shape, appearance, and deformations of both the human body and clothing over the entire sequence via multi-layer differentiable volume rendering. To evaluate ReLoo, we record subjects with dynamically deforming garments in a multi-view capture studio. The evaluation of our method, both on existing datasets and on our novel dataset, demonstrates its clear superiority over prior art on both indoor datasets and in-the-wild videos.
Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos
Subin Jeon · In Cho · Minsu Kim · Woong Oh Cho · Seon Joo Kim
We propose a new framework for creating and easily manipulating 3D models of arbitrary objects using casually captured videos. Our core ingredient is a novel hierarchical deformation model, which captures motions of objects with tree-structured bones. Our hierarchy system decomposes motions based on their granularity and reveals the correlations between parts without exploiting any prior structural knowledge. We further propose to regularize the bones so that they are positioned at the basis of motions and at the centers of parts, sufficiently covering the related surfaces of each part. This is achieved by our bone occupancy function, which identifies whether a given 3D point is placed within the bone. Coupling the proposed components, our framework offers several clear advantages: (1) users can obtain animatable 3D models of arbitrary objects with improved quality from their casual videos, (2) users can manipulate 3D models in an intuitive manner with minimal cost, and (3) users can interactively add or delete control points as necessary. The experimental results demonstrate the efficacy of our framework on diverse instances, in terms of reconstruction quality, interpretability, and ease of manipulation.
Physics-Based Interaction with 3D Objects via Video Generation
Tianyuan Zhang · Hong-Xing Yu · Rundi Wu · Brandon Y Feng · Changxi Zheng · Noah Snavely · Jiajun Wu · William Freeman
Realistic object interactions are crucial for creating immersive virtual experiences, yet synthesizing realistic 3D object dynamics in response to novel interactions remains a significant challenge. Unlike unconditional or text-conditioned dynamics generation, action-conditioned dynamics requires perceiving the physical material properties of objects and grounding the 3D motion prediction on these properties, such as object stiffness. However, estimating physical material properties is an open problem due to the lack of material ground-truth data, as measuring these properties for real objects is highly difficult. We present PhysDreamer, a physics-based approach that endows static 3D objects with interactive dynamics by leveraging the object dynamics priors learned by video generation models. By distilling these priors, PhysDreamer enables the synthesis of realistic object responses to novel interactions, such as external forces or agent manipulations. We demonstrate our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study. PhysDreamer takes a step towards more engaging and realistic virtual experiences by enabling static 3D objects to dynamically respond to interactive stimuli in a physically plausible manner.
Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder
Jiajie Fan · Amal Trigui · Thomas Bäck · Hao Wang
Great interest has arisen in using Deep Generative Models (DGMs) for generative design. When assessing the quality of the generated designs, human designers focus more on structural plausibility, e.g., no missing component, rather than visual artifacts, e.g., noise in the images. Meanwhile, commonly used metrics such as the Fréchet Inception Distance (FID) may not evaluate accurately, as they tend to penalize visual artifacts instead of structural implausibility. As such, FID might not be suitable for assessing the performance of DGMs on a generative design task. In this work, we propose to encode the input designs with a simple Denoising Autoencoder (DAE) and measure the distribution distance in its latent space. We experimentally test our DAE-based metrics against FID and other state-of-the-art metrics on three data sets: compared to FID and some more recent works, e.g., FD (DINOv2) and topology distance, DAE-based metrics can effectively detect implausible structures and are more consistent with structural inspection by human experts.
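A minimal sketch of the kind of latent-space distribution distance described above: fit Gaussians to two sets of latent codes and compute the Fréchet distance between them. The random arrays stand in for DAE bottleneck features of real and generated designs; the encoder itself and any preprocessing are omitted assumptions.

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(codes_a, codes_b):
    """Frechet distance between Gaussians fitted to two sets of latent codes."""
    mu_a, mu_b = codes_a.mean(0), codes_b.mean(0)
    cov_a = np.cov(codes_a, rowvar=False)
    cov_b = np.cov(codes_b, rowvar=False)
    cov_mean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(cov_mean):          # drop tiny imaginary residue
        cov_mean = cov_mean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_mean))

# Placeholder latent codes: identical distributions give ~0, shifted ones do not.
rng = np.random.default_rng(0)
real_codes = rng.normal(size=(2000, 32))
gen_codes = rng.normal(loc=0.5, size=(2000, 32))
print(frechet_distance(real_codes, real_codes.copy()),
      frechet_distance(real_codes, gen_codes))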
Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors
Jae Joong Lee · Bosheng Li · Sara Beery · Jonathan Huang · Songlin Fei · Raymond Yeh · Bedrich Benes
We introduce Tree-D Fusion, featuring the first collection of 600,000 environmentally aware, 3D simulation-ready tree models generated through Diffusion priors. Each reconstructed 3D tree model corresponds to an image from Google's Auto Arborist Dataset, comprising street view images and associated genus labels of trees across North America. Our method distills the scores of two tree-adapted diffusion models by utilizing text prompts to specify a tree genus, thus facilitating shape reconstruction. This process involves reconstructing a 3D tree envelope filled with point markers, which are subsequently utilized to estimate the tree's branching structure using the space colonization algorithm conditioned on a specified genus.
Self-supervised Shape Completion via Involution and Implicit Correspondences
Mengya Liu · Ajad Chhatkuli · Janis Postels · Luc Van Gool · Federico Tombari
3D shape completion is traditionally solved using supervised training or by distribution learning on complete shape examples. Recently, self-supervised learning approaches that do not require any complete 3D shape examples have gained more interest. In this paper, we propose a non-adversarial self-supervised approach for the shape completion task. Our first finding is that completion problems can trivially be formulated in terms of an involutory function, which implies a special constraint on the completion function $f$, such that $f \circ f(x) = x$. Our second constraint on self-supervised shape completion relies on the fact that shape completion becomes easier to solve with correspondences and, similarly, completion can simplify the correspondence problem. We formulate a consistency measure in the canonical space in order to supervise the completion function. We efficiently optimize the completion and correspondence modules using a ``freeze and alternate'' strategy. The overall approach performs well for rigid shapes in a category as well as dynamic non-rigid shapes. We ablate our design choices and compare our solution against state-of-the-art methods, showing remarkable accuracy, approaching supervised accuracy in some cases.
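The involution constraint $f \circ f(x) = x$ can be turned into a loss directly. The PyTorch sketch below does so with a toy per-point network and a plain MSE penalty; both the architecture and the choice of MSE (rather than a shape distance) are simplifying assumptions, not the paper's design.

import torch
import torch.nn as nn

# Toy stand-in for a completion network f: maps a (B, N, 3) point cloud to
# another (B, N, 3) cloud via per-point offsets (not the paper's model).
class ToyCompletion(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, pts):
        return pts + self.mlp(pts)

f = ToyCompletion()
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
partial = torch.randn(8, 256, 3)            # hypothetical partial shapes

# Involution constraint: applying the completion function twice should return
# the original input, i.e. f(f(x)) = x, penalized here with a simple MSE.
opt.zero_grad()
loss_involution = ((f(f(partial)) - partial) ** 2).mean()
loss_involution.backward()
opt.step()
print(float(loss_involution))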
Self-Training Room Layout via Geometry-aware Ray-casting
Bolivar Solarte · Chin-Hsuan Wu · Jin-Cheng Jhang · Jonathan Lee · Yi-Hsuan Tsai · Min Sun
In this paper, we present a novel geometry-aware pseudo-labeling framework that exploits the multi-view layout consistency of noisy estimates for self-training room layout estimation models on unseen scenes. In particular, our approach leverages a ray-casting formulation to aggregate and sample multiple estimates by considering their geometric consistency and camera proximity. As a result, our pseudo-labels can effectively leverage unseen scenes with different environmental conditions, complex room geometries, and different architectural styles without any label annotation. Results on publicly available datasets show the effectiveness of our contributions, yielding a substantial improvement over current state-of-the-art layout estimation models.
DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting
Linus Härenstam-Nielsen · Lu Sang · Abhishek Saroha · Nikita Araslanov · Daniel Cremers
Fitting neural implicit surfaces to point clouds is typically done by encouraging the network output to equal zero on the point cloud. Yet, since the underlying shape metric is not symmetric, previous methods are susceptible to spurious surfaces. We theoretically analyze the predominant approach for dealing with spurious surfaces, and show that it is equivalent to regularizing the surface area, leading to over-smoothing. To address these shortcomings, we propose a novel loss function corresponding to the symmetric Chamfer distance. It ensures both that the points are near the surface and that the surface is near the points. Our approach reliably recovers a high level of shape detail and eliminates spurious surfaces without the need for additional regularization. To make our approach more practical, we further propose an efficient method for uniformly sampling point batches from the implicit surface. The full implementation of our method and experiments is provided in the supplemental material and will be publicly released upon acceptance.
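For intuition, the classical symmetric Chamfer distance between two point sets is sketched below, with a toy case showing why a one-sided variant misses spurious geometry. The paper's contribution is adapting this symmetry to a neural implicit surface, which the sketch does not implement.

import numpy as np

def symmetric_chamfer(points_a, points_b):
    """Symmetric Chamfer distance between point sets of shape (N, 3), (M, 3)."""
    d2 = ((points_a[:, None, :] - points_b[None, :, :]) ** 2).sum(-1)
    a_to_b = np.sqrt(d2.min(axis=1)).mean()   # every a-point near some b-point
    b_to_a = np.sqrt(d2.min(axis=0)).mean()   # every b-point near some a-point
    return a_to_b + b_to_a

# A one-sided distance ignores spurious geometry: extra far-away points in B
# barely change the A->B term but blow up B->A, which the symmetric form keeps.
rng = np.random.default_rng(0)
a = rng.random((500, 3))
b = np.concatenate([a + 1e-3, np.full((50, 3), 10.0)])  # B has a spurious blob
print(symmetric_chamfer(a, a.copy()), symmetric_chamfer(a, b))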
GaussReg: Fast 3D Registration with Gaussian Splatting
Jiahao Chang · Yinglin Xu · Yihao Li · Yuantao Chen · Wensen Feng · Xiaoguang Han
Point cloud registration is a fundamental problem for large-scale 3D scene scanning and reconstruction. With the help of deep learning, registration methods have evolved significantly, reaching a nearly mature stage. Since the introduction of Neural Radiance Fields (NeRF), it has become the most popular 3D scene representation due to its powerful view synthesis capabilities. For the NeRF representation, registration is also required for large-scale scene reconstruction. However, this topic remains largely unexplored, due to the inherent challenge of modeling the geometric relationship between two scenes with implicit representations. Existing methods usually convert the implicit representation to an explicit one for further registration. Most recently, Gaussian Splatting (GS) was introduced, employing explicit 3D Gaussians; it significantly enhances rendering speed while maintaining high rendering quality. Given two scenes with explicit GS representations, in this work, we explore the 3D registration task between them. To this end, we propose GaussReg, a novel coarse-to-fine framework that is both fast and accurate. The coarse stage follows existing point cloud registration methods and estimates a rough alignment for point clouds from GS. We further present an image-guided fine registration approach, which renders images from GS to provide more detailed geometric information for precise alignment. To support comprehensive evaluation, we carefully build a scene-level dataset called ScanNet-GSReg with 1379 scenes obtained from the ScanNet dataset and collect an in-the-wild dataset called GSReg. Experimental results demonstrate that our method achieves state-of-the-art performance on multiple datasets. Our GaussReg is 44× faster than HLoc (SuperPoint as the feature extractor and SuperGlue as the matcher) with comparable accuracy.
AEDNet: Adaptive Embedding and Multiview-Aware Disentanglement for Point Cloud Completion
Zhiheng Fu · Longguang Wang · Lian Xu · Zhiyong Wang · Hamid Laga · Yulan Guo · Farid Boussaid · Mohammed Bennamoun
Point cloud completion involves inferring missing parts of 3D objects from incomplete point cloud data. It requires a model that understands the global structure of the object and reconstructs local details. To this end, we propose a global perception and local attention network, termed AEDNet, for point cloud completion. The proposed AEDNet utilizes the designed adaptive point cloud embedding and disentanglement (AED) module in both the encoder and decoder to globally embed and locally disentangle the given point cloud. In the AED module, we introduce a global embedding operator that employs the devised slot attention to compose point clouds into different embeddings, each focusing on specific parts of 3D objects. Then, we propose a multiview-aware disentanglement operator to disentangle geometric information from those embeddings in the 3D viewpoints generated on a unit sphere. These 3D viewpoints enable us to observe point clouds from the outside rather than from within, resulting in a comprehensive understanding of their geometry. Additionally, an arbitrary number of points and point-wise features can be disentangled by changing the number of viewpoints, providing high flexibility. Experiments show that our proposed method achieves state-of-the-art results on both the MVP and PCN datasets.
PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration
Runzhao Yao · Shaoyi Du · Wenting Cui · Canhui Tang · Chengwu Yang
Learning rotation-invariant distinctive features is a fundamental requirement for point cloud registration. Existing methods often use rotation-sensitive networks to extract features, while employing rotation augmentation to crudely learn an approximate invariant mapping. This makes the networks fragile to rotations, unnecessarily heavy, and hinders the distinctiveness of features. To tackle these problems, we propose a novel position-aware rotation-equivariant network for efficient, lightweight, and robust registration. The network provides a strong model inductive bias for learning rotation-equivariant/invariant features, thus addressing the aforementioned limitations. To further improve the distinctiveness of descriptors, we propose a position-aware convolution, which can better learn the spatial information of local structures. Moreover, we also propose a feature-based hypothesis proposer. It leverages rotation-equivariant features that encode fine-grained structure orientations to generate reliable model hypotheses. Each correspondence can generate a hypothesis, making it more efficient than classic estimators that require multiple reliable correspondences. Accordingly, a contrastive rotation loss is presented to enhance the robustness of rotation-equivariant features against data degradation. Extensive experiments on indoor and outdoor datasets demonstrate that our method significantly outperforms SOTA methods in terms of registration recall while remaining lightweight and fast. Moreover, experiments on rotated datasets demonstrate its robustness against rotation variations. All codes will be available.
DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding
Jincen Jiang · Qianyu Zhou · Yuhang Li · Xuequan Lu · Meili Wang · Lizhuang Ma · Jian Chang · Jian Jun Zhang
Recent point cloud understanding research suffers from performance drops on unseen data, due to the distribution shifts across different domains. While recent studies use Domain Generalization (DG) techniques to mitigate this by learning domain-invariant features, most are designed for a single task and neglect the potential of testing data. Although In-Context Learning (ICL) showcases multi-task learning capability, it usually relies on high-quality, context-rich data, considers only a single dataset, and has rarely been studied in point cloud understanding. In this paper, we introduce a novel, practical, multi-domain multi-task setting, handling multiple domains and multiple tasks within one unified model for domain generalized point cloud understanding. To this end, we propose Domain Generalized Point-In-Context Learning (DG-PIC), which boosts the generalization ability across various tasks and domains at testing time. In particular, we develop dual-level source prototype estimation that considers both global-level shape context and local-level geometric structures for representing source domains, and a dual-level test-time feature shifting mechanism that leverages both macro-level domain semantic information and micro-level patch positional relationships to pull the target data closer to the source data during testing. Our DG-PIC does not require any model updates during testing and can handle unseen domains and multiple tasks, i.e., point cloud reconstruction, denoising, and registration, within one unified model. We also introduce a benchmark for this new setting. Comprehensive experiments demonstrate that DG-PIC outperforms state-of-the-art techniques significantly. Our code and benchmark are available at: https://github.com/Jinec98/DG-PIC.
ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention
Chenhang He · Ruihuang Li · Guowen Zhang · Yabin Zhang
Window-based transformers excel in large-scale point cloud understanding by capturing context-aware representations with affordable attention computation in a more localized manner. However, the sparse nature of point clouds leads to a significant variance in the number of voxels per window. Existing methods group the voxels in each window into fixed-length sequences through extensive sorting and padding operations, resulting in a non-negligible computational and memory overhead. In this paper, we introduce ScatterFormer, which, to the best of our knowledge, is the first to directly apply attention to voxels across different windows as a single sequence. The key to ScatterFormer is a Scattered Linear Attention (SLA) module, which leverages the pre-computation of key-value pairs in linear attention to enable parallel computation on the variable-length voxel sequences divided by windows. Leveraging the hierarchical structure of GPUs and shared memory, we propose a chunk-wise algorithm that reduces the SLA module's latency to under 1 millisecond on moderate GPUs. Furthermore, we develop a cross-window interaction module that improves the locality and connectivity of voxel features across different windows, eliminating the need for extensive window shifting. Our proposed ScatterFormer demonstrates 73 mAP (L2) on the large-scale Waymo Open Dataset and 70.5 NDS on the NuScenes dataset, running at an outstanding detection rate of 28 FPS.
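A minimal PyTorch sketch of linear attention restricted to variable-length windows via scatter-style segment sums is given below, with no sorting or padding. It only illustrates the idea; the feature map, projections, and especially the chunk-wise GPU kernel described in the paper are assumptions of this toy version, not the authors' implementation.

import torch
import torch.nn.functional as F

def scattered_linear_attention(q, k, v, win_id, eps=1e-6):
    """Linear attention computed independently per variable-length window.

    q, k, v: (N, d) per-voxel projections; win_id: (N,) int64 window index.
    Each window w uses S_w = sum_j phi(k_j) v_j^T and z_w = sum_j phi(k_j),
    and out_i = phi(q_i) S_w / (phi(q_i) . z_w).
    """
    n, d = q.shape
    w = int(win_id.max().item()) + 1
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1              # positive feature map

    kv = torch.zeros(w, d, d, device=q.device)
    kv.index_add_(0, win_id, phi_k.unsqueeze(2) * v.unsqueeze(1))
    z = torch.zeros(w, d, device=q.device)
    z.index_add_(0, win_id, phi_k)

    num = torch.einsum('nd,nde->ne', phi_q, kv[win_id])    # phi(q_i)^T S_w
    den = (phi_q * z[win_id]).sum(-1, keepdim=True) + eps
    return num / den

# Three windows with 5, 2, and 9 voxels respectively, handled as one sequence.
win_id = torch.tensor([0] * 5 + [1] * 2 + [2] * 9)
q, k, v = torch.randn(16, 32), torch.randn(16, 32), torch.randn(16, 32)
print(scattered_linear_attention(q, k, v, win_id).shape)   # torch.Size([16, 32])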
SFPNet: Sparse Focal Point Network for Semantic Segmentation on General LiDAR Point Clouds
Yanbo Wang · Wentao Zhao · Cao Chuan · Tianchen Deng · Jingchuan Wang · Weidong Chen
Although LiDAR semantic segmentation advances rapidly, state-of-the-art methods often incorporate specifically designed inductive biases derived from benchmarks originating from mechanical spinning LiDAR. This can limit model generalizability to other kinds of LiDAR technologies and make hyperparameter tuning more complex. To tackle these issues, we propose a generalized framework to accommodate the various types of LiDAR prevalent in the market by replacing window attention with our sparse focal point modulation. Our SFPNet is capable of extracting multi-level contexts and dynamically aggregating them using a gate mechanism. By implementing a channel-wise information query, features that incorporate both local and global contexts are encoded. We also introduce a novel large-scale hybrid-solid LiDAR semantic segmentation dataset for robotic applications. SFPNet demonstrates competitive performance on conventional benchmarks derived from mechanical spinning LiDAR, while achieving state-of-the-art results on a benchmark derived from solid-state LiDAR. Additionally, it outperforms existing methods on our novel dataset sourced from hybrid-solid LiDAR. Code and dataset are available at https://github.com/Cavendish518/SFPNet and https://www.semanticindustry.top.
MAD-DR: Map Compression for Visual Localization with Matchness Aware Descriptor Dimension Reduction
Qiang Wang
3D-structure-based methods remain the top-performing solution for long-term visual localization tasks. However, the dimension of existing local descriptors is usually high, and the resulting maps take up huge storage space, especially for large-scale scenes. We propose a novel asymmetric framework which learns to reduce the dimension of local descriptors and match them jointly. We can compress existing local descriptors to 1/128 of their original size while maintaining high matching performance. Experiments on several public visual localization datasets show that our pipeline obtains better results than existing map compression methods and non-structure-based alternatives.
Tensorial template matching for fast cross-correlation with rotations and its application for tomography
Antonio Martinez-Sanchez · Ulrike Homberg · J. M. Almira · Harold Phelippeau
Object detection is a main task in computer vision. Template matching is the reference method for detecting objects with arbitrary templates. However, the computational complexity of template matching depends on the rotation accuracy, which is a limiting factor for large 3D images (tomograms). Here, we implement a new algorithm called tensorial template matching, based on a mathematical framework that represents all rotations of a template with a tensor field. Contrary to standard template matching, the computational complexity of the presented algorithm is independent of the rotation accuracy. Using both synthetic and real data from tomography, we demonstrate that tensorial template matching is much faster than template matching and has the potential to improve its accuracy.
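To make the rotation-accuracy bottleneck concrete, the sketch below runs standard template matching as an FFT cross-correlation repeated once per sampled rotation (in 2D for brevity); its cost grows linearly with the number of angles, which is exactly the dependence the tensorial formulation removes. This is baseline code for illustration only, not the authors' algorithm.

import numpy as np
from scipy.ndimage import rotate
from scipy.signal import fftconvolve

def standard_template_matching(image, template, angles_deg):
    """Best correlation score, location and angle over a grid of rotations.

    Runtime grows linearly with the number of sampled angles, which is the
    rotation-accuracy bottleneck of standard template matching.
    """
    best = (-np.inf, None, None)
    for angle in angles_deg:
        t = rotate(template, angle, reshape=False, order=1)
        t = (t - t.mean()) / (t.std() + 1e-8)
        score = fftconvolve(image, t[::-1, ::-1], mode='same')  # = correlation
        idx = np.unravel_index(np.argmax(score), score.shape)
        if score[idx] > best[0]:
            best = (score[idx], idx, angle)
    return best

rng = np.random.default_rng(0)
template = rng.normal(size=(16, 16))
image = 0.1 * rng.normal(size=(128, 128))
image[40:56, 70:86] += rotate(template, 30, reshape=False, order=1)  # hidden copy
score, loc, angle = standard_template_matching(image, template, range(0, 360, 5))
print(loc, angle)   # roughly (47, 77) at ~30 degrees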
Flowed Time of Flight Radiance Fields
Mikhail Okunev · Marc Mapeke · Benjamin Attal · Christian Richardt · Matthew O'Toole · James Tompkin
We present a method to correct for motion artifacts in continuous-wave time of flight (C-ToF) imaging. As C-ToF cameras must capture multiple exposures to derive depth, any moving object will exhibit depth errors. We formulate an optimization problem to reconstruct the raw frames captured by the camera via an underlying 4D volumetric scene and a physically-based differentiable C-ToF simulator. With weak optical flow supervision, we can infer a moving 3D scene that explains the raw captures, even though any particular time instant does not provide sufficient constraints upon the depth or motion. On synthetic sequences, we find that our approach reduces depth errors on dynamic objects by up to 20×, particularly for large disparities (≥ 25 pixels) between raw frames. On real-world sequences, we see qualitatively similar gains with artifacts resolved on falling pillows and swinging baseball bats.
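As background, a common four-measurement continuous-wave ToF model recovers depth from correlation samples at 0, 90, 180 and 270 degree phase offsets; sign and bucket conventions vary across cameras, so the sketch below is only a self-consistent toy model, not the paper's camera simulator. It also shows how per-exposure motion biases the recovered depth, which is the artifact the method corrects.

import numpy as np

C_LIGHT = 3e8      # speed of light, m/s
F_MOD = 30e6       # modulation frequency, Hz (hypothetical value)

def ctof_depth(q0, q90, q180, q270):
    """Depth from four correlation samples at 0/90/180/270 degree offsets."""
    phase = np.mod(np.arctan2(q90 - q270, q0 - q180), 2 * np.pi)
    return C_LIGHT * phase / (4 * np.pi * F_MOD)

def simulate_buckets(depth_m, amplitude=1.0, offset=1.5):
    """Ideal correlation samples for a static point at depth_m (toy model)."""
    phi = 4 * np.pi * F_MOD * depth_m / C_LIGHT
    return [offset + amplitude * np.cos(phi - psi)
            for psi in (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)]

print(ctof_depth(*simulate_buckets(2.0)))     # ~2.0 m for a static point

# If the point moves between the four exposures, each bucket observes a
# different depth and the recovered value is biased: the motion artifact.
moving = [simulate_buckets(2.0 + 0.3 * i)[i] for i in range(4)]
print(ctof_depth(*moving))                    # != 2.0 m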
Zero-Shot Image Feature Consensus with Deep Functional Maps
Xinle Cheng · Congyue Deng · Adam Harley · Yixin Zhu · Leonidas Guibas
Correspondences emerge from large-scale vision models trained for generative and discriminative tasks. This has been revealed and benchmarked by computing correspondence maps between pairs of images, using nearest neighbors on the feature grids. Existing work has attempted to improve the quality of these correspondence maps by carefully mixing features from different sources, such as by combining the features of different layers or networks. We point out that a better correspondence strategy is available, one that directly imposes structure on the correspondence field: the functional map. Wielding this simple mathematical tool, we lift the correspondence problem from the pixel space to the function space and directly optimize for mappings that are globally coherent. We demonstrate that our technique yields correspondences that are not only smoother but also more accurate, with the possibility of better reflecting the knowledge embedded in the large-scale vision models that we study. Our approach sets a new state of the art on various dense correspondence tasks. We also demonstrate the effectiveness of our method on keypoint correspondence and affordance map transfer.
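To illustrate the general functional-map recipe (not the authors' exact pipeline), the sketch below assumes per-pixel deep features for two images and uses a PCA basis per image in place of a geometric eigenbasis; it solves a small least-squares system for the map and converts it to a dense correspondence.

```python
# Functional-map sketch: map basis coefficients of image 1 to image 2, then
# read off a dense point map, instead of raw nearest-neighbour feature matching.
import numpy as np
from scipy.spatial.distance import cdist

def pca_basis(feats, k):
    """Orthonormal basis (n_pixels x k) from centered features."""
    u, _, _ = np.linalg.svd(feats - feats.mean(0), full_matrices=False)
    return u[:, :k]

def functional_map_correspondence(F1, F2, k=32):
    Phi1, Phi2 = pca_basis(F1, k), pca_basis(F2, k)
    A = Phi1.T @ F1                  # descriptors of image 1 in basis 1 (k x d)
    B = Phi2.T @ F2                  # descriptors of image 2 in basis 2 (k x d)
    C = B @ np.linalg.pinv(A)        # least-squares functional map: C A ≈ B
    # Convert the functional map to a point map: match rows of Phi2 C to Phi1.
    dist = cdist(Phi2 @ C, Phi1)
    return dist.argmin(axis=1)       # for each pixel in image 2, a pixel in image 1

F1 = np.random.rand(32 * 32, 128)    # hypothetical 32x32 grid of 128-d deep features
F2 = np.random.rand(32 * 32, 128)
print(functional_map_correspondence(F1, F2).shape)
```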
RSL-BA: Rolling Shutter Line Bundle Adjustment
Yongcong Zhang · Bangyan Liao · Yifei Xue · Lu Chen · Peidong Liu · Yizhen Lao
Lines are common features in man-made environments and inherently encode spatial structural information. We propose an accurate and robust bundle adjustment (BA) solution that, from line features across images captured by a rolling shutter (RS) camera, jointly estimates the 6-DoF pose under an independent RS camera model and the geometry of the environment. However, the nonlinear nature of camera motion curves the projection of each 3D straight line on rolling shutter images, exacerbating the complexity of the original problem. To tackle these challenges, we first establish the RS line projection theory using the Plücker line parameterization and derive a series of line-to-line errors that avoid time-consuming curve-to-line (c2l) errors and are resistant to degeneracy. We also provide a complete proof of the degeneracy resistance of our method. Extensive synthetic and real-data experiments demonstrate that our method achieves comparable efficiency with higher accuracy compared to existing point-based RSBA solutions.
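For readers unfamiliar with the Plücker parameterization the paper builds on, a minimal sketch follows; it only shows the textbook line representation and a line-to-line distance, not the rolling-shutter projection model or the paper's residuals.

```python
# Minimal Plücker-line utilities (parameterization illustration only).
import numpy as np

def plucker_from_points(p1, p2):
    """Plücker coordinates (direction d, moment m) of the line through p1, p2."""
    d = p2 - p1
    d = d / np.linalg.norm(d)
    m = np.cross(p1, d)             # moment is independent of the chosen point
    return d, m

def line_line_distance(l1, l2):
    """Shortest distance between two (possibly skew) Plücker lines."""
    d1, m1 = l1
    d2, m2 = l2
    cross = np.cross(d1, d2)
    n = np.linalg.norm(cross)
    if n < 1e-9:                    # parallel lines: use the perpendicular moment offset
        return np.linalg.norm(np.cross(d1, m1 - m2))
    return abs(np.dot(d1, m2) + np.dot(d2, m1)) / n   # reciprocal product / sin(angle)

line_a = plucker_from_points(np.array([0., 0., 0.]), np.array([1., 0., 0.]))  # x-axis
line_b = plucker_from_points(np.array([0., 1., 2.]), np.array([0., 2., 2.]))  # y-direction, z = 2
print(line_line_distance(line_a, line_b))   # expected 2.0 (skew lines offset in z)
```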
How Far Can a 1-Pixel Camera Go? Solving Vision Tasks using Photoreceptors and Computationally Designed Visual Morphology
Andrei Atanov · Rishubh Singh · Jiawei Fu · Isabella Yu · Andrew Spielberg · Amir Zamir
A de facto standard in computer vision is to use a high-resolution camera for solving problems and to choose the placement of that camera (i.e., position and orientation) by human intuition. On the other hand, nature provides contrasting examples, wherein extremely simple, well-designed visual sensors allow for diverse and capable dynamic behaviors~\cite{landanimal2012}. In this work, motivated by these examples, we raise the following questions: \textit{1)} can very simple visual sensors solve computer vision tasks, and \textit{2)} what role does their design play in their effectiveness? We explore sensors of resolutions as low as 1x1, representing a single photoreceptor. First, we demonstrate that just a few photoreceptors can be enough to solve tasks such as visual navigation and dynamical control with performance similar to a high-resolution camera. Second, we show that the design of these simple visual sensors plays a crucial role in their ability to provide useful information. To find a well-performing design for a given task, we present a \textit{computational design optimization} algorithm and demonstrate its effectiveness across different tasks and domains. Finally, we conduct a human study showing that, in most cases, the computational approach is superior to manual human design in finding optimal visual sensor designs, especially for simple and consequently less intuitive sensors.
Hyperion – A fast, versatile symbolic Gaussian Belief Propagation framework for Continuous-Time SLAM
David Hug · Ignacio Alzugaray Lopez · Margarita Chli
Continuous-Time Simultaneous Localization And Mapping (CTSLAM) has become a promising approach for fusing asynchronous and multi-modal sensor setups. Unlike Discrete-Time Simultaneous Localization And Mapping (DTSLAM), which estimates poses discretely, CTSLAM uses continuous-time motion parametrizations, facilitating the integration of a variety of sensors such as rolling-shutter cameras, event cameras and Inertial Measurement Units (IMUs). However, CTSLAM approaches remain computationally demanding and are conventionally posed as centralized Non-Linear Least Squares (NLLS) optimizations. Targeting these limitations, we not only present the fastest SymForce-based [22] B- and Z-Spline implementations, achieving speedups between 2.43x and 110.31x over Sommer et al. [41], but also implement a novel continuous-time Gaussian Belief Propagation (GBP) framework for iterative, decentralized probabilistic inference suitable for multi-agent operations. We demonstrate the efficacy of our open-sourced approach through practical experiments and provide in-depth ablation studies.
Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information
Luca Di Giammarino · Boyang Sun · Giorgio Grisetti · Marc Pollefeys · Hermann Blum · Daniel Barath
Accurate localization in diverse environments is a fundamental challenge in computer vision and robotics. The task involves determining the precise position and orientation of a sensor, typically a camera, within a given space. Traditional localization methods often rely on passive sensing, which may struggle in scenarios with limited features or dynamic environments. In response, this paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy. Our contributions include a data-driven approach with a simple architecture designed for real-time operation, a self-supervised data training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications. Our results demonstrate that our method performs better than existing ones targeting similar problems, generalizing to both synthetic and real data. We also release an open-source implementation to benefit the community.
MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps
Jianhao Zheng · Daniel Barath · Marc Pollefeys · Iro Armeni
Creating 3D semantic reconstructions of environments is fundamental to many applications, especially when related to autonomous agent operation (e.g., goal-oriented navigation and object interaction and manipulation). Commonly, 3D semantic reconstruction systems capture the entire scene at the same level of detail. However, certain tasks (e.g., object interaction) require a fine-grained and high-resolution map, particularly if the objects to be interacted with are small or have intricate geometry. In recent practice, this leads to the entire map being kept at the same high-quality resolution, which results in increased computational and storage costs. To address this challenge we propose MAP-ADAPT, a real-time method for quality-adaptive semantic 3D reconstruction using RGBD frames. MAP-ADAPT is the first adaptive semantic 3D mapping algorithm that, unlike prior work, directly generates a single map with regions of different quality based on both the semantic information and the geometric complexity of the scene. Leveraging a semantic SLAM pipeline for pose and semantic estimation, we achieve comparable or superior results to state-of-the-art methods on a synthetic and a real-world dataset, while significantly reducing storage and computation requirements.
iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning
Tom Fischer · Yaoyao Liu · Artur Jesslen · Noor Ahmed · Prakhar Kaushik · Angtian Wang · Alan Yuille · Adam Kortylewski · Eddy Ilg
Unlike humans, who learn continually, it is still common practice today to train deep learning models for vision tasks only once and on fixed datasets. A variety of approaches have recently addressed handling continual data streams. However, extending these methods to manage out-of-distribution (OOD) scenarios has not been effectively investigated. On the other hand, it has recently been shown that non-continual Neural Mesh Models exhibit strong performance in generalizing to such OOD scenarios. To leverage this decisive property in a continual learning setting, we propose incremental Neural Mesh Models that can be extended with new meshes over time. In addition, we present a latent space initialization strategy that enables us to allocate feature space for future unseen classes in advance and a positional regularization term that forces the features of the different classes to consistently stay in respective latent space regions. We demonstrate the effectiveness of our method through extensive experiments on the Pascal3D and ObjectNet3D datasets and show that our approach outperforms the baselines for classification by 2-6% in the in-domain setting and by 6-50% in the OOD setting. Our work also presents the first incremental learning approach for pose estimation.
PACE: Pose Annotations in Cluttered Environments
Yang You · kai xiong · Zhening Yang · Zhengxiang Huang · Junwei Zhou · Ruoxi Shi · Zhou FANG · Adam Harley · Leonidas Guibas · Cewu Lu
Pose estimation is a crucial task in computer vision, enabling the tracking and manipulation of objects in images or videos. While several datasets exist for pose estimation, there is a lack of large-scale datasets specifically focusing on cluttered scenes with occlusions. This limitation is a bottleneck in the development and evaluation of pose estimation methods, particularly toward the goal of real-world application in environments where occlusions are common. Addressing this, we introduce PACE (Pose Annotations in Cluttered Environments), a large-scale benchmark designed to advance the development and evaluation of pose estimation methods in cluttered scenarios. PACE encompasses 54,945 frames with 257,673 annotations across 300 videos, covering 576 objects from 44 categories and featuring a mix of rigid and articulated items in cluttered scenes. To annotate the real-world data efficiently, we developed an innovative annotation system utilizing a calibrated 3-camera setup. We test state-of-the-art algorithms on PACE along two tracks: pose estimation and object pose tracking, revealing the benchmark's challenges and research opportunities. We plan to release PACE as a public evaluation benchmark, along with the annotation tools we developed, to stimulate further advancements in the field. Our code and data will be publicly available.
Global-to-Pixel Regression for Human Mesh Recovery
Yabo Xiao · MINGSHU HE · Dongdong Yu
Existing human mesh recovery (HMR) methods commonly leverage global features or dense-annotation-based local features to produce a single prediction from an input image. However, the compressed global feature and local features disrupt the spatial geometry of the human body and make it hard to capture local dynamics, resulting in visual-mesh misalignment. Moreover, dense annotations are labor-intensive and expensive. To address these issues, we propose a global-to-pixel-wise prediction framework that preserves spatial information and obtains precise visual-mesh alignment for top-down HMR. Specifically, we present an adaptive 2D Keypoint-Guided Local Encoding Module that enables per-pixel features to capture fine-grained body part information while maintaining structure and local context. The acquisition of local features relies exclusively on sparse 2D keypoint guidance, without any dense annotations or heuristic keypoint-based ROI pooling. The enhanced pixel features are used to predict residuals that rectify the initial estimate produced by the global feature. Secondly, we introduce a Dynamic Matching Strategy that determines positive/negative pixels by calculating only the classification and 2D keypoint costs, further improving visual-mesh alignment. Comprehensive experiments demonstrate the effectiveness of our network design. Our framework outperforms previous local regression methods by a large margin and achieves state-of-the-art performance on the Human3.6M and 3DPW datasets.
3D Hand Pose Estimation in Everyday Egocentric Images
Aditya Prakash · Ruisen Tu · Matthew Chang · Saurabh Gupta
3D hand pose estimation in everyday egocentric images is challenging for several reasons: poor visual signal (occlusion from the object of interaction, low resolution, and motion blur), large perspective distortion (hands are close to the camera), and lack of 3D annotations outside of controlled settings. While existing methods often use hand crops as input to focus on fine-grained visual information and deal with the poor visual signal, the challenges arising from perspective distortion and the lack of 3D annotations in the wild have not been systematically studied. We focus on this gap and explore the impact of different practices, i.e., crops as input, incorporating camera information, auxiliary supervision, and scaling up datasets. Based on our findings, we present WildHands, a system for 3D hand pose estimation in everyday egocentric images. Zero-shot evaluation on four diverse datasets (H2O, Assembly, Epic, and EgoExo4D) demonstrates the effectiveness of our approach across 2D and 3D metrics, where we beat past methods by 7.4% -- 66%. In system-level comparisons, WildHands achieves the best 3D hand pose score on the egocentric split of ARCTIC, beats the popular FrankMocap system, and is competitive with the concurrent HaMeR system.
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
Zicong Fan · Takehiko Ohkawa · Linlin Yang · NIE LIN · Zhishan Zhou · Shihao Zhou · Jiajun Liang · Zhong Gao · Xuanyang Zhang · Xue Zhang · Fei Li · Liu Zheng · Feng Lu · Karim Abou Zeid · Bastain Leibe · Jeongwan On · Seungryul Baek · Aditya Prakash · Saurabh Gupta · Kun He · Yoichi Sato · Otmar Hilliges · Hyung Jin Chang · Angela Yao
We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the head movement. To this end, we designed the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks. Our analysis demonstrates the effectiveness of addressing distortion specific to egocentric cameras, adopting high-capacity transformers to learn complex hand-object interactions, and fusing predictions from different views. Our study further reveals challenging scenarios intractable with state-of-the-art methods, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects. Our efforts will enrich the community's knowledge foundation and facilitate future hand studies on egocentric hand-object interactions.
AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale
Keenon Werling · Janelle M Kaneda · Tian Tan · Rishi Agarwal · Six Skov · Tom Van Wouwe · Scott Uhlrich · Scott Delp · Karen Liu · Nicholas A Bianco · Carmichael Ong · Antoine Falisse · Shardul Sapkota · Aidan Jai Chandra · Joshua A Carter · Ezio Preatoni · Benjamin J Fregly · Jennifer Hicks
While reconstructing human poses in 3D from inexpensive sensors has advanced significantly in recent years, quantifying the dynamics of human motion, including the muscle-generated joint torques and external forces, remains a challenge. Prior attempts to estimate physics from reconstructed human poses have been hampered by a lack of datasets with high-quality pose and force data for a variety of movements. We present the AddBiomechanics Dataset 1.0, which includes physically accurate human dynamics of 273 human subjects, over 70 hours of motion and force plate data, totaling more than 24 million frames. To construct this dataset, novel analytical methods were required, which are also reported here. We propose a benchmark for estimating human dynamics from motion using this dataset, and present several baseline results. The AddBiomechanics Dataset is publicly available at [link available upon acceptance]
Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images
Chuanrui Zhang · Yonggen Ling · Minglei Lu · Minghan Qin · Haoqian Wang
We study the 3D object understanding task for manipulating everyday objects with different material properties (diffuse, specular, transparent, and mixed). Existing monocular and RGB-D methods suffer from scale ambiguity due to missing or imprecise depth measurements. We present CODERS, a one-stage approach for Category-level Object Detection, pose Estimation and Reconstruction from Stereo images. The base of our pipeline is an implicit stereo matching module that combines stereo image features with 3D position information. Connecting this module to the subsequent transformer-decoder architecture leads to end-to-end learning of the multiple tasks required by robot manipulation. Our approach significantly outperforms all competing methods on the public TOD dataset. Furthermore, trained on simulated data, CODERS generalizes well to unseen category-level object instances in real-world robot manipulation experiments. Our dataset, code, and demos will be available on our project page.
Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation
zhe zhao · Mengshi Qi · Huadong Ma
Generating realistic human grasps is a crucial yet challenging task for applications involving object manipulation in computer graphics and robotics. Existing methods often struggle to generate fine-grained, realistic human grasps in which all fingers effectively interact with objects, as they encode the hand with a single holistic representation and then estimate both hand posture and position in a single step. In this paper, we propose a novel Decomposed Vector-Quantized Variational Autoencoder (DVQ-VAE) to address this limitation by decomposing the hand into several distinct parts and encoding them separately. This part-aware decomposed architecture facilitates more precise management of the interaction between each hand component and the object, enhancing the overall realism of generated human grasps. Furthermore, we design a novel dual-stage decoding strategy that first determines the type of grasp under skeletal physical constraints and then identifies the location of the grasp, which greatly improves the verisimilitude as well as the adaptability of the model to unseen hand-object interactions. In experiments, our model achieves about a 14.1% relative improvement in the quality index compared to state-of-the-art methods on four widely adopted benchmarks. We will release the code and model.
CliffPhys: Camera-based Respiratory Measurement using Clifford Neural Networks
Omar Ghezzi · Giuseppe Boccignone · Giuliano Grossi · Raffaella Lanzarotti · Alessandro D'Amelio
This paper presents CliffPhys, a family of models that leverage hypercomplex neural architectures for camera-based respiratory measurement. The proposed approach extracts respiratory motion from standard RGB cameras, relying on optical flow and monocular depth estimation to obtain a 2D vector field and a scalar field, respectively. We show how the adoption of Clifford Neural Layers to model the geometric relationships within the recovered input fields allows respiratory information to be effectively recovered. Experimental results on three publicly available datasets demonstrate CliffPhys' superior performance compared to both baselines and recent neural approaches, achieving state-of-the-art results in the prediction of respiratory rates. Source code available at: https://anonymous.4open.science/r/CliffPhys-2D26/.
Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions
Yihao Ai · Yifei Qi · Bo Wang · Yu Cheng · Xinchao Wang · Robby T. Tan
Existing 2D human pose estimation research predominantly concentrates on well-lit scenarios, with limited exploration of poor lighting conditions, which are a prevalent aspect of daily life. Recent studies on low-light pose estimation require paired well-lit and low-light images with ground truths for training, which are impractical due to the inherent challenges of annotating low-light images. To this end, we introduce a novel approach that eliminates the need for low-light ground truths. Our primary novelty lies in leveraging two complementary teacher networks to generate more reliable pseudo labels, enabling our model to achieve competitive performance on extremely low-light images without training on low-light ground truths. Our framework consists of two stages. In the first stage, our model is trained on well-lit data with low-light augmentations. In the second stage, we propose a dual-teacher framework to utilize the unlabeled low-light data, where a center-based main teacher produces pseudo labels for relatively visible cases, while a keypoint-based complementary teacher focuses on producing pseudo labels for the persons missed by the main teacher. With the pseudo labels from both teachers, we propose a person-specific low-light augmentation that challenges a student model during training to outperform the teachers. Experimental results on a real low-light dataset (ExLPose-OCN) show that our method achieves a 6.8% (2.4 AP) improvement over the state-of-the-art (SOTA) method, even though, unlike the SOTA method, ours uses no low-light ground-truth data.
DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
Yiqun Duan · Xianda Guo · Zheng Zhu
Monocular depth estimation is a challenging task that predicts the pixel-wise depth from a single 2D image. Current methods typically model this problem as a regression or classification task. We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. It learns an iterative denoising process to `denoise' random depth distribution into a depth map with the guidance of monocular visual conditions. The process is performed in the latent space encoded by a dedicated depth encoder and decoder. Instead of diffusing ground truth (GT) depth, the model learns to reverse the process of diffusing the refined depth of itself into random depth distribution. This self-diffusion formulation overcomes the difficulty of applying generative models to sparse GT depth scenarios. The proposed approach benefits this task by refining depth estimation step by step, which is superior for generating accurate and highly detailed depth maps. Experimental results on KITTI and NYU-Depth-V2 datasets suggest that a simple yet efficient diffusion approach could reach state-of-the-art performance in both indoor and outdoor scenarios with acceptable inference time.
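A schematic of a generic conditioned reverse-diffusion update is sketched below; the `TinyConditionedDenoiser`, latent shapes, and noise schedule are placeholders for illustration and are not the DiffusionDepth architecture or its self-diffusion formulation.

```python
# Generic DDPM-style reverse step for a depth latent conditioned on image features.
import torch
import torch.nn as nn

class TinyConditionedDenoiser(nn.Module):
    def __init__(self, ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch * 2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, ch, 3, padding=1),
        )
    def forward(self, z_t, cond, t):
        # Condition by concatenation; real models also embed the timestep t.
        return self.net(torch.cat([z_t, cond], dim=1))

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def reverse_step(denoiser, z_t, cond, t):
    """One z_t -> z_{t-1} update using the predicted noise."""
    eps = denoiser(z_t, cond, t)
    coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
    mean = (z_t - coef * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + torch.sqrt(betas[t]) * torch.randn_like(z_t)

denoiser = TinyConditionedDenoiser()
z = torch.randn(1, 8, 32, 32)          # random depth latent to be denoised
cond = torch.randn(1, 8, 32, 32)       # stand-in for encoded monocular visual conditions
for t in reversed(range(T)):
    z = reverse_step(denoiser, z, cond, t)
print(z.shape)                          # denoised depth latent, ready for the depth decoder
```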
Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation
Sudhir Kumar Reddy Yarram · Junsong Yuan
Video extrapolation in space and time (VEST) enables viewers to forecast a 3D scene into the future and view it from novel viewpoints. Recent methods propose to learn an entangled representation, aiming to model layered scene geometry, motion forecasting and novel view synthesis together, while assuming simplified affine motion and homography-based warping at each scene layer, leading to inaccurate video extrapolation. Instead of entangled scene representation and rendering, our approach chooses to disentangle scene geometry from scene motion, via lifting the 2D scene to 3D point clouds, which enables high quality rendering of future videos from novel views. To model future 3D scene motion, we propose a disentangled two-stage approach that initially forecasts ego-motion and subsequently the residual motion of dynamic objects (e.g., cars, people). This approach ensures more precise motion predictions by reducing inaccuracies from entanglement of ego-motion with dynamic object motion, where better ego-motion forecasting could significantly enhance the visual outcomes. Extensive experimental analysis on two urban scene datasets demonstrates the superior performance of our proposed method in comparison to strong baselines.
Recent work in Visual Odometry and SLAM has shown the effectiveness of using deep network backbones. Despite excellent performance, such approaches are often expensive to run or do not generalize well zero-shot. To address this problem, we introduce Deep Patch Visual-SLAM, a new system for monocular visual SLAM based on the DPVO visual odometry system. We introduce (1) a backend for long-term loop closure and (2) a separate mid-term backend with efficient global optimization. On real-world datasets, DPV-SLAM runs at 2x real-time framerates. We achieve the same accuracy as DROID-SLAM on EuRoC while running twice as fast and using a third of the VRAM. We also outperform DROID-SLAM by large margins on KITTI and TartanAir.
ConGeo: Robust Cross-view Geo-localization across Ground View Variations
Li Mi · Chang Xu · Javiera Castillo Navarro · SYRIELLE MONTARIOL · Wen Yang · Antoine Bosselut · Devis TUIA
Cross-view geo-localization aims at localizing a ground-level query image by matching it to its corresponding geo-referenced aerial view. In real-world scenarios, the task requires handling diverse ground images captured by users with varying orientations and fields of view (FoVs). However, existing learning pipelines are orientation-specific or FoV-specific, requiring separate training for different settings. Such models heavily depend on the North-aligned spatial correspondence and specific FoVs in the training data, compromising their robustness to ground view variations. We propose ConGeo, a single- and cross-modal Contrastive method for Geo-localization: it enhances robustness and consistency in feature representations to improve a model's invariance to orientation and its resilience to FoV variations, by enforcing proximity between ground view variations of the same location. As a generic learning objective for cross-view geo-localization, when integrated into state-of-the-art pipelines, ConGeo significantly boosts the performance of three base models on four geo-localization benchmarks for diverse ground view variations and outperforms competing methods that train separate models for each ground view variation.
GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
Manu S Pillai · Mamshad Nayeem Rizve · Shah Mubarak
Cross-view video geo-localization (CVGL) aims to obtain the GPS trajectory of a street-view video by matching it with reference aerial-view images. Despite exhibiting promising performance, current CVGL methods face key limitations. They often rely on camera intrinsics and odometry information, utilize context from multiple frames to obtain frame-level features (leading to high computational overhead), and generate temporally inconsistent GPS trajectories by retrieving each street-view frame independently. To address these challenges, in this work, we propose TransCVGL, the first fully transformer-based method for cross-view video geo-localization. We hypothesize that video geo-localization does not require complex temporal modeling, unlike other common video understanding tasks such as action recognition. Instead, we demonstrate that the representations from a street-view geo-localization model can be efficiently aggregated to obtain a video-level representation. To achieve this, we propose a transformer-adapter module, GeoAdapter, to aggregate image-level representations of an image geo-localization model and to adapt it to video inputs. Furthermore, to ensure temporally consistent GPS predictions, we introduce TransRetriever, the first transformer-based approach that models independent frame retrievals through an auto-regressive transformer decoder. Finally, we validate the efficacy of our method through extensive experiments, showcasing state-of-the-art performance on benchmark datasets.
SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection
Hongcheng Zhang · Liu Liang · Pengxin Zeng · Xiao Song · Zhe Wang
Sparse 3D detectors have received significant attention since the query-based paradigm enjoys low latency without explicit dense BEV feature construction. However, these detectors achieve worse performance than their dense counterparts. In this paper, we find that the key to bridging the performance gap is to enhance the awareness of rich representations in the two modalities. Here, we present a high-performance, fully sparse detector for end-to-end multi-modality 3D object detection. The detector, termed SparseLIF, contains three key designs: (1) Perspective-Aware Query Generation (PAQG) to generate high-quality 3D queries with perspective priors, (2) RoI-Aware Sampling (RIAS) to further refine prior queries by sampling RoI features from each modality, and (3) Uncertainty-Aware Fusion (UAF) to precisely quantify the uncertainty of each sensor modality and adaptively conduct the final multi-modality fusion, thus achieving great robustness against sensor noise. At the time of submission, SparseLIF achieves state-of-the-art performance on the nuScenes dataset, ranking 1st on both the validation set and the test benchmark, outperforming all state-of-the-art 3D object detectors by a notable margin. The source code will be released upon acceptance.
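As an illustration of the uncertainty-aware fusion idea (a generic stand-in, not SparseLIF's UAF module), the sketch below weights per-query camera and LiDAR features by their predicted inverse uncertainties, so a noisier modality contributes less to the fused representation.

```python
# Uncertainty-weighted fusion of two modality feature streams (illustrative only).
import torch
import torch.nn as nn

class UncertaintyFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.unc_cam = nn.Linear(dim, 1)
        self.unc_lidar = nn.Linear(dim, 1)

    def forward(self, feat_cam, feat_lidar):
        # Softplus keeps the predicted per-query uncertainties positive.
        u_cam = nn.functional.softplus(self.unc_cam(feat_cam)) + 1e-6
        u_lid = nn.functional.softplus(self.unc_lidar(feat_lidar)) + 1e-6
        w_cam, w_lid = 1.0 / u_cam, 1.0 / u_lid        # inverse-uncertainty weights
        return (w_cam * feat_cam + w_lid * feat_lidar) / (w_cam + w_lid)

fusion = UncertaintyFusion()
queries_cam = torch.randn(2, 900, 256)     # hypothetical per-query camera features
queries_lidar = torch.randn(2, 900, 256)   # hypothetical per-query LiDAR features
print(fusion(queries_cam, queries_lidar).shape)   # -> (2, 900, 256)
```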
Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression
Dingyuan Zhang · Dingkang Liang · Zichang Tan · Xiaoqing Ye · Cheng Zhang · Jingdong Wang · Xiang Bai
Slow inference speed is one of the most crucial concerns for deploying multi-view 3D detectors in tasks with high real-time requirements like autonomous driving. Although many sparse query-based methods have already attempted to improve the efficiency of 3D detectors, they neglect the backbone, especially when Vision Transformers (ViT) are used for better performance. To tackle this problem, we explore efficient ViT backbones for multi-view 3D detection via token compression and propose a simple yet effective method called TokenCompression3D (ToC3D). By leveraging history object queries as high-quality foreground priors, modeling 3D motion information in them, and interacting them with image tokens through the attention mechanism, ToC3D can effectively determine the information density of image tokens and segment the salient foreground tokens. With the proposed dynamic router design, ToC3D can allocate more computing resources to important foreground tokens while minimizing information loss, leading to a more efficient ViT-based multi-view 3D detector. Extensive results on the large-scale nuScenes dataset show that our method can nearly maintain the performance of recent SOTA methods with up to 30% inference speedup, and the improvements are consistent after scaling up the ViT and input resolution. Code will be made available.
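The sketch below illustrates the general idea of query-guided token compression in a hypothetical form (not the ToC3D router): image tokens are scored by attention from object queries and only the top-scoring ones are kept.

```python
# Query-guided token pruning: keep only the image tokens that object queries attend to.
import torch

def compress_tokens(image_tokens, object_queries, keep_ratio=0.7):
    """image_tokens: (B, N, C); object_queries: (B, Q, C)."""
    attn = torch.einsum("bqc,bnc->bqn", object_queries, image_tokens)
    attn = attn.softmax(dim=-1)                 # each query distributes attention over tokens
    saliency = attn.mean(dim=1)                 # (B, N): average interest per token
    k = int(image_tokens.shape[1] * keep_ratio)
    idx = saliency.topk(k, dim=1).indices       # indices of the most informative tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, image_tokens.shape[-1])
    return torch.gather(image_tokens, 1, idx)   # (B, k, C) compressed token set

tokens = torch.randn(2, 1024, 256)              # e.g. 32x32 patch tokens from a ViT stage
queries = torch.randn(2, 300, 256)              # hypothetical history object queries
print(compress_tokens(tokens, queries).shape)   # -> (2, 716, 256)
```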
Image-to-Lidar Relational Distillation for Autonomous Driving Data
Anas Mahmoud · Ali Harakeh · Steven Waslander
Pre-trained on extensive and diverse multi-modal datasets, 2D foundation models excel at addressing 2D tasks with little or no downstream supervision, owing to their robust representations. The emergence of 2D-to-3D distillation frameworks has extended these capabilities to 3D models. However, distilling 3D representations for autonomous driving datasets presents challenges like self-similarity, class imbalance, and point cloud sparsity, hindering the effectiveness of contrastive distillation, especially in zero-shot learning contexts. Whereas other methodologies, such as similarity-based distillation, enhance zero-shot performance, they tend to yield less discriminative representations, diminishing few-shot performance. We investigate the gap in structure between the 2D and the 3D representations that result from state-of-the-art distillation frameworks and reveal a significant mismatch between the two. Additionally, we demonstrate that the observed structural gap is negatively correlated with the efficacy of the distilled representations on zero-shot and few-shot 3D semantic segmentation. To bridge this gap, we propose a relational distillation framework enforcing intra-modal and cross-modal constraints, resulting in distilled 3D representations that closely capture the structure of the 2D representation. This alignment significantly enhances 3D representation performance over those learned through contrastive distillation in zero-shot segmentation tasks. Furthermore, our relational loss consistently improves the quality of 3D representations in both in-distribution and out-of-distribution few-shot segmentation tasks, outperforming approaches that rely on the similarity loss.
Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene
Ruiyang Zhang · Hu Zhang · Hang Yu · Zhedong Zheng
Open-world 3D object detection aims to accurately detect objects in unstructured environments with no explicit supervisory signals. This task, given sparse LiDAR point clouds, often results in compromised performance for detecting small or distant objects due to the inherent sparsity and limited spatial resolution. In this paper, we are among the early attempts to integrate LiDAR data with 2D images for open-world 3D detection and introduce a new method, dubbed LiDAR-2D Self-paced Learning (LiSe). We argue that RGB images serve as a valuable complement to LiDAR data, offering precise 2D localization cues, particularly when scarce LiDAR points are available for certain objects. Considering the unique characteristics of both modalities, our framework devises a self-paced learning pipeline that incorporates adaptive sampling and weak model aggregation strategies. The adaptive sampling strategy dynamically tunes the distribution of pseudo labels during training, countering the tendency of models to overfit on easily detected samples, such as nearby and large-sized objects. By doing so, it ensures a balanced learning trajectory across varying object scales and distances. The weak model aggregation component consolidates the strengths of models trained under different pseudo label distributions, culminating in a robust and powerful final model. Experimental evaluations validate the efficacy of our proposed LiSe method, manifesting significant improvements of $+7.1$\% AP$_{BEV}$ and $+3.4$\% AP$_{3D}$ on nuScenes compared to existing techniques. To facilitate further research and replication, we will publicly release the code and pre-trained checkpoints.
milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing
Fangqiang Ding · Zhen Luo · Peijun Zhao · Chris Xiaoxuan Lu
Human motion sensing plays a crucial role in smart systems for decision-making, user interaction, and personalized services. The extensive research conducted so far is predominantly based on cameras, whose intrusive nature limits their use in smart home applications. To address this, mmWave radars have gained popularity due to their privacy-friendly features. In this work, we propose milliFlow, a novel deep learning approach that estimates scene flow as complementary motion information for mmWave point clouds, serving as an intermediate level of features and directly benefiting downstream human motion sensing tasks. Experimental results demonstrate the superior performance of our method when compared with competing approaches. Furthermore, by incorporating scene flow information, we achieve remarkable improvements in human activity recognition and human parsing, and support human body part tracking. To foster further research in this area, we will provide our codebase and dataset for open access.
Hetecooper: Feature Collaboration Graph for Heterogeneous Collaborative Perception
Congzhang Shao · Guiyang Luo · Quan Yuan · Yifu Chen · Yilin Liu · Gong Kexin · Jinglin Li
Collaborative perception effectively expands the perception range of agents by sharing perceptual information, and it addresses the occlusion problem in single-vehicle perception. However, existing collaboration methods all rely on the assumption that the perception models are isomorphic; when agents use different perception model architectures, the intermediate features differ in size, number of channels, and semantic space, which poses challenges for collaboration. We introduce Hetecooper, a collaborative perception framework for scenarios with heterogeneous perception models. Hetecooper models the correlation between heterogeneous features through a feature collaboration graph. This approach retains the complete information of the features and can automatically adapt to changes in feature size. Moreover, we design a method based on the graph transformer to facilitate feature message transfer within the graph. Initially, the semantic space of the nodes is unified through a semantic mapper. Subsequently, neighbor information is aggregated through attention guided by edge weights. Finally, the graph nodes are reorganized into complete features, thereby achieving effective fusion of heterogeneous features. Test results demonstrate that our method achieves superior performance in both model-isomorphic and model-heterogeneous scenarios, and also exhibits good scalability. Our code will be open-sourced.
LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping
Nikhil Gosala · Kürsat Petek · B Ravi Kiran · Senthil Yogamani · Paulo L. J. Drews-Jr · Wolfram Burgard · Abhinav Valada
Semantic Bird's Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner, and then finetunes it for the task of semantic BEV mapping using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach performs on par with the existing state-of-the-art approaches while using only 1% of BEV labels and no additional labeled data.
This work addresses the task of modeling spatiotemporal traffic patterns directly from overhead imagery, which we refer to as image-driven traffic modeling. We extend this line of work and introduce a multi-modal, multi-task transformer-based segmentation architecture that can be used to create dense city-scale traffic models. Our approach includes a geo-temporal positional encoding module for integrating geo-temporal context and a probabilistic objective function for estimating traffic speeds that naturally models temporal variations. We evaluate our method extensively using the Dynamic Traffic Speeds (DTS) benchmark dataset and significantly improve the state-of-the-art. Finally, we introduce the DTS++ dataset to support mobility-related location adaptation experiments.
Occupancy as Set of Points
Yiang Shi · Tianheng Cheng · Qian Zhang · Wenyu Liu · Xinggang Wang
In this paper, we explore a novel point representation for 3D occupancy prediction from multi-view images, which we name Occupancy as Set of Points. Existing camera-based methods tend to exploit dense volume-based representations to predict the occupancy of the whole scene, making it hard to focus on specific areas or areas outside the perception range. In comparison, we present Points of Interest (PoIs) to represent the scene and propose OSP, a novel framework for point-based 3D occupancy prediction. Owing to the inherent flexibility of the point-based representation, OSP achieves state-of-the-art performance compared with existing methods and excels in terms of training and inference adaptability. It extends beyond traditional perception boundaries and can be seamlessly integrated with volume-based methods to significantly enhance their effectiveness. Experiments on the Occ3D-nuScenes occupancy benchmark show that OSP has strong performance and flexibility.
Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation
Yuwen Pan · Rui Sun · Naisong Luo · Tianzhu Zhang · Yongdong Zhang
Semantic segmentation of night-time images holds significant importance in computer vision, particularly for applications like night environment perception in autonomous driving systems. However, existing methods tend to parse night-time images from a day-time perspective, leaving the inherent challenges in low-light conditions (such as compromised texture and deceiving matching errors) unexplored. To address these issues, we propose a novel end-to-end optimized approach, named NightFormer, tailored for night-time semantic segmentation, avoiding the conventional practice of forcibly fitting night-time images into day-time distributions. Specifically, we design a pixel-level texture enhancement module to acquire texture-aware features hierarchically with phase enhancement and amplified attention, and an object-level reliable matching module to realize accurate association matching via reliable attention in low-light environments. Extensive experimental results on various challenging benchmarks including NightCity, BDD and Cityscapes demonstrate that our proposed method performs favorably against state-of-the-art night-time semantic segmentation methods.
Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction
Zihao Liu · Xiaoyu Zhang · Guangwei Liu · Ji Zhao · Ningyi Xu
In autonomous driving, the high-definition (HD) map plays a crucial role in localization and planning. Recently, several methods have facilitated end-to-end online map construction in DETR-like frameworks. However, little attention has been paid to the potential capabilities of the query mechanism for map elements. This paper introduces MapQR, an end-to-end method with an emphasis on enhancing query capabilities for constructing online vectorized maps. To probe desirable information efficiently, MapQR utilizes a novel query design, called scatter-and-gather query, which is modelled explicitly by separate content and position parts. The base map instance queries are scattered to different reference points and added with positional embeddings to probe information from BEV features. These scattered queries are then gathered back to enhance information within each map instance. Together with a simple and effective improvement of a BEV encoder, the proposed MapQR achieves the best mean average precision (mAP) and maintains good efficiency on both nuScenes and Argoverse~2. In addition, integrating our query design into other models can boost their performance significantly. The source codes will be released online.
Online Vectorized HD Map Construction using Geometry
Zhixin Zhang · Yiyuan Zhang · Xiaohan Ding · Fusheng Jin · Xiangyu Yue
Constructing online vectorized High-Definition (HD) maps is critical for downstream prediction and planning. Recent efforts have built strong baselines for this task; however, the geometric shapes and relations of instances in road systems, such as parallelism, perpendicularity, and rectangular shapes, are still under-explored. In our work, we propose GeMap (Geometry Map), which learns Euclidean shapes and relations of map instances end-to-end, beyond fundamental perception. Specifically, we design a geometric loss based on angle and magnitude clues that is robust to rigid transformations of driving scenarios. To address the limitations of the vanilla attention mechanism in learning geometry, we propose to decouple self-attention to handle Euclidean shapes and relations independently. Our method achieves new state-of-the-art performance on the nuScenes and Argoverse 2 datasets. Remarkably, it reaches a 71.8% mAP on the large-scale Argoverse 2 dataset, outperforming MapTRv2 by +4.4% and surpassing the 70% mAP threshold for the first time. The code will be available.
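To make the invariance argument concrete, here is a simplified stand-in for an angle/magnitude-based geometric loss (not GeMap's exact formulation): edge lengths and turning angles of a polyline are unchanged by rotating and translating the scene, so the loss between a ground-truth polyline and a rigidly transformed copy of it is near zero.

```python
# Rigid-transformation-invariant polyline loss built from edge lengths and turning angles.
import math
import torch

def edge_geometry(polyline):
    """polyline: (N, 2) ordered vertices -> (edge lengths, cosines of turning angles)."""
    edges = polyline[1:] - polyline[:-1]
    lengths = edges.norm(dim=-1)
    unit = edges / (lengths.unsqueeze(-1) + 1e-8)
    cos_angles = (unit[1:] * unit[:-1]).sum(dim=-1)
    return lengths, cos_angles

def geometric_loss(pred, gt):
    lp, cp = edge_geometry(pred)
    lg, cg = edge_geometry(gt)
    return (lp - lg).abs().mean() + (cp - cg).abs().mean()

gt = torch.tensor([[0., 0.], [1., 0.], [2., 0.], [2., 1.]])   # an L-shaped map element
c, s = math.cos(0.7), math.sin(0.7)
R = torch.tensor([[c, -s], [s, c]])
rotated = gt @ R.T + torch.tensor([5., -3.])                  # rigid transform of the GT
print(geometric_loss(rotated, gt))                            # ~0: the loss is invariant
```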
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
Wenzhao Zheng · Weiliang Chen · Yuanhui Huang · Borui Zhang · Yueqi Duan · Jiwen Lu
Understanding how the 3D scene evolves is vital for making decisions in autonomous driving. Most existing methods achieve this by predicting the movements of object boxes, which cannot capture more fine-grained scene information. In this paper, we explore a new framework of learning a world model, OccWorld, in the 3D occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. We propose to learn a world model based on 3D occupancy rather than 3D bounding boxes and segmentation maps for three reasons: 1) expressiveness. 3D occupancy can describe the more fine-grained 3D structure of the scene; 2) efficiency. 3D occupancy is more economical to obtain (e.g., from sparse LiDAR points). 3) versatility. 3D occupancy can adapt to both vision and LiDAR. To facilitate the modeling of the world evolution, we learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens to describe the surrounding scenes. We then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens to decode the future occupancy and ego trajectory. Extensive experiments on nuScenes demonstrate the ability of OccWorld to effectively model the driving scene evolutions. OccWorld also produces competitive planning results without using instance and map supervision.
PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving
Zhili Chen · Maosheng Ye · Shuangjie Xu · Tongyi Cao · Qifeng Chen
We present a new interaction mechanism of prediction and planning for end-to-end autonomous driving, called PPAD (Iterative Interaction of Prediction and Planning Autonomous Driving), which considers the timestep-wise interaction to better integrate prediction and planning. An ego vehicle performs motion planning at each timestep based on the trajectory prediction of surrounding agents (e.g., vehicles and pedestrians) and its local road conditions. Unlike existing end-to-end autonomous driving frameworks, PPAD models the interactions among ego, agents, and the dynamic environment in an autoregressive manner by interleaving the Prediction and Planning processes at every timestep, instead of a single sequential process of prediction followed by planning. Specifically, we design ego-to-agent, ego-to-map, and ego-to-BEV interaction mechanisms with hierarchical dynamic key objects attention to better model the interactions. The experiments on the nuScenes benchmark show that our approach outperforms state-of-the-art methods. Our source code will be made publicly available.
Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation
Yixiao Wang · Chen Tang · Lingfeng Sun · Simone Rossi · Yichen Xie · Chensheng Peng · Thomas Hannagan · Stefano Sabatini · Nicola Poerio · Masayoshi TOMIZUKA · Wei Zhan
Diffusion models are promising for joint trajectory prediction and controllable generation in autonomous driving, but they face challenges of inefficient inference time and high computational demands. To tackle these challenges, we introduce Optimal Gaussian Diffusion (OGD) and Estimated Clean Manifold (ECM) Guidance. OGD optimizes the prior distribution for a small diffusion time $T$ and starts the reverse diffusion process from it. ECM directly injects guidance gradients to the estimated clean manifold, eliminating extensive gradient backpropagation throughout the network. Our methodology streamlines the generative process, enabling practical applications with reduced computational overhead. Experimental validation on the large-scale Argoverse 2 dataset demonstrates our approach's superior performance, offering a viable solution for computationally efficient, high-quality joint trajectory generation and controllable generation for autonomous driving.
Learning to Drive via Asymmetric Self-Play
Chris Zhang · Sourav Biswas · Kelvin Wong · Kion Fallah · Lunjun Zhang · Dian Chen · Sergio Casas · Raquel Urtasun
Large-scale data is crucial for learning realistic and capable driving policies. However, it is impractical to naively scale dataset size with real-data alone. The majority of real driving data is uninteresting for learning, and collecting more to cover the long-tail of scenarios is expensive and potentially unsafe. We propose to use asymmetric self-play to scale learning beyond real world data, with additional challenging, solvable, and realistic synthetic scenarios. In particular, we design two agents—a teacher and a student—with asymmetric objectives so that the teacher learns to propose scenarios that itself can pass but the student fails, and the student learns to solve those scenarios. When applied to traffic simulation, our approach learns realistic policies with significantly lower collision rates across urban, highway, and long-tail scenarios. Our approach also zero-shot transfers to generate more effective training data for learning end-to-end autonomy policies, significantly outperforming alternatives like training on adversarial scenarios or real data alone.
Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos
Akshay Paruchuri · Samuel Ehrenstein · Shuxian Wang · Inbar Fried · Stephen Pizer · Niethammer Marc · Roni Sengupta
Monocular depth estimation in endoscopy videos can enable assistive and robotic surgery to obtain better coverage of the organ and detection of various health issues. Despite promising progress on mainstream natural-image depth estimation, such techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects. In this paper, we utilize photometric cues, i.e., the light emitted from an endoscope and reflected by the surface, to improve monocular depth estimation. We first create two novel loss functions, with supervised and self-supervised variants, that utilize a per-pixel shading representation. We then propose a novel depth refinement network that leverages the same per-pixel shading representation. Finally, we introduce teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. We achieve state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data.
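The photometric cue being exploited can be summarized by a toy near-field shading model, sketched below under simplifying assumptions (co-located point light, Lambertian surface); this is illustrative only and is not the paper's loss or refinement network.

```python
# Toy per-pixel near-field shading: intensity falls off with inverse-square depth
# and with the angle between the surface normal and the (co-located) light direction.
import numpy as np

def shading_image(depth, normals, albedo=1.0, eps=1e-6):
    """depth: (H, W) in meters; normals: (H, W, 3) unit normals in camera coordinates."""
    light_dir = np.array([0.0, 0.0, -1.0])           # light and camera at the origin, looking along +Z
    cos_term = np.clip((normals * light_dir).sum(-1), 0.0, 1.0)
    return albedo * cos_term / (depth ** 2 + eps)     # inverse-square near-field falloff

H, W = 64, 64
depth = np.full((H, W), 0.05)                         # 5 cm: a typical endoscope working distance
normals = np.zeros((H, W, 3))
normals[..., 2] = -1.0                                # surface facing the camera
shading = shading_image(depth, normals)
print(shading.min(), shading.max())                   # brighter where the surface is closer
```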
I Can't Believe It's Not Scene Flow!
Ishan Khatri · Kyle Vedder · Neehar Peri · Deva Ramanan · James Hays
Current scene flow methods broadly fail to describe motion on small objects, and current scene flow evaluation protocols hide this failure by averaging over many points, most of which are drawn from larger objects. To fix this evaluation failure, we propose a new evaluation protocol, Bucket Normalized EPE, which is class-aware and speed-normalized, enabling contextualized error comparisons between object types that move at vastly different speeds. To highlight current method failures, we propose a frustratingly simple supervised scene flow baseline, TrackFlow, built by bolting a high-quality pretrained detector (trained using many class-rebalancing techniques) onto a simple tracker, which produces state-of-the-art performance on current standard evaluations and large improvements over prior art on our new evaluation. Our results make it clear that all scene flow evaluations must be class- and speed-aware, and supervised scene flow methods must address point class imbalances. We will release the evaluation code publicly upon publication.
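A hypothetical implementation of a class-aware, speed-bucketed endpoint error is sketched below; the bucket edges and normalization choice are illustrative assumptions, not the benchmark's official definition.

```python
# Class-aware, speed-bucketed endpoint error (illustrative metric sketch).
import numpy as np

def bucketed_epe(pred_flow, gt_flow, classes, speed_buckets=(0.0, 0.5, 2.0, np.inf)):
    """pred_flow, gt_flow: (N, 3) per-point flow in m/frame; classes: (N,) labels."""
    epe = np.linalg.norm(pred_flow - gt_flow, axis=1)
    gt_speed = np.linalg.norm(gt_flow, axis=1)
    results = {}
    for cls in np.unique(classes):
        for lo, hi in zip(speed_buckets[:-1], speed_buckets[1:]):
            mask = (classes == cls) & (gt_speed >= lo) & (gt_speed < hi)
            if not mask.any():
                continue
            # Normalize error by speed in moving buckets so slow, small objects
            # are not drowned out by fast, large ones; report raw EPE when static.
            norm = np.maximum(gt_speed[mask], 1e-6) if lo > 0 else 1.0
            results[(int(cls), lo)] = float(np.mean(epe[mask] / norm))
    return results

N = 1000
print(bucketed_epe(np.random.randn(N, 3) * 0.1,
                   np.random.randn(N, 3) * 0.1,
                   np.random.randint(0, 3, N)))
```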
Motion and Structure from Event-based Normal Flow
Zhongyang Ren · Bangyan Liao · Delei Kong · Jinghang Li · Peidong Liu · Laurent Kneip · Guillermo Gallego · Yi Zhou
Recovering the camera motion and scene geometry from visual data is a fundamental problem in the field of computer vision. Its success in standard vision is attributed to the maturity of feature extraction, data association and multi-view geometry. The recent emergence of neuromorphic event-based cameras places great demands on approaches that use raw event data as input to solve this fundamental problem. Existing state-of-the-art solutions typically infer data association implicitly by iteratively reversing the event data generation process. However, the nonlinear nature of these methods limits their applicability in real-time tasks, and the constant-motion assumption leads to unstable results under agile motion. To this end, we rethink the problem formulation in a way that aligns better with the differential working principle of event cameras. We show that event-based normal flow can be used, via the proposed geometric error term, as an alternative to the full flow in solving a family of geometric problems that involve instantaneous first-order kinematics and scene geometry. Furthermore, we develop a fast linear solver and a continuous-time nonlinear solver on top of the proposed geometric error term. Experiments on both synthetic and real data demonstrate the superiority of our linear solver in terms of accuracy and efficiency, and indicate its complementary value as an initialization method for existing nonlinear solvers. In addition, our continuous-time nonlinear solver exhibits exceptional capability in accommodating sudden variations in motion since it does not rely on the constant-motion assumption.
Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection
Hu Cao · Zehua Zhang · Yan Xia · Xinyi Li · Jiahao Xia · Guang Chen · Alois C. Knoll
In frame-based vision, object detection faces substantial performance degradation under challenging conditions due to the limited sensing capability of conventional cameras. Event cameras output sparse and asynchronous events, providing a potential solution to solve these problems. However, effectively fusing two heterogeneous modalities remains an open issue. In this work, we propose a novel hierarchical feature refinement network for event-frame fusion. The core concept is the design of the coarse-to-fine fusion module, denoted as the cross-modality adaptive feature refinement (CAFR) module. In the initial phase, the bidirectional cross-modality interaction (BCI) part is used to facilitate information bridging from two distinct sources. Subsequently, the features are further refined by aligning the channel-level mean and variance in the two-fold adaptive feature refinement (TAFR) part. We conducted extensive experiments on two benchmarks: the low-resolution PKU-DDD17-Car dataset and the high-resolution DSEC dataset. Experimental results show that our method outperforms the second-best method by an impressive margin of 8.0% on the DSEC dataset. Furthermore, we introduced 15 different corruption types to the frame images to assess the model's robustness. The results reveal that the proposed method exhibits significantly better robustness (69.5% versus 38.7%) compared to using frames only. The code will be available.
Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation
Yuhwan Jeong · Hoonhee Cho · Kuk-Jin Yoon
Event cameras with high dynamic range ensure scene capture even in low-light conditions. However, night events exhibit patterns different from those captured during the day. This difference causes performance degradation when applying night events to a model trained solely on day events, a limitation that persists due to the lack of annotated night events. To overcome this limitation, we aim to alleviate the data imbalance by translating annotated day data into night events. However, generating events from different modalities poses challenges in reproducing their unique properties. Accordingly, we propose an unpaired event-to-event day-to-night translation model that effectively learns to map from one domain to another using a Diffusion GAN. The proposed translation model analyzes events in the spatio-temporal dimension with wavelet decomposition and disentangled convolution layers. We also propose a new temporal contrastive learning scheme with a novel shuffling and sampling strategy to regularize temporal continuity. To validate the efficacy of the proposed methodology, we redesign metrics for evaluating events translated in an unpaired setting, aligning them with the event modality for the first time. Our framework shows successful day-to-night event translation while preserving the characteristics of events. In addition, through our translation method, we enable event-based models to learn about night events by translating annotated day events into night events. Our approach effectively mitigates the performance degradation of applying real night events to downstream tasks.
UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation
Yunfan Lu · Guoqiang Liang · Yusheng Wang · LIN WANG · Hui Xiong
Video frames captured by rolling shutter (RS) cameras during fast camera movement frequently exhibit RS distortion and blur simultaneously. These RS frames can be modeled as a row-wise combination of global shutter (GS) frames within the exposure period. Naturally, recovering high-frame-rate GS sharp frames from an RS blur image must simultaneously consider RS correction, deblurring, and frame interpolation. A naive way is to decompose the whole process into separate tasks and cascade existing methods; however, this results in cumulative errors and noticeable artifacts. Event cameras enjoy many advantages, e.g., high temporal resolution, making them well suited to our problem. To this end, we propose the \textbf{first} and novel approach, named \textbf{UniINR}, to recover arbitrary frame-rate sharp GS frames from an RS blur image and paired event data. Our key idea is unifying spatial-temporal implicit neural representation (INR) to directly map the position and time coordinates to RGB values to address the interlocking degradations in the image restoration process. Specifically, we introduce spatial-temporal implicit encoding (STE) to convert an RS blur image and events into a spatial-temporal representation (STR). To query a specific sharp frame (GS or RS), we embed the exposure time into STR and decode the embedded features pixel-by-pixel to recover a sharp frame. Our method features a lightweight model with only \textbf{$0.379M$} parameters, and it also enjoys high inference efficiency, achieving $2.83 ms/frame$ in $31 \times$ frame interpolation of an RS blur frame. Extensive experiments show that our method significantly outperforms prior methods.
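To make the coordinate-to-RGB mapping concrete, here is a minimal, illustrative sketch of a spatial-temporal implicit neural representation in PyTorch. The Fourier-feature encoding, layer sizes, and the toy query grid are our own assumptions for illustration, not UniINR's actual architecture.

```python
# Illustrative sketch only (not the authors' UniINR code): a minimal
# spatial-temporal INR that maps (x, y, t) coordinates to RGB values.
import torch
import torch.nn as nn


class SpatioTemporalINR(nn.Module):
    def __init__(self, num_freqs: int = 8, hidden: int = 128):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs  # sin/cos Fourier features of (x, y, t)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def encode(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) with x, y, t normalized to [0, 1]
        freqs = 2.0 ** torch.arange(self.num_freqs, device=coords.device)
        ang = coords[..., None] * freqs * torch.pi          # (N, 3, F)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.encode(coords))


# Query one "sharp frame" at exposure time t = 0.5 on a 4x4 pixel grid.
ys, xs = torch.meshgrid(torch.linspace(0, 1, 4), torch.linspace(0, 1, 4), indexing="ij")
coords = torch.stack([xs.flatten(), ys.flatten(), torch.full((16,), 0.5)], dim=-1)
rgb = SpatioTemporalINR()(coords)  # (16, 3)
```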
IAM-VFI : Interpolate Any Motion for Video Frame Interpolation with motion complexity map
Kihwan Yoon · Yong Han Kim · Sungjei Kim · Jinwoo Jeong
Within a video, different regions have varying motion complexity, with simple regions containing static or global motion and complex regions containing fast motion or substantial local motion. In recent years, the performance of flow-based Video Frame Interpolation (VFI) algorithms has improved significantly. However, existing training methods train on randomly cropped regions of the training data without considering the complexity of the motion. As a result, they cannot handle all regions of the frame that contain varying motion complexity. To solve this problem, we propose a novel VFI approach (IAM-VFI) that can interpolate any motion by considering the motion complexity of all regions in the frame. First, we propose a training data classification method for motion optimization based on each motion complexity. Then, using the proposed data, a flow estimation network generates optimized results for each complexity. Finally, we propose a Motion Complexity Estimation Network (MCENet) to generate a Motion Complexity Map (MCM) that can estimate the motion complexity of each region. Our proposed methods can be easily applied to most flow-based VFI algorithms. Experimental results show that the proposed method can interpolate any motion and significantly improve the performance of existing VFI algorithms.
Human Motion Forecasting in Dynamic Domain Shifts: A Homeostatic Continual Test-time Adaptation Framework
Qiongjie Cui · Huaijiang Sun · Bin Li · Jianfeng Lu · Weiqing Li
Existing motion forecasting models, while making progress, struggle to bridge the gap between the source and target domains. Recent solutions often rely on an unrealistic assumption that the target domain remains stationary. Due to the ever-changing environment, however, the real-world test distribution may experience ongoing/continual shifts over time, leading to catastrophic forgetting and error accumulation when adapting to evolving domains. To solve these challenges, this work introduces HoCoTTA, a framework for homeostatic continual test-time adaptation. It aligns with the knowledge distillation and parameter isolation paradigm, enabling the identification of domain-invariant and domain-specific knowledge, where the former is shared (to be retained) in continual TTA across domains, while the latter needs to be updated. Specifically, we propose a multi-domain homeostasis assessment to estimate the uncertainty of the current model parameters when faced with novel-domain samples. Then, the Fisher information matrix is computed to measure the parameter sensitivity, with larger values indicating domain-sensitive parameters, and vice versa. Moreover, we propose an isolated parameter optimization strategy to update those domain-specific parameters to adapt to the new domain, while preserving the invariant ones. In our experiments, HoCoTTA outperforms state-of-the-art approaches on several benchmarks, excelling in particular at addressing continuous domain drifts with a large improvement. More details and results can be found in the supplementary material.
DIM: Dyadic Interaction Modeling for Social Behavior Generation
Minh Tran · Di Chang · Maksim Siniukov · Mohammad Soleymani
Human-human communication is like a delicate dance where listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for creating 3D facial motions in dyadic interactions. Existing work considers a listener as a reactive agent with reflexive behaviors to the speaker's voice and facial motions. The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach that jointly models speakers' and listeners' motions through masking and contrastive learning to learn representations that capture the dyadic context. To enable the generation of non-deterministic behaviors, we encode both listener and speaker motions into discrete latent representations through a VQ-VAE. The pre-trained model is further fine-tuned for motion generation. Extensive experiments demonstrate the superiority of our framework in generating listener motions, establishing a new state-of-the-art according to the quantitative measures capturing the diversity and realism of generated motions. Qualitative results demonstrate the superior capabilities of the proposed approach in generating diverse and realistic expressions, eye blinks and head gestures.
Length-Aware Motion Synthesis via Latent Diffusion
Alessio Sampieri · Alessio Palma · Indro Spinelli · Fabio Galasso
The target duration of a synthesized human motion is a critical attribute that requires modeling control over the motion dynamics and style. Speeding up an action performance is not merely fast-forwarding it. However, state-of-the-art techniques for human behavior synthesis have limited control over the target sequence length. We introduce the problem of generating length-aware 3D human motion sequences from textual descriptors, and we propose a novel model to synthesize motions of variable target lengths, which we dub "Length-Aware Latent Diffusion" (LADiff). LADiff consists of two new modules: 1) a length-aware variational auto-encoder to learn motion representations with length-dependent latent codes; 2) a length-conforming latent diffusion model to generate motions with a richness of details that increases with the required target sequence length. LADiff significantly improves over the state-of-the-art across most of the existing motion synthesis metrics on the two established benchmarks of HumanML3D and KIT-ML. The code will be open-sourced.
Towards Open Domain Text-Driven Synthesis of Multi-Person Motions
Shan Mengyi · Lu Dong · Yutao Han · Yuan Yao · Tao Liu · Ifeoma Nwogu · Guo-Jun Qi · Mitch Hill
This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is already extensively studied, it remains challenging to synthesize motions for more than one or two subjects from in-the-wild prompts, mainly due to the lack of available datasets. In this work, we curate human pose and motion datasets by estimating pose information from large-scale image and video datasets. Our models use a transformer-based diffusion framework that accommodates multiple datasets with any number of subjects or frames. Experiments explore both generation of multi-person static poses and generation of multi-person motion sequences. To our knowledge, our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.
FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis
Ke Fan · Junshu Tang · Weijian Cao · Ran Yi · Moran Li · Jingyu Gong · Jiangning Zhang · Yabiao Wang · Chengjie Wang · Lizhuang Ma
Text-to-motion synthesis is a crucial task in computer vision. Existing methods are limited in their universality, as they are tailored for single-person or two-person scenarios and cannot be applied to generate motions for more individuals. To achieve number-free motion synthesis, this paper reconsiders motion generation and proposes to unify single- and multi-person motion by the conditional motion distribution. Furthermore, a generation module and an interaction module are designed for our FreeMotion framework to decouple the process of conditional motion generation and finally support number-free motion synthesis. Besides, based on our framework, the current single-person motion spatial control method can be seamlessly integrated, achieving precise control of multi-person motion. Extensive experiments demonstrate the superior performance of our method and our capability to infer single and multi-human motions simultaneously.
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
Heeseung Yun · Ruohan Gao · Ishwarya Ananthabhotla · Anurag Kumar · Jacob Donley · Chao Li · Gunhee Kim · Vamsi Krishna Ithapu · Calvin Murdock
Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to the wearer’s behaviors. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of the wearer’s head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive image-to-sphere projections. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including active speaker localization in noisy conversations, audio-based spherical sound source localization, and behavior anticipation in everyday activities.
Explorative Inbetweening of Time and Space
Haiwen Feng · Zheng Ding · Zhihao Xia · Simon Niklaus · Victoria Fernandez Abrevaya · Michael J. Black · Xuaner Zhang
We introduce bounded generation, a generalized method to control video generation to synthesize arbitrary camera and subject motion based only on a given start and end frame. Our objective is to fully leverage the inherent generalization capability of an image-to-video model without additional training or fine-tuning of the original model. This is achieved through the proposed new sampling strategy, which we call Time Reversal Fusion, that fuses the temporally forward and backward denoising paths conditioned on the start and end frame, respectively. The fused path results in a video that smoothly connects the two frames, generating inbetweening of faithful subject motion, novel views of static scenes, and seamless video looping when the two bounding frames are identical. We curate a diverse evaluation dataset of image pairs and compare against the closest existing methods. We find that Time Reversal Fusion outperforms related work on all subtasks, exhibiting a remarkable ability to generate complex motions and 3D-consistent views guided by an end-frame.
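As a rough illustration of the sampling idea described above (not the authors' code), the sketch below fuses a forward denoising path conditioned on the start frame with a time-reversed path conditioned on the end frame. The `denoise_step` and `update` callables are hypothetical placeholders for an image-to-video diffusion model's components, and simple averaging is our own assumption for the fusion rule.

```python
# Rough sketch of a fused denoising step under the stated assumptions.
import torch

def time_reversal_fusion_step(latents, t, start_frame, end_frame, denoise_step, update):
    # latents: (T, C, H, W) video latents at noise level t
    eps_fwd = denoise_step(latents, t, cond=start_frame)                # forward path
    eps_bwd = denoise_step(latents.flip(0), t, cond=end_frame).flip(0)  # backward path, re-reversed
    eps = 0.5 * (eps_fwd + eps_bwd)                                     # fuse the two paths
    return update(latents, eps, t)

# Toy usage with dummy components so the function runs end to end.
toy_denoise = lambda z, t, cond=None: torch.zeros_like(z)
toy_update = lambda z, e, t: z - e
out = time_reversal_fusion_step(torch.randn(8, 4, 32, 32), 500,
                                start_frame=None, end_frame=None,
                                denoise_step=toy_denoise, update=toy_update)
```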
TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models
Jeongho Kim · Min-Jung Kim · Junsoo Lee · Choo Jaegul
Recent advances in diffusion models have shed light on Text-to-Video (T2V) and Image-to-Video (I2V) generation. As a line of work, pose-driven video generation with a reference image has also gained attention, showing the capability of realistic human dance synthesis. However, previous methods face some remaining challenges. First, the network that encodes the pose information is fine-tuned using pose videos from the target domain, thus lacking generalizability to diverse poses. Second, as the models are driven by the provided pose videos, the outcomes inevitably depend on the performance of the off-the-shelf pose detector. In this paper, we present a pose-driven video generation method with a reference image that mitigates the aforementioned issues. Unlike previous methods, we utilize a pretrained ControlNet without fine-tuning to leverage its pre-acquired knowledge from a vast amount of pose-image-caption pairs. To keep the ControlNet frozen, we introduce a correspondence layer, enabling the network to learn the correspondence between the pose and appearance features. Additionally, by introducing an additional temporal layer to the ControlNet, we enhance robustness with respect to pose detector outliers. Extensive experiments demonstrate that the proposed method achieves promising results in video synthesis tasks encompassing various poses.
WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models
Zijian He · Peixin Chen · Guangrun Wang · Guanbin Li · Philip Torr · Liang Lin
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions, limiting their effectiveness in video try-on applications. Moreover, video-based models require extensive, high-quality data and substantial computational resources. To tackle these issues, we reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach. This model, conditioned on specific garments and individuals, is trained on still images rather than videos. It leverages diffusion guidance from pre-trained models including a video masked autoencoder for segment smoothness improvement and a self-supervised model for feature alignment of adjacent frames in the latent space. This integration markedly boosts the model's ability to maintain temporal coherence, enabling more effective video try-on within an image-based framework. Our experiments on the VITON-HD and DressCode datasets, along with tests on the VVT and TikTok datasets, demonstrate WildVidFit's capability to generate fluid and coherent videos.
Pix2Gif: Motion-Guided Diffusion for GIF Generation
Hitesh Kandala · Jianfeng Gao · Jianwei Yang
We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We tackle this problem differently by formulating the task as an image translation problem steered by text and motion magnitude prompts, as shown in the teaser figure. To ensure that the model adheres to motion guidance, we propose a new motion-guided warping module to spatially transform the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss to ensure the transformed feature map remains within the same space as the target image, ensuring content consistency and coherence. In preparation for the model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model -- it not only captures the semantic prompt from text but also the spatial ones from motion guidance. We train all our models using a single node of 16 V100 GPUs.
Factorizing Text-to-Video Generation by Explicit Image Conditioning
Rohit Girdhar · Mannat Singh · Andrew Brown · Quentin Duval · Samaneh Azadi · Sai Saketh Rambhatla · Mian Akbar Shah · Xi Yin · Devi Parikh · Ishan Misra
We present FACT2V, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion and multi-stage training) that enable us to directly generate high-quality and high-resolution videos, without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work: 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video. Our model also outperforms commercial solutions such as RunwayML’s Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user’s text prompt, where our generations are preferred 96% of the time over prior work.
DNI: Dilutional Noise Initialization for Diffusion Video Editing
Sunjae Yoon · Gwanhyeong Koo · Ji Woo Hong · Chang Yoo
Text-based diffusion video editing systems have been successful in performing edits with high fidelity and textual alignment. However, this success is limited to rigid-type editing such as style transfer and object overlay, while preserving the original structure of the input video. This limitation stems from the initial latent noise employed in diffusion video editing systems. These systems prepare the initial latent noise to edit by gradually infusing Gaussian noise onto the input video. However, we observed that the visual structure of the input video still persists within this initial latent noise, thereby restricting non-rigid editing such as motion changes that necessitate structural modifications. To this end, this paper proposes the Dilutional Noise Initialization (DNI) framework, which enables editing systems to perform precise and dynamic modifications, including non-rigid editing. DNI introduces the concept of 'noise dilution', which adds further noise to the latent noise in the region to be edited to soften the structural rigidity imposed by the input video, resulting in more effective edits closer to the target prompt. Extensive experiments demonstrate the adaptability and effectiveness of the DNI framework. The code will be made publicly available.
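A minimal sketch of the 'noise dilution' idea under our own reading of the abstract: extra Gaussian noise is blended into the initial latent only inside the region to be edited, weakening the input video's structure there. The blending weight and masking scheme below are assumptions, not the paper's exact procedure.

```python
# Illustrative only: dilute the initial latent noise inside the edit region.
import torch

def dilute_noise(init_latent: torch.Tensor, edit_mask: torch.Tensor, dilution: float = 0.5):
    # init_latent: (C, H, W) latent obtained by noising the input frame
    # edit_mask:   (1, H, W) in {0, 1}, 1 where non-rigid edits are wanted
    extra = torch.randn_like(init_latent)
    mixed = (1 - dilution) * init_latent + dilution * extra
    return torch.where(edit_mask.bool(), mixed, init_latent)

lat = dilute_noise(torch.randn(4, 64, 64), (torch.rand(1, 64, 64) > 0.5).float())
```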
DATENeRF: Depth-Aware Text-based Editing of NeRFs
Sara Rojas Martinez · Julien Philip · Kai Zhang · Sai Bi · Fujun Luan · Bernard Ghanem · Kalyan Sunkavalli
Recent diffusion models have demonstrated impressive capabilities for text-based 2D image editing. Applying similar ideas to edit a NeRF scene remains challenging as editing 2D frames individually does not produce multiview-consistent results. We make the key observation that the geometry of a NeRF scene provides a way to unify these 2D edits. We leverage this geometry in depth-conditioned ControlNet to improve the consistency of individual 2D image edits. Furthermore, we propose an inpainting scheme that uses the NeRF scene depth to propagate 2D edits across images while staying robust to errors and resampling issues. We demonstrate that this leads to more consistent, realistic and detailed editing results compared to previous state-of-the-art text-based NeRF editing methods.
FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models
Wei WU · Qingnan Fan · Shuai Qin · Hong Gu · Ruoyu Zhao · Antoni Chan
Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Despite excellent methods leveraging attention mechanisms that have been developed to refine the editing guidance, these approaches necessitate modifications through complex network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and brings excessive low-frequency signals for editing. Leveraging this insight, we introduce a novel fine-tuning free approach that employs progressive \textbf{Fre}qu\textbf{e}ncy truncation to refine the guidance of \textbf{Diff}usion models for universal editing tasks (\textbf{FreeDiff}). Our method achieves comparable results with state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.
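To illustrate the kind of operation that progressive frequency truncation suggests, here is a toy sketch that removes the low-frequency band of a guidance term with a hard FFT mask. The radial mask and a fixed cutoff are our own simplifications; the paper's actual schedule is progressive and more nuanced.

```python
# Toy frequency truncation of a guidance tensor (assumptions noted above).
import torch

def truncate_low_freq(guidance: torch.Tensor, cutoff: float) -> torch.Tensor:
    # guidance: (C, H, W); suppress frequencies below `cutoff` (fraction of Nyquist)
    C, H, W = guidance.shape
    spec = torch.fft.fftshift(torch.fft.fft2(guidance), dim=(-2, -1))
    fy = torch.linspace(-1, 1, H).view(H, 1)
    fx = torch.linspace(-1, 1, W).view(1, W)
    keep = (fy ** 2 + fx ** 2).sqrt() >= cutoff        # True for frequencies we keep
    spec = spec * keep.to(spec.dtype)                  # zero out the low-frequency band
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

g = truncate_low_freq(torch.randn(4, 64, 64), cutoff=0.2)
```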
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
Rohit Gandikota · Joanna Materzynska · Tingrui Zhou · Antonio Torralba · David Bau
We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. Our approach identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes. A slider is created using a small set of prompts or sample images; thus slider directions can be created for either textual or visual concepts. Concept Sliders are plug-and-play: they can be composed efficiently and continuously modulated, enabling precise control over image generation. In quantitative experiments comparing to previous editing techniques, our sliders exhibit stronger targeted edits with lower interference. We showcase sliders for weather, age, styles, and expressions, as well as slider compositions. We show how sliders can transfer latents from StyleGAN for intuitive editing of visual concepts for which textual description is difficult. We also find that our method can help address persistent quality issues in Stable Diffusion XL including repair of object deformations and fixing distorted hands.
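As a plug-and-play illustration of the slider mechanic (not the authors' training code), the sketch below adds a learned low-rank direction to a frozen linear layer and scales it at inference time with a user-controlled slider value; the dimensions and rank are arbitrary placeholders.

```python
# Minimal LoRA-style slider applied to a frozen linear layer.
import torch
import torch.nn as nn

class SliderLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base                        # frozen pretrained layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = 0.0                        # slider value set at inference

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = SliderLinear(nn.Linear(320, 320))
layer.scale = 1.5                               # dial the concept up
out = layer(torch.randn(2, 320))
```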
Using My Artistic Style? You Must Obtain My Authorization
Xiuli Bi · Haowei Liu · Weisheng Li · Bo Liu · Bin Xiao
Artistic images typically contain the unique creative styles of artists. However, it is easy to transfer an artist's style to arbitrary target images using style transfer techniques. To protect styles, some researchers use adversarial attacks to safeguard artists' artistic style images. Prior methods only considered defending against all style transfer models, but artists may allow specific models to transfer their artistic styles properly. To meet such requirements, we propose an Artistic Style Protection Scheme (ASPS). The scheme utilizes adversarial perturbations to introduce biases in the mean and variance of content and style features extracted by unauthorized models while aligning authorized models' content and style features. Additionally, it employs pixel-level and feature-level losses to enhance and degrade the output quality of authorized and unauthorized models, respectively. ASPS requires training only once; during usage, there is no need to see any style transfer models again. Meanwhile, it ensures that the visual quality of the authorized model is unaffected by perturbations. Experimental results demonstrate that our method effectively defends against unauthorized models' indiscriminate use of artistic styles, allowing authorized models to operate normally, thus effectively resolving the issue of controlled authorization regarding artists' artistic styles.
Learned Image Enhancement via Color Naming
David Serrano-Lozano · Luis Herranz · Michael S Brown · Javier Vazquez-Corral
A popular method for enhancing images involves learning the style of a professional photo editor using pairs of training images comprised of the original input with the editor-enhanced version. When manipulating images, many editing tools offer a feature that allows the user to manipulate a limited selection of familiar colors. Editing by color name allows easy adjustment of elements like the "blue" of the sky or the "green" of trees. Inspired by this approach to color manipulation, we propose a learning-based image enhancement technique that separates the image into a small set of named colors. Our method learns to globally adjust the image for each specific named color via tone curves and then combines the images using an attention-based fusion mechanism to mimic spatial editing. We demonstrate the effectiveness of our method against several competing methods on the well-known Adobe 5K dataset and the PPR10K dataset, showing notable improvements.
Region-Native Visual Tokenization
Mengyu Wang · Yuyao Huang · Henghui Ding · Xinlong Wang · Tiejun Huang · Yao Zhao · Yunchao Wei · Shuicheng Yan
We explore an innovative region-based visual token representation and present the REgion-native AutoencoDER ("Reader"). In contrast to the majority of previous methods, which represent each image as a grid-shaped token map, "Reader" encodes each image into a sequence of region-based tokens, with each token corresponding to an object or a part of an object in the image. Specifically, "Reader" comprises both an encoder and a decoder. The encoder can partition each image into an adaptive number of arbitrary-shaped regions and encode each region into a token. Subsequently, the decoder utilizes this adaptive-length token sequence to reconstruct the original image. Experimental results demonstrate that such region-based token representation possesses two main notable characteristics. Firstly, it achieves highly efficient image encoding. "Reader" can adaptively use more regions to represent complex areas and fewer regions in simpler ones, thus avoiding information redundancy. Consequently, it achieves superior reconstruction fidelity compared to previous methods, despite using significantly fewer tokens for each image. Secondly, the region-based manner enables manipulation of a local region without causing global changes. As a result, "Reader" inherently supports diverse image editing operations, including erasing, adding, replacing, and modifying shapes on the objects, and achieves excellent performance in the image editing benchmark of smile transferring. Codes will be provided for reproducibility.
Improving image synthesis with diffusion-negative sampling
Alakh Desai · Nuno Vasconcelos
For image generation with diffusion models (DMs), a negative prompt (n) can be used to complement the text prompt (p), helping define properties not desired in the synthesized image. While this improves prompt adherence and image quality, finding good negative prompts is challenging. We argue that this is due to a semantic gap between humans and DMs, which makes good negative prompts for DMs appear unintuitive to humans. To bridge this gap, we propose a new diffusion-negative prompting (DNP) strategy. DNP is based on a new procedure to sample images that are least compliant with p under the distribution of the DM, denoted as diffusion-negative sampling (DNS). Given p, one such image is sampled, which is then translated into natural language by the user or a captioning model, to produce the negative prompt (n). The pair (p, n) is finally used to prompt the DM. DNS is straightforward to implement and requires no training. Experiments and human evaluations show that DNP performs well both quantitatively and qualitatively and can be easily combined with several DM variants.
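As a conceptual sketch (not the paper's exact DNS procedure): with classifier-free guidance, eps_uncond + s * (eps_cond - eps_uncond) steers samples toward the prompt p, and one way to read "least compliant with p" is to steer away from it by negating the guidance direction. The toy `eps_model` below is a stand-in for a real denoiser; the negation is our own assumption for illustration.

```python
# Conceptual sketch of a "diffusion-negative" guidance direction.
import torch

def eps_model(x, t, cond):              # placeholder noise predictor
    return torch.zeros_like(x)

def guided_eps(x, t, cond_emb, uncond_emb, scale, negative=False):
    e_c = eps_model(x, t, cond_emb)
    e_u = eps_model(x, t, uncond_emb)
    s = -scale if negative else scale   # flip the direction for DNS-style sampling
    return e_u + s * (e_c - e_u)

x = torch.randn(1, 4, 64, 64)
eps = guided_eps(x, t=500, cond_emb="p", uncond_emb="", scale=7.5, negative=True)
```

The image sampled this way would then be captioned (by the user or a captioning model) into the negative prompt n, and the pair (p, n) used as usual, e.g., via the standard negative-prompt argument of a diffusion sampling pipeline.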
ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images
Xiangtian Xue · Jiasong Wu · Youyong Kong · Lotfi Senhadji · Huazhong Shu
We present a novel image editing scenario termed Text-grounded Object Generation (TOG), defined as generating a new object in a real image spatially conditioned by textual descriptions. Existing diffusion models exhibit limitations of spatial perception in complex real-world scenes, relying on additional modalities to enforce constraints, and TOG imposes heightened challenges on scene comprehension under the weak supervision of linguistic information. We propose a universal framework, ST-LDM, based on Swin-Transformer, which can be integrated into any latent diffusion model with training-free backward guidance. ST-LDM encompasses a global-perceptual autoencoder with adaptable compression scales and hierarchical visual features, in parallel with a deformable multimodal transformer to generate region-wise guidance for the subsequent denoising process. We transcend the limitation of traditional attention mechanisms that only focus on existing visual features by introducing deformable feature alignment to hierarchically refine spatial positioning fused with multi-scale visual and linguistic information. Extensive experiments demonstrate that our model enhances the localization of attention mechanisms while preserving the generative capabilities inherent to diffusion models. The code will be made publicly available upon publication.
SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
XIAOYU LIU · Yuxiang WEI · Ming LIU · Xianhui Lin · Peiran Ren · xuansong xie · Wangmeng Zuo
Recent text-to-image generation methods such as ControlNet have achieved remarkable success in controlling image layouts, where the images generated by the default model are constrained to strictly follow the visual conditions (e.g., depth maps). However, in practice, the conditions usually provide only a rough layout, and we argue that the text prompts can more faithfully reflect user intentions. For handling the disagreements between the text prompts and rough visual conditions, we propose a novel text-to-image generation method dubbed SmartControl, which is designed to align well with the text prompts while adaptively keeping useful information from the visual conditions. The key idea of our SmartControl is to relax the constraints on areas that conflict with the text prompts in visual conditions, and two main procedures are required to achieve such a flexible generation. Specifically, we extract information from the generative priors of the backbone model (e.g., ControlNet), which effectively represents consistency between the text prompt and visual conditions. Then, a Control Scale Predictor is designed to identify the conflict regions and predict the local control scales. For training the proposed method, a dataset with text prompts and rough visual conditions is constructed. It is worth noting that, even with a limited number (e.g., 1,000 to 2,000) of training samples, our SmartControl generalizes well to unseen objects. Extensive experiments are conducted on four typical visual condition types, and SmartControl achieves superior performance against state-of-the-art methods. Source code, pre-trained models, and datasets will be publicly available.
PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
Guansong Lu · Yuanfan Guo · Jianhua Han · Minzhe Niu · Yihan Zeng · Xu Songcen · Zeyi Huang · Zhao Zhong · Wei Zhang · Hang Xu
Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues like text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. On the other hand, the integration of existing diffusion models, each specialized for different controls and operating in unique latent spaces, poses a challenge due to incompatible image resolutions and latent space embedding structures, hindering their joint use. Addressing these constraints, we present PanGu-Draw, a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. We first propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into structure and texture generators. Each generator is trained using a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and reducing training resources by 51%. Secondly, we introduce Coop-Diffusion, an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows for multi-control image synthesis at arbitrary resolutions without the necessity for additional data or retraining. Empirical validations of PanGu-Draw show its exceptional prowess in text-to-image and multi-control image generation, suggesting a promising direction for future model training efficiencies and generation versatility. The largest 5B T2I PanGu-Draw model will be released soon on the Ascend platform.
Visual Text Generation in the Wild
Yuanzhi Zhu · Jiawei Liu · Feiyu Gao · Wenyu Liu · Xinggang Wang · Peng Wang · Fei Huang · Cong Yao · Zhibo Yang
Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the images generated by SceneVTG provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.
ReCON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories
Chen-yi Lu · Shubham Agarwal · Mehrab Tanjim · Kanak Mahadik · Anup Rao · Subrata Mitra · Shiv K Saini · Saurabh Bagchi · Somali Chaterji
Text-to-image diffusion models excel in generating photo-realistic images but are hampered by slow processing times. Training-free retrieval-based acceleration methods, which leverage pre-generated "trajectories," have been introduced to address this. Yet, these methods often lack diversity and fidelity as they depend heavily on similarities to stored prompts. To address this, we present ReCON (Retrieving CONcepts), an innovative retrieval-based diffusion acceleration method that extracts visual "concepts" from prompts, forming a knowledge base that facilitates the creation of adaptable trajectories. Consequently, ReCON surpasses existing retrieval-based methods, producing high-fidelity images and reducing required Neural Function Evaluations (NFEs) by up to 40%. Extensive testing on the MS-COCO, Pick-a-Pic, and DiffusionDB datasets confirms that ReCON consistently outperforms established methods across multiple metrics such as Pick Score, CLIP Score, and Aesthetics Score. A user study further indicates that 76% of images generated by ReCON are rated as the highest fidelity, outperforming two competing methods: a purely text-based retrieval and a noise similarity-based retrieval.
Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation
Zhengyuan Yang · Jianfeng Wang · Linjie Li · Kevin Lin · Chung-Ching Lin · Zicheng Liu · Lijuan Wang
We introduce “Idea to Image,” an agent system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation. Humans can quickly identify the characteristics of different text-to-image (T2I) models via iterative explorations. This enables them to efficiently convert their high-level generation ideas into effective T2I prompts that can produce good images. We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities that enable exploring unknown models or environments via self-refining tries. Idea2Img cyclically generates revised T2I prompts to synthesize draft images, and provides directional feedback for prompt revision, both conditioned on its memory of the probed T2I model’s characteristics. The iterative self-refinement brings Idea2Img various advantages over base T2I models. Notably, Idea2Img can process input ideas with interleaved image-text sequences, follow ideas with design instructions, and generate images of better semantic and visual qualities. The user preference study validates the efficacy of multimodal iterative self-refinement on automatic image design and generation.
TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models
Aditya Aravind Chinchure · Pushkar Shukla · Gaurav Bhatt · Kiri Salij · Kartik Hosanagar · Leonid Sigal · Matthew Turk
Text-to-Image (TTI) generative models have shown great progress in the past few years in terms of their ability to generate complex and high-quality imagery. At the same time, these models have been shown to suffer from harmful biases, including exaggerated societal biases (e.g., gender, ethnicity), as well as incidental correlations that limit such models' ability to generate more diverse imagery. In this paper, we propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt, using counterfactual reasoning. Unlike other works that evaluate generated images on a predefined set of bias axes, our approach automatically identifies potential biases that might be relevant to the given prompt, and measures those biases. In addition, our paper extends quantitative scores with post-hoc explanations in terms of semantic concepts in the images generated. We show that our method is uniquely capable of explaining complex multi-dimensional biases through semantic concepts, as well as the intersectionality between different biases for any given prompt. We perform extensive user studies to illustrate that the results of our method and analysis are consistent with human judgements.
Navigating Text-to-Image Generative Bias across Indic Languages
Surbhi Mittal · Arnav Sudan · MAYANK VATSA · RICHA SINGH · Tamar Glaser · Tal Hassner
This research delves into evaluating the effectiveness of text-to-image (T2I) models specifically tailored for Indic languages prevalent across India. It scrutinizes the comparative generative capabilities of popular T2I models in Indic languages specifically against their performance in English. With this benchmark, we meticulously assess 30 Indic languages utilizing 2 open-source diffusion models and 2 commercial APIs for generation. The primary objective of this benchmark is to gauge the adequacy of support offered by these models for Indic languages while pinpointing areas that require enhancement. With a linguistic diversity encompassing 30 languages spoken by a population exceeding a billion, the benchmark endeavors to deliver a thorough and insightful evaluation of T2I models within the realm of Indic languages.
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
Fanyue Wei · Wei Zeng · Zhenyang Li · Dawei Yin · Lixin Duan · Wen Li
Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based methods, the visual structure and details of the object are often unexpectedly changed during the diffusion process. One major reason is that these diffusion-based methods usually adopt a simple reconstruction objective during training, which can hardly enforce appropriate structural consistency between the generated image and the reference images. To this end, in this paper, we design a novel reinforcement learning framework by utilizing the deterministic policy gradient method for personalized text-to-image generation, with which various objectives, differentiable or even non-differentiable, can be easily incorporated to supervise the diffusion models to improve the quality of generated images. Experimental results on personalized text-to-image generation benchmark datasets show that our proposed approach surpasses existing state-of-the-art methods by a large margin in visual fidelity while preserving text alignment.
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
Zhao Tianchen · Xuefei Ning · Tongcheng Fang · Enshu Liu · Guyue Huang · Zinan Lin · Shengen Yan · Guohao Dai · Yu Wang
Few-step diffusion models, which enable high-quality text-to-image generation with only a few denoising steps, have substantially reduced inference time. However, considerable memory consumption (5-10GB) still poses limitations for practical deployment on mobile devices. Post-Training Quantization (PTQ) proves to be an effective method for enhancing efficiency in both memory and operational complexity. However, when applied to few-step diffusion models, existing methods designed for multi-step diffusion face challenges in preserving both visual quality and text alignment. In this paper, we discover that quantization is bottlenecked by highly sensitive layers. Consequently, we introduce a mixed-precision quantization method: MixDQ. Firstly, we identify that some highly sensitive layers are caused by outliers in the text embeddings, and design a specialized Begin-Of-Sentence (BOS)-aware quantization to address this issue. Subsequently, we investigate the drawbacks of existing sensitivity metrics and introduce metric-decoupled sensitivity analysis to accurately estimate sensitivity for both image quality and content. Finally, we develop an integer-programming-based method to obtain the optimal mixed-precision configuration. In the challenging 1-step Stable Diffusion XL text-to-image task, current quantization methods fall short at W8A8. Remarkably, MixDQ achieves W3.6A16 and W4A8 quantization with negligible degradation in both visual quality and text alignment. Compared with FP16, it achieves 3-4x reduction in model size and memory costs, along with a 1.5x latency speedup.
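To give a feel for sensitivity-aware bit allocation, here is a toy greedy sketch, not the paper's integer-programming solver: every layer starts at low precision, and extra bits go to the most sensitive layers until a memory budget is spent. The layer names, sizes, and sensitivity scores are made-up placeholders.

```python
# Toy mixed-precision allocation under a bit budget (greedy approximation).
layers = {           # name: (num_params, sensitivity score) -- placeholders
    "down.attn1": (1.0e6, 9.2),
    "mid.attn":   (2.0e6, 4.1),
    "up.conv3":   (0.5e6, 0.7),
}
budget_bits = 4.5 * sum(p for p, _ in layers.values())   # average ~4.5 bits/weight

bits = {name: 3 for name in layers}                       # start every layer at 3-bit
used = sum(layers[n][0] * b for n, b in bits.items())
# Upgrade layers in order of sensitivity while the budget allows it.
for name in sorted(layers, key=lambda n: -layers[n][1]):
    for target in (4, 8):
        extra = layers[name][0] * (target - bits[name])
        if used + extra <= budget_bits:
            used += extra
            bits[name] = target
print(bits)   # e.g., the most sensitive layer ends up at 8-bit
```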
Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
Sanghyun Kim · Seohyeon Jung · Balhae Kim · Moonseok Choi · Jinwoo Shin · Juho Lee
This paper addresses the societal concerns arising from large-scale text-to-image diffusion models for generating potentially harmful or copyrighted content. Existing models rely heavily on internet-crawled data, wherein problematic concepts persist due to incomplete filtration processes. While previous approaches somewhat alleviate the issue, they often rely on text-specified concepts, introducing challenges in accurately capturing nuanced concepts and aligning model knowledge with human understanding. In response, we propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images. The proposed framework can be built upon existing techniques for the same purpose, enhancing their alignment with human judgment. By doing so, we simplify the training objective with a self-distillation-based technique, providing a strong baseline for concept removal. Our experimental results demonstrate that our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.
LCM-Lookahead for Encoder-based Text-to-Image Personalization
Rinon Gal · Or Lichter · Elad Richardson · Or Patashnik · Amit Bermano · Gal Chechik · Danny Cohen-Or
Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by augmenting their training with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment.
Robust-Wide: Robust Watermarking against Instruction-driven Image Editing
Runyi Hu · Jie Zhang · Ting Xu · Jiwei Li · Tianwei Zhang
Instruction-driven image editing allows users to quickly edit an image according to text instructions in a forward pass. Nevertheless, malicious users can easily exploit this technique to create fake images, which could cause a crisis of trust and harm the rights of the original image owners. Watermarking is a common solution to trace such malicious behavior. Unfortunately, instruction-driven image editing can significantly change the watermarked image at the semantic level, making it less robust and effective. We propose Robust-Wide, the first robust watermarking methodology against instruction-driven image editing. Specifically, we adopt the widely-used encoder-decoder framework for watermark embedding and extraction. To achieve robustness against semantic distortions, we introduce a novel Partial Instruction-driven Denoising Sampling Guidance (PIDSG) module, which consists of a large variety of instruction injections and substantial modifications of images at different semantic levels. With PIDSG, the encoder tends to embed the watermark into more robust and semantic-aware areas, which remain in existence even after severe image editing. Experiments demonstrate that Robust-Wide can effectively extract the watermark from the edited image with a low bit error rate of nearly 2.6% for 64-bit watermark messages. Meanwhile, it only induces a negligible influence on the visual quality and editability of the original images. Moreover, Robust-Wide holds general robustness against different sampling configurations and other image editing methods such as ControlNet-InstructPix2Pix, MagicBrush, Inpainting and DDIM Inversion.
COIN-Matting: Confounder Intervention for Image Matting
Zhaohe Liao · Jiangtong Li · Jun Lan · Huijia Zhu · Weiqiang Wang · Li Niu · Liqing Zhang
Deep learning methods have significantly advanced the performance of image matting. However, dataset biases can mislead matting models into biased behavior. In this paper, we identify two typical biases in existing matting models, specifically contrast bias and transparency bias, and discuss their origins in matting datasets. To address these biases, we model the image matting task from the perspective of causal inference and identify the root causes of these biases: the confounders. To mitigate the effects of these confounders, we employ causal intervention through backdoor adjustment and introduce a novel model-agnostic confounder-intervened (COIN) matting framework. Extensive experiments across various matting methods and datasets have demonstrated that our COIN framework can significantly diminish such biases, thereby enhancing the performance of existing matting models.
Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
Junhao Zhang · Mutian Xu · Jay Zhangjie Wu · Chuhui Xue · Wenqing Zhang · XIAOGUANG HAN · Song Bai · Mike Zheng Shou
This paper studies visual representation learning with diffusion-generated synthetic images. We start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent representation learning methods (i.e., contrastive learning, masked modeling, and vision-language pretraining) on diffusion-generated synthetic data and introduce customized solutions by fully exploiting the aforementioned free attention masks, namely Free-ATM. Comprehensive experiments demonstrate Free-ATM's ability to enhance the performance of various representation learning frameworks when utilizing synthetic data. This improvement is consistent across diverse downstream tasks including image classification, detection, segmentation and image-text retrieval. Meanwhile, by utilizing Free-ATM, we can accelerate the pretraining on synthetic images significantly and close the performance gap between representation learning on synthetic data and real-world scenarios.
ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion
Daniel Winter · Matan Cohen · Shlomi Fruchter · Yael Pritch · Alex Rav-Acha · Yedid Hoshen
Diffusion models have revolutionized image editing but often generate images that violate physical laws, particularly the effects of objects on the scene, e.g., occlusions, shadows, and reflections. By analyzing the limitations of self-supervised approaches, we propose a practical solution centered on a "counterfactual" dataset. Our method involves capturing a scene before and after removing a single object, while minimizing other changes. By fine-tuning a diffusion model on this dataset, we are able to not only remove objects but also their effects on the scene. However, we find that applying this approach for photorealistic object insertion requires an impractically large dataset. To tackle this challenge, we propose bootstrap supervision; leveraging our object removal model trained on a small counterfactual dataset, we synthetically expand this dataset considerably. Our approach significantly outperforms prior methods in photorealistic object removal and insertion, particularly at modeling the effects of objects on the scene.
Data Augmentation via Latent Diffusion for Saliency Prediction
Bahar Aydemir · Deblina Bhattacharjee · Tong Zhang · Mathieu Salzmann · Sabine Süsstrunk
Saliency prediction models are constrained by the limited diversity and quantity of labeled data. Standard data augmentation techniques such as rotation and cropping change the scene composition, hence affecting saliency. In this work, we propose a novel data augmentation method for deep saliency prediction that involves editing natural images while retaining the complexity and variability of real-world visual scenes. Since saliency depends on both high-level and low-level features such as semantics and photometric properties, our approach involves learning both by incorporating photometric and semantic attributes such as color, contrast, brightness, and class. To that end, we introduce a saliency-guided cross-attention mechanism that enables targeted edits on the photometric properties, thereby enhancing saliency within specific image regions and providing controllability to our model in the context of saliency prediction. Experimental results demonstrate that our data augmentation method consistently improves the performance of various saliency models. Moreover, leveraging the augmentations generated with these features for saliency prediction yields superior performance on publicly available saliency benchmarks. Our saliency predictions are highly aligned with human visual attention patterns in the edited images, as validated by a user study. We will make our code publicly available upon acceptance.
Score Distillation Sampling with Learned Manifold Corrective
Thiemo Alldieck · Nikos Kolotouros · Cristian Sminchisescu
Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects such as oversaturation or repeated detail. Instead, we train a shallow network mimicking the timestep-dependent frequency bias of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.
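For reference, below is a sketch of the standard SDS gradient from the literature, which is the loss the paper analyzes; the learned manifold corrective itself is not shown. The `unet` placeholder, the timestep range, and the weighting choice are assumptions for illustration.

```python
# Sketch of a DreamFusion-style SDS gradient (not the paper's corrective).
import torch

def sds_grad(x0, text_emb, unet, alphas_cumprod):
    # x0: differentiable rendering / image parameterized by theta, (B, C, H, W)
    t = torch.randint(20, 980, (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps           # forward diffusion
    with torch.no_grad():
        eps_pred = unet(x_t, t, text_emb)                # score model prediction
    w = 1 - a                                            # common weighting choice
    return w * (eps_pred - eps)                          # gradient pushed back through x0

# Toy usage with a dummy predictor and a standard linear beta schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = (1 - betas).cumprod(dim=0)
dummy_unet = lambda x_t, t, emb: torch.zeros_like(x_t)
g = sds_grad(torch.rand(1, 3, 64, 64), None, dummy_unet, alphas_cumprod)
```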
Thinking Outside the BBox: Unconstrained Generative Object Compositing
Gemma Canet Tarrés · Zhe Lin · Zhifei Zhang · Jianming Zhang · Yizhi Song · Dan Ruta · Andrew Gilbert · John Collomosse · Soo Ye Kim
Compositing an object into an image involves multiple non-trivial sub-tasks such as object placement and scaling, color/lighting harmonization, viewpoint/geometry adjustment, and shadow/reflection generation. Recent generative image compositing methods leverage diffusion models to handle multiple sub-tasks at once. However, existing models face limitations due to their reliance on masking the original object during training, which constrains their generation to the input mask. Furthermore, obtaining an accurate input mask specifying the location and scale of the object in a new image can be highly challenging. To overcome such limitations, we define a novel problem of unconstrained generative object compositing, i.e., the generation is not bounded by the mask, and train a diffusion-based model on a synthesized paired dataset. Our first-of-its-kind model is able to generate object effects such as shadows and reflections that go beyond the mask, enhancing image realism. Additionally, if an empty mask is provided, our model automatically places the object in diverse natural locations and scales, accelerating the compositing workflow. Our model outperforms existing object placement and compositing models in various quality metrics and user studies.
Learning Quantized Adaptive Conditions for Diffusion Models
Yuchen Liang · Yuchuan Tian · Lei Yu · Huaao Tang · jie hu · Xiangzhong Fang · Hanting Chen
The curvature of ODE trajectories in diffusion models hinders their ability to generate high-quality images in a small number of function evaluations (NFE). In this paper, we propose a novel and effective approach to reduce trajectory curvature by utilizing auto-encoded guidance. By employing an extremely lightweight quantized guidance encoder, our method incurs only an additional 1% of training parameters, eliminates the need for extra regularization terms, yet achieves significantly better sample quality. In contrast to previous work, our approach retains the key features of score-based diffusion without hindering the use of other acceleration methods. Extensive experiments verify that our method can generate high-quality results under extremely limited sampling costs. With only 6 NFE, we achieve 5.14 FID on CIFAR-10, 6.91 FID on FFHQ 64×64 and 3.10 FID on AFHQv2.
FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models
Junhyuk So · Jungwon Lee · Eunhyeok Park
The substantial computational costs of diffusion models, especially due to the repeated denoising steps necessary for high-quality image generation, present a major obstacle to their widespread adoption. While several studies have attempted to address this issue by reducing the number of score function evaluations (NFE) using advanced ODE solvers without fine-tuning, the decreased number of denoising iterations misses the opportunity to update fine details, resulting in noticeable quality degradation. In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation resources without compromising output quality. To realize the practical benefits of this intuition, we conduct an extensive analysis and propose a novel method, FRDiff. FRDiff is designed to harness the advantages of both reduced NFE and feature reuse, achieving a Pareto frontier that balances fidelity and latency trade-offs in various generative tasks.
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
Ming Li · Taojiannan Yang · Huafeng Kuang · Jie Wu · Zhaoning Wang · xuefeng xiao · Chen Chen
In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and the extracted condition. A straightforward implementation would be to generate images from random noise and then calculate the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet of 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE for segmentation mask, line-art edge, and depth conditions, respectively.
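To make the single-step reward strategy concrete, here is a hedged sketch under assumed interfaces: `unet`, `reward_model` (e.g. a segmentation network), and `alphas_cumprod` are placeholders for pretrained components, not the authors' actual code.

```python
# Sketch of the efficient reward idea: disturb a real image with noise, denoise it in
# ONE step, and score the single-step estimate against the input condition.
import torch
import torch.nn.functional as F

def reward_finetune_step(unet, reward_model, x0, condition, text_emb, alphas_cumprod):
    t = torch.randint(0, 1000, (x0.shape[0],), device=x0.device)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise           # disturb the input image

    eps_pred = unet(x_t, t, text_emb, condition)                  # conditional noise prediction
    x0_pred = (x_t - (1.0 - a_t).sqrt() * eps_pred) / a_t.sqrt()  # single-step x0 estimate

    extracted = reward_model(x0_pred)                             # e.g. predicted segmentation
    return F.mse_loss(extracted, condition)                       # pixel-level consistency loss
```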
Lossy Image Compression with Foundation Diffusion Models
Lucas Relic · Roberto Azevedo · Markus Gross · Christopher Schroers
Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive decoders robust to quantization errors in the conditioning signals, yet achieving competitive results in this manner requires costly training of the diffusion model and long inference times due to the iterative generative process. In this work, we formulate the removal of quantization error as a denoising task, using diffusion to recover lost information in the transmitted image latent. Our approach allows us to perform less than 10% of the full diffusion generative process and requires no architectural changes to the diffusion model, enabling the use of foundation models as a strong prior without additional fine-tuning of the backbone. Our proposed codec outperforms previous methods in quantitative realism metrics, and we verify that our reconstructions are qualitatively preferred by end users, even when other methods use twice the bitrate.
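The core idea of treating quantization error as diffusion noise can be sketched as below; `unet` and `scheduler` stand in for an arbitrary pretrained foundation diffusion model (diffusers-style interface assumed), and the starting timestep is a free parameter, not the paper's exact choice.

```python
# Illustrative sketch: run only a short tail of the reverse process on the quantized latent.
import torch

@torch.no_grad()
def decode_latent(unet, scheduler, z_quantized, start_step=50, total_steps=1000):
    scheduler.set_timesteps(total_steps)
    # Keep only the last `start_step` timesteps: well under 10% of the full process.
    timesteps = scheduler.timesteps[-start_step:]
    z = z_quantized
    for t in timesteps:
        eps = unet(z, t)                            # noise prediction from the foundation model
        z = scheduler.step(eps, t, z).prev_sample   # one reverse-diffusion step
    return z                                        # cleaned latent, ready for the VAE decoder
```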
AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion
yitong jiang · Zhaoyang Zhang · Tianfan Xue · Jinwei Gu
We present AutoDIR, an innovative all-in-one image restoration system incorporating latent diffusion. AutoDIR excels in its ability to automatically identify and restore images suffering from a range of unknown degradations, and it offers intuitive open-vocabulary image editing, empowering users to customize and enhance images according to their preferences. AutoDIR consists of two key stages: a Blind Image Quality Assessment (BIQA) stage, based on a semantic-agnostic vision-language model, which automatically detects unknown degradations in input images, and an All-in-One Image Restoration (AIR) stage, which utilizes structure-corrected latent diffusion to handle multiple types of image degradations. Extensive experimental evaluation demonstrates that AutoDIR outperforms state-of-the-art approaches for a wider range of image restoration tasks. The design of AutoDIR also enables flexible user control (via text prompts) and generalization to new tasks as a foundation model of image restoration.
QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images
Pengbo Guo · Chengxu Liu · Xingsong Hou · Xueming Qian
Fisheye image rectification aims to correct distortions in images taken with fisheye cameras. Although current models show promising results on images with a degree of distortion similar to the training data, they produce sub-optimal results when the degree of distortion changes, unless they are retrained. This lack of generalization to varying degrees of distortion limits their practical application. In this paper, we take a step further and enable effective distortion rectification for images with varying degrees of distortion without retraining. We propose a novel Query-based Controllable Distortion Rectification network for fisheye images (QueryCDR). In particular, we first present the Distortion-aware Learnable Query Mechanism (DLQM), which defines the latent spatial relationships for different distortion degrees as a series of learnable queries. Each query can be learned to obtain position-dependent rectification control conditions, providing control over the rectification process. Then, we propose two kinds of controllable modulating blocks to enable the control conditions to better guide the modulation of the distortion features. These core components cooperate with each other to effectively boost the generalization ability of the model at varying degrees of distortion. Extensive experiments on fisheye image datasets with different distortion degrees demonstrate that our approach achieves high-quality and controllable distortion rectification.
MetaWeather: Few-Shot Weather-Degraded Image Restoration
Youngrae Kim · Younggeol Cho · Thanh-Tung Nguyen · Seunghoon Hong · Youngrae Kim
Real-world weather conditions are intricate and often occur concurrently. However, most existing restoration approaches are limited in their applicability to specific weather conditions in training data and struggle to generalize to unseen weather types, including real-world weather conditions. To address this issue, we introduce MetaWeather, a universal approach that can handle diverse and novel weather conditions with a single unified model. Extending a powerful meta-learning framework, MetaWeather formulates the task of weather-degraded image restoration as a few-shot adaptation problem that predicts the degradation pattern of a query image, and learns to adapt to unseen weather conditions through a novel spatial-channel matching algorithm. Experimental results on the BID Task II.A, SPA-Data, and RealSnow datasets demonstrate that the proposed method can adapt to unseen weather conditions, significantly outperforming the state-of-the-art multi-weather image restoration methods. Code is available at https://anonymous.4open.science/r/MetaWeather/, and we plan to release the code officially upon acceptance.
Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization
Hongtao Wu · Yijun Yang · Angelica I Aviles-Rivero · Jingjing Ren · Sixiang Chen · Haoyu Chen · Lei Zhu
Snow degradations present formidable challenges to computer vision tasks due to the undesirable corruption they introduce in outdoor scenarios. While current deep learning-based desnowing approaches achieve success on synthetic benchmark datasets, they struggle to restore out-of-distribution real-world snowy videos due to the deficiency of paired real-world training data. To address this bottleneck, we devise a new paradigm for video desnowing in a semi-supervised spirit that involves unlabeled real data for generalizable snow removal. Specifically, we construct a real-world dataset with 85 snowy videos, and then present a Semi-supervised Video Desnowing Network (SemiVDN) equipped with a novel Distribution-driven Contrastive Regularization. The elaborated contrastive regularization mitigates the distribution gap between the synthetic and real data, and consequently maintains the desired snow-invariant background details. Furthermore, based on the atmospheric scattering model, we introduce a Prior-guided Temporal Decoupling Experts module to decompose the physical components that make up a snowy video in a frame-correlated manner. We evaluate our SemiVDN on benchmark datasets and on the collected real snowy data. The experimental results demonstrate the superiority of our approach against state-of-the-art image- and video-level desnowing methods.
Spatially-Variant Degradation Model for Dataset-free Super-resolution
ShaoJie Guo · Haofei Song · Qingli Li · Yan Wang
This paper focuses on dataset-free Blind Image Super-Resolution (BISR). Unlike existing dataset-free BISR methods that focus on obtaining a degradation kernel for the entire image, we are the first to explicitly design a spatially-variant degradation model for each pixel. Our method also benefits from having a significantly smaller number of learnable parameters compared to data-driven spatially-variant BISR methods. Concretely, each pixel's degradation kernel is expressed as a linear combination of a learnable dictionary composed of a small number of spatially-variant atom kernels. The coefficient matrices of the atom degradation kernels are derived using membership functions of fuzzy set theory. We construct a novel Probabilistic BISR model with a tailored likelihood function and prior terms. Subsequently, we employ the Monte Carlo EM algorithm to infer the degradation kernels for each pixel. Our method achieves a significant improvement over other state-of-the-art BISR methods, with an average improvement of 1 dB (2X). Code will be released.
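A minimal illustration of the per-pixel kernel construction follows, under assumed shapes; the fuzzy membership coefficients are stubbed out with a softmax over random logits, which is only a placeholder for the paper's fuzzy-set machinery.

```python
# Sketch: each pixel's degradation kernel as a linear combination of a small dictionary
# of atom kernels, with per-pixel combination coefficients.
import torch
import torch.nn.functional as F

H, W, K, S = 64, 64, 4, 7                                # image size, #atoms, kernel size
atoms = torch.randn(K, S, S, requires_grad=True)         # learnable atom kernels
logits = torch.randn(H, W, K)                            # per-pixel membership logits (stand-in)
coeffs = F.softmax(logits, dim=-1)                       # membership-like coefficients

# Per-pixel kernel: (H, W, S, S) = sum_k coeffs[..., k] * atoms[k]
per_pixel_kernel = torch.einsum("hwk,kij->hwij", coeffs, atoms)
print(per_pixel_kernel.shape)                            # torch.Size([64, 64, 7, 7])
```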
Towards Architecture-Agnostic Untrained Networks Priors for Image Reconstruction with Frequency Regularization
Yilin Liu · Yunkui Pang · Jiang Li · Yong Chen · Pew-Thian Yap
Untrained networks inspired by deep image prior have shown promising capabilities in recovering a high-quality image from noisy or partial measurements, without requiring training data. Their success has been widely attributed to the spectral bias acting as an implicit regularization induced by suitable network architectures. However, applications of such network-based priors often entail superfluous architectural decisions, overfitting risks, and slow optimization, all of which hinder their practicality. In this work, we propose efficient, architecture-agnostic methods for a more direct frequency control over the network priors: 1) constraining the bandwidth of the white-noise input, 2) controlling the bandwidth of the interpolation-based upsamplers, and 3) regularizing the Lipschitz constants of the layers. We show that even with just one extra line of code, the overfitting issues in underperforming architectures can be alleviated such that their performance gaps with the high-performing counterparts can be largely closed despite their distinct configurations, mitigating the need for architecture tuning. This then makes it possible to employ a more compact model to achieve similar or superior performance to larger models with greater efficiency. Our regularized network priors compare favorably with current supervised and self-supervised methods on MRI reconstruction and image inpainting tasks, serving as a stronger zero-shot baseline reconstructor. Our code will be made publicly available.
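As a hedged example of the first control knob above (constraining the bandwidth of the white-noise input), the snippet below uses a Gaussian blur as the "one extra line"; the cutoff and the blur operator are illustrative choices, not necessarily the paper's exact implementation.

```python
# Low-pass filtering the white-noise input to an untrained network prior.
import torch
import torchvision.transforms.functional as TF

z = torch.randn(1, 32, 256, 256)                              # standard white-noise input
z_lowpass = TF.gaussian_blur(z, kernel_size=9, sigma=2.0)     # the "one extra line"
# z_lowpass now feeds the untrained network in place of z, biasing it toward
# low-frequency content and mitigating overfitting to measurement noise.
```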
Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution
Xi Yang · Chenhang He · Jianqi Ma · Yabin Zhang
Real-world low-resolution (LR) videos have diverse and complex degradations, imposing great challenges on video super-resolution (VSR) algorithms to reproduce their high-resolution (HR) counterparts with high quality. Recently, diffusion models have shown compelling performance in generating realistic details for image restoration tasks. However, the diffusion process has randomness, making it hard to control the contents of restored images. This issue becomes more serious when applying diffusion models to VSR tasks because temporal consistency is crucial to the perceptual quality of videos. In this paper, we propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models. To ensure content consistency among adjacent frames, we exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss, ensuring that the generated HR video maintains a coherent and continuous visual flow. To further mitigate the discontinuity of generated details, we insert a temporal module into the decoder and fine-tune it with an innovative sequence-oriented loss. The proposed motion-guided latent diffusion (MGLD) based VSR algorithm achieves significantly better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
Contourlet Residual for Prompt Learning Enhanced Infrared Image Super-Resolution
Xingyuan Li · Jinyuan Liu · ZHIXIN CHEN · Yang Zou · Long Ma · Xin Fan · Risheng Liu
Image super-resolution (SR) is a critical technique for enhancing image quality. While recent transformer-based methods have advanced the field, infrared image SR remains a formidable challenge. Due to the inherent characteristics of infrared sensors, such as limited resolution, temperature sensitivity, high noise levels, and environmental impacts, existing deep learning methods yield suboptimal enhancement outcomes when applied to infrared images. To address these challenges, we propose a specialized Contourlet residual framework tailored for infrared images to restore and enhance the critical details from the multi-scale and multi-directional infrared spectral decomposition. It precisely captures and amplifies the high-pass subbands of infrared images, such as edge details and texture nuances, which are vital for achieving superior reconstruction quality. Moreover, recognizing the limitations of traditional learning techniques in capturing the inherent characteristics of infrared images, we incorporate a prompt-based learning paradigm. This approach facilitates a more nuanced understanding and targeted optimization process for infrared images by leveraging the semantic comprehension offered by a visual language model. Our approach not only addresses the common pitfalls associated with infrared imaging but also sets a new paradigm for infrared image SR. Extensive experiments demonstrate that our approach obtains superior results against existing visible image SR models, attaining state-of-the-art performance.
Image-adaptive 3D Lookup Tables for Real-time Image Enhancement with Bilateral Grids
Wontae Kim · Nam Ik Cho
Image enhancement and restoration methods using adaptive 3D lookup tables (3D LUTs) have shown promising results with real-time inference. These methods directly transform input pixel values into enhanced ones using interpolation operations with predicted 3D LUT values. However, it is still challenging to deal with locally different properties of images since most 3D LUT methods are simple color-to-color transforms. Although including spatial information in this transform can be a good solution, it can significantly increase the number of parameters and the inference time. To address this issue, we propose an efficient spatial-aware image enhancement model that combines bilateral grids and 3D LUTs. Specifically, we transform bilateral grids into a spatial feature domain to incorporate spatial information into our 3D LUT model. To reduce inference time and save parameters, we use slicing operations in our network architecture instead of the long decoding path of the U-Net architecture used in most existing studies. Our model achieves state-of-the-art performance without increasing parameters and further reduces inference time, as demonstrated by extensive results.
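For readers unfamiliar with the basic 3D LUT operation being extended here, the sketch below applies a predicted LUT to an image with trilinear interpolation; the LUT size and memory layout are illustrative, not the paper's architecture.

```python
# Minimal 3D LUT application via trilinear interpolation with grid_sample.
import torch
import torch.nn.functional as F

def apply_3d_lut(img, lut):
    """img: (B, 3, H, W) in [0, 1]; lut: (B, 3, D, D, D), indexed as lut[:, :, b, g, r]."""
    # grid_sample's last grid dim is (x, y, z) -> (W, H, D) axes of the LUT,
    # i.e. (r, g, b) with the indexing convention above.
    grid = img.permute(0, 2, 3, 1) * 2.0 - 1.0            # (B, H, W, 3) in [-1, 1]
    grid = grid.unsqueeze(1)                               # (B, 1, H, W, 3)
    out = F.grid_sample(lut, grid, mode="bilinear", align_corners=True)
    return out.squeeze(2)                                  # (B, 3, H, W) enhanced image

img = torch.rand(2, 3, 64, 64)
lut = torch.rand(2, 3, 17, 17, 17)
print(apply_3d_lut(img, lut).shape)                        # torch.Size([2, 3, 64, 64])
```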
Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context
Shashank Agnihotri · Julia Grabinski · Margret Keuper
Pixel-wise predictions are required in a wide variety of tasks such as image restoration, image segmentation, or disparity estimation. Common models involve several stages of data resampling, in which the resolution of feature maps is first reduced to aggregate information and then increased to generate a high-resolution output. Previous works have shown that resampling operations are subject to artifacts such as aliasing. During downsampling, aliases have been shown to compromise the prediction stability of image classifiers. During upsampling, they have been leveraged to detect generated content. Yet, the effect of aliases during upsampling has not yet been discussed w.r.t.~the stability and robustness of pixel-wise predictions. While falling under the same term (aliasing), the challenges for correct upsampling in neural networks differ significantly from those during downsampling: when downsampling, some high frequencies cannot be correctly represented and have to be removed to avoid aliases. However, when upsampling for pixel-wise predictions, we actually require the model to restore such high frequencies that cannot be encoded in lower resolutions. The application of findings from signal processing is therefore a necessary but not a sufficient condition to achieve the desirable output. In contrast, we find that the availability of large spatial context during upsampling allows the model to provide stable, high-quality pixel-wise predictions, even when fully learning all filter weights.
denoiSplit: a method for joint microscopy image splitting and unsupervised denoising
Ashesh Ashesh · Florian Jug
In this work, we tackle the novel and challenging task of joint image splitting and unsupervised denoising. This dual approach is especially critical in fluorescence microscopy, where noise significantly hinders the analysis of the scientific content hidden in the acquired data. Image splitting involves dissecting an image into predefined semantic structures. Our work builds upon uSplit, the current state-of-the-art method for this task. However, we show that uSplit struggles with noise removal, inadvertently distributing the noise across the split output channels. Here we introduce denoiSplit, a new method that preserves the strengths of uSplit while integrating an unsupervised denoising subtask. This integration results in effective semantic image unmixing while ensuring useful results even in the presence of image noise. A key innovation in denoiSplit is the incorporation of specifically formulated noise models into our approach and the suitable adjustment of the KL-divergence loss for the high-dimensional hierarchical latent space we are training. We perform qualitative and quantitative evaluations and compare results to existing benchmarks, demonstrating the effectiveness of denoiSplit: a single network that performs both splitting and denoising.
Region-Aware Sequence-to-Sequence Learning for Hyperspectral Denoising
JiaHua Xiao · Yang Liu · Xing Wei
Proper spectral modeling within hyperspectral images (HSIs) is critical yet highly challenging for HSI denoising. In contrast to existing methods that model long-range spectral dependencies at a huge cost and directly explore spatial-spectral information without region discrimination, we introduce RAS2S—a simple yet effective sequence-to-sequence (Seq2Seq) learning framework for better HSI denoising. RAS2S treats HSI denoising as a Seq2Seq translation problem, which converts noisy spectral sequences to clean ones in an autoregressive fashion. In addition, spatial-spectral information exploration without region discrimination contradicts the intrinsic spatial-spectral diversity of HSIs, leading to negative interference from spatial-spectral unrelated regions. Thus we propose a novel spatial-spectral region-aware module to distinctively perceive semantic regions with different spatial-spectral representations, maximizing the spectral modeling potential of Seq2Seq learning. With such an improved Seq2Seq learning paradigm, RAS2S not only shows huge potential in capturing long-range spectral dependencies, but also maintains the flexibility to handle HSIs with arbitrary numbers of spectral bands. Extensive experiments demonstrate that RAS2S outperforms existing state-of-the-art methods quantitatively and qualitatively with a minimal model size of merely 0.08M parameters.
CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems
Jiankun Zhao · Bowen Song · Liyue Shen
Diffusion models have been demonstrated as strong priors for solving general inverse problems. Most existing Diffusion model-based Inverse Problem Solvers (DIS) employ a plug-and-play approach to guide the sampling trajectory with either projections or gradients. Though effective, these methods generally necessitate hundreds of sampling steps, posing a dilemma between inference time and reconstruction quality. In this work, we try to push the boundary of inference steps to 1-2 NFEs while still maintaining high reconstruction quality. To achieve this, we propose to leverage a pretrained distillation of a diffusion model, namely a consistency model, as the data prior. The key to achieving few-step guidance is to enforce two types of constraints during the sampling process of the consistency model: a soft measurement constraint with ControlNet and a hard measurement constraint via optimization. Supporting both single-step reconstruction and multi-step refinement, the proposed framework further provides a way to trade off image quality against additional computational cost. Within comparable NFEs, our method achieves a new state-of-the-art in diffusion-based inverse problem solving, showcasing the significant potential of employing prior-based inverse problem solvers for real-world applications.
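The "hard measurement constraint via optimization" can be illustrated with the common linear model y = A(x) + n, as in the hedged sketch below; `A` is a problem-specific forward operator placeholder, and the optimizer and step counts are illustrative, not the paper's settings.

```python
# Sketch: refine a consistency-model prediction x0 by a few data-fidelity gradient steps.
import torch

def enforce_measurement(x0, y, A, n_iters=20, lr=0.1):
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = 0.5 * ((A(x) - y) ** 2).sum()      # ||A(x) - y||^2 data fidelity
        loss.backward()
        opt.step()
    return x.detach()                              # hard-constrained reconstruction
```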
Plug-and-Play Learned Proximal Trajectory for 3D Sparse-View X-Ray Computed Tomography
Romain Vo · Julie Escoda · Caroline Vienne · Etienne Decencière
Plug-and-Play (PnP) algorithms have recently emerged as a powerful framework for solving inverse problems in imaging. They leverage the power of Gaussian denoising algorithms to solve complex optimization problems. This work focuses on the challenging task of 3D sparse-view X-ray computed tomography (CT). We propose to replace the Gaussian denoising network in Plug-and-Play with a restoration network, i.e. a network trained to remove arbitrary artifacts. We show that using a restoration prior tailored to the specific inverse problem improves the performance of Plug-and-Play algorithms. However, we also show that plugging a basic restoration network into a PnP scheme is not sufficient to obtain good results. Thus, we propose a procedure to train the restoration network to be a robust approximation of a proximal operator along a pre-defined optimization trajectory. We demonstrate the effectiveness and scalability of our approach on two 3D Cone-Beam CT datasets and outperform state-of-the-art methods in terms of PSNR.
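The discussion above can be grounded with a generic PnP proximal-gradient loop, shown as a hedged sketch: `A`/`At` stand for the CT forward projector and its adjoint, and `restorer` is the learned network that replaces the Gaussian denoiser in the proximal step; the iteration count and step size are illustrative.

```python
# Generic Plug-and-Play iteration with a restoration network as the learned prior.
import torch

@torch.no_grad()
def pnp_restoration(y, A, At, restorer, n_iters=50, step=1e-3):
    x = At(y)                                     # back-projection as initialization
    for _ in range(n_iters):
        grad = At(A(x) - y)                       # gradient of 0.5 * ||A(x) - y||^2
        x = x - step * grad                       # data-consistency descent step
        x = restorer(x)                           # learned "proximal" / restoration step
    return x
```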
Unsupervised Multi-modal Medical Image Registration via Invertible Translation
Mengjie Guo
In medical imaging, the alignment of multi-modal images plays a critical role in providing comprehensive information for image-guided therapies. Despite its importance, multi-modal image registration poses significant challenges due to the complex and often unknown spatial relationships between different image modalities. To address this, we introduce a novel unsupervised translation-based multi-modal registration method, termed Invertible Neural Network-based Registration (INNReg). INNReg consists of an image-to-image translation network that converts multi-modal images into mono-modal counterparts and a registration network that uses the translated mono-modal images to align the multi-modal images. Specifically, to ensure the preservation of geometric consistency after image translation, we introduce an Invertible Neural Network (INN) that leverages a dynamic depthwise convolution-based local attention mechanism. Additionally, we design a novel barrier loss function based on Normalized Mutual Information to impose constraints on the registration network, which enhances the registration accuracy. The superior performance of INNReg is demonstrated through experiments on two public multi-modal medical image datasets, including MRI T1/T2 and MRI/CT pairs.
Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations
Ofir Shifman · Yair Weiss
Deep neural networks that achieve remarkable performance in image classification have previously been shown to be easily fooled by tiny transformations such as a one-pixel translation of the input image. To address this problem, two approaches have been proposed in recent years. The first suggests using huge datasets together with data augmentation in the hope that a highly varied training set will teach the network to be invariant. The second suggests architectural modifications based on sampling theory to deal explicitly with image translations. In this paper, we show that these approaches still fall short in robustly handling 'natural' image translations that simulate a subtle change in camera orientation. Our findings reveal that a mere one-pixel translation can result in a significant change in the predicted image representation for approximately 40\% of the test images in state-of-the-art models (e.g. open-CLIP trained on LAION-2B or DINO-v2), while models that are explicitly constructed to be robust to cyclic translations can still be fooled by 1-pixel realistic (non-cyclic) translations 11\% of the time. We present \textbf{R}obust \textbf{I}nference by \textbf{C}rop \textbf{S}election: a simple method that can be proven to achieve any desired level of consistency, albeit with a modest trade-off in the model's accuracy. Importantly, we demonstrate that employing this method reduces the ability to fool state-of-the-art models with a 1-pixel translation to less than 5\%, while suffering only a 1\% drop in classification accuracy. Additionally, we show that our method can be easily adjusted to deal with circular shifts as well, in which case we achieve 100\% robustness to integer shifts with \textit{state-of-the-art} accuracy and no need for any further training.
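To make the reported failure mode measurable, here is an illustrative consistency check (not the authors' crop-selection method): it approximates a realistic one-pixel translation by shifting a crop window inside a slightly larger image and reports the fraction of flipped top-1 predictions; the model interface and crop size are assumptions.

```python
# Fraction of images whose prediction changes under a one-pixel (non-cyclic) translation.
import torch

@torch.no_grad()
def one_pixel_flip_rate(model, images, crop=224):
    """images: (N, 3, H, W) with H, W > crop + 1; model: classifier returning logits."""
    base = images[:, :, :crop, :crop]                      # reference crop
    shifted = images[:, :, :crop, 1:crop + 1]              # same crop moved by 1 pixel
    p0 = model(base).argmax(dim=1)
    p1 = model(shifted).argmax(dim=1)
    return (p0 != p1).float().mean().item()                # fraction of flipped predictions
```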
ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization
Yixin Yang · Jiangxin Dong · Jinhui Tang · Jinshan Pan
Effectively exploring spatial-temporal features is important for video colorization. Instead of stacking multiple frames along the temporal dimension or recurrently propagating estimated features, which accumulates errors or fails to exploit information from far-apart frames, we develop a memory-based feature propagation module that can establish reliable connections with features from far-apart frames and alleviate the influence of inaccurately estimated features. To extract better features from each frame for the above-mentioned feature propagation, we explore features from large pre-trained visual models to guide the feature estimation of each frame so that the estimated features can model complex scenarios. In addition, we note that adjacent frames usually contain similar contents. To exploit this property for better spatial and temporal feature utilization, we develop a local attention module to aggregate the features from adjacent frames in a spatial-temporal neighborhood. We formulate our memory-based feature propagation module, large-pretrained visual model guided feature estimation module, and local attention module into an end-to-end trainable network (named ColorMNet) and show that it performs favorably against state-of-the-art methods on both benchmark datasets and real-world scenarios.
Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model
Qi Song · Ziyuan Luo · Ka Chun Cheung · Simon See · Renjie Wan
Neural Radiance Fields (NeRFs) have become a key method for 3D scene representation. With the rising prominence and influence of NeRF, safeguarding its intellectual property has become increasingly important. This paper introduces a plug-and-play method to protect NeRF's copyright during its creation. We propose utilizing a pre-trained watermarking base model, enabling NeRF creators to embed binary messages directly while creating their NeRF. Our plug-and-play property ensures that NeRF creators can flexibly choose NeRF variants without excessive modifications. Leveraging our newly designed progressive distillation, we demonstrate performance on par with several leading-edge methods. Our code will be released upon the acceptance of this paper.
Finding a needle in a haystack: A Black-Box Approach to Invisible Watermark Detection
Minzhou Pan · Zhenting Wang · Xin Dong · Vikash Sehwag · Lingjuan Lyu · Xue Lin
In this paper, we propose WaterMark Detection (WMD), the first invisible watermark detection method under a black-box and annotation-free setting. WMD is capable of detecting arbitrary watermarks within a given reference dataset using a clean non-watermarked dataset as a reference, without relying on specific decoding methods or prior knowledge of the watermarking techniques. We develop WMD using foundations of offset learning, where a clean non-watermarked dataset enables us to isolate the influence of only watermarked samples in the reference dataset. Our comprehensive evaluations demonstrate the effectiveness of WMD, significantly outperforming naive detection methods, which only yield AUC scores around 0.5. In contrast, WMD consistently achieves impressive detection AUC scores, surpassing 0.9 in most single-watermark datasets and exceeding 0.7 in more challenging multi-watermark scenarios across diverse datasets and watermarking methods. As invisible watermarks become increasingly prevalent, while specific decoding techniques remain undisclosed, our approach provides a versatile solution and establishes a path toward increasing accountability, transparency, and trust in our digital visual content.
CriSp: Leveraging Tread Depth Maps for Enhanced Crime-Scene Shoeprint Matching
Samia Shafique · Shu Kong · Charless Fowlkes
Shoeprints are a common type of evidence found at crime scenes and are used regularly in forensic investigations. However, existing methods cannot effectively employ deep learning techniques to match noisy and occluded crime-scene shoeprints to a shoe database due to a lack of training data. Moreover, all existing methods match crime-scene shoeprints to clean reference prints, yet our analysis shows matching to more informative tread depth maps yields better retrieval results. The matching task is further complicated by the necessity to identify similarities only in corresponding regions (heels, toes, etc) of prints and shoe treads. To overcome these challenges, we leverage shoe tread images from online retailers and utilize an off-the-shelf predictor to estimate depth maps and clean prints. Our method, named CriSp, matches crime-scene shoeprints to tread depth maps by training on this data. CriSp incorporates data augmentation to simulate crime-scene shoeprints, an encoder to learn spatially-aware features, and a masking module to ensure only visible regions of crime-scene prints affect retrieval results. To validate our approach, we introduce two validation sets by reprocessing existing datasets of crime-scene shoeprints and establish a benchmarking protocol for comparison. On this benchmark, CriSp significantly outperforms state-of-the-art methods in both automated shoeprint matching and image retrieval tailored to this task.
Noise-assisted Prompt Learning for Image Forgery Detection and Localization
Dong Li · Jiaying Zhu · Xueyang Fu · Xun Guo · Yidi Liu · Gang Yang · Jiawei Liu · Zheng-Jun Zha
We present CLIP-IFDL, a novel image forgery detection and localization (IFDL) model that harnesses the power of Contrastive Language Image Pre-Training (CLIP). However, directly incorporating CLIP in forgery detection poses challenges, given its lack of specific prompts and forgery consciousness. To overcome these challenges, we tailor the CLIP model for forgery detection and localization by leveraging a noise-assisted prompt learning framework. This framework comprises instance-aware dual-stream prompt learning and a forgery-enhanced noise adapter. We initially create a pair of learnable prompts as negative-positive samples in place of discrete prompts, then fine-tune these prompts based on each image's features and categories. Additionally, we constrain the text-image similarity between the prompts and their corresponding images to update the prompts. Moreover, we design a forgery-enhanced noise adapter that augments the image encoder's forgery perceptual ability via multi-domain fusion and zero linear layers. By doing so, our method not only extracts pertinent features but also benefits from the generalizability of the open-world CLIP prior. Comprehensive tests indicate that our method outperforms existing ones in terms of accuracy and generalizability while effectively reducing false alarms.
TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing
Xudong Wang · Ke-Yue Zhang · Taiping Yao · Qianyu Zhou · Shouhong Ding · Pingyang Dai · Rongrong Ji
Generalizable Face anti-spoofing (FAS) approaches have recently garnered considerable attention due to their robustness in unseen scenarios. Some recent methods incorporate vision-language models into FAS, leveraging their impressive pre-trained performance to improve the generalization. However, these methods only utilize coarse-grained or single-element prompts for fine-tuning FAS tasks, without fully exploring the potential of language supervision, leading to unsatisfactory generalization ability. To address these concerns, we propose a novel framework called TF-FAS, which aims to thoroughly explore and harness twofold-element fine-grained semantic guidance to enhance generalization. Specifically, the Content Element Decoupling Module (CEDM) is proposed to comprehensively explore the semantic elements related to content. It is subsequently employed to supervise the decoupling of categorical features from content-related features, thereby enhancing the generalization abilities. Moreover, recognizing the subtle differences within the data of each class in FAS, we present a Fine-Grained Categorical Element Module (FCEM) to explore fine-grained categorical element guidance, then adaptively integrate them to facilitate the distribution modeling for each class. Comprehensive experiments and analysis demonstrate the superiority of our method over state-of-the-art competitors.
Towards Certifiably Robust Face Recognition
Seunghun Paik · Dongsoo Kim · Chanwoo Hwang · Sunpill Kim · Jae Hong Seo
Adversarial perturbation is a severe threat to deep learning-based systems such as classification and recognition because it makes the system output wrong answers. Designing systems that are robust against adversarial perturbation in a \textit{certifiable} manner is important, especially for security-related systems such as face recognition. However, most studies of certifiable robustness concern classifiers, which have quite different characteristics from recognition systems for verification: the former are used in the closed-set scenario, whereas the latter are used in the open-set scenario. In this study, we show that, similar to image classification, the 1-Lipschitz condition is sufficient for certifiable robustness of a face recognition system. Furthermore, for a given pair of facial images, we derive the upper bound of adversarial perturbation within which the 1-Lipschitz face recognition system remains robust. Finally, we find that this theoretical result should be applied carefully in practice: applying such a training method to typical face recognition systems results in a very small upper bound for adversarial perturbation. We address this by proposing an alternative training method to attain a certifiably robust face recognition system with large upper bounds. All these theoretical results are supported by experiments on a proof-of-concept implementation. We release our source code to facilitate further study, which is available at \textcolor{red}{github}.
Oulu Remote-photoplethysmography Physical Domain Attacks Database (ORPDAD)
Marko Savic · Guoying Zhao
Remote photoplethysmography (rPPG) is an emerging technology that can detect the pulse rate remotely from face videos. However, it is easily influenced by the recording environment, and robustness to noise remains an open problem. This vulnerability can therefore be exploited to inject fake signals or impair predictions physically. In this study we propose the first dataset containing a wide set of physical domain attack scenarios divided into three categories (illumination, movement, concealment) that directly target the main weaknesses of rPPG. We propose the rPPG Physical Domain Attacks Database (RPDAD) as a benchmark for evaluating robustness to physical attacks. We perform extensive experiments on conventional hand-crafted and deep learning (end-to-end, non-end-to-end, CNN, transformer, self-supervised) methods and study their susceptibility to the attacks. We conclude by discussing the most critical vulnerabilities discovered and stress the importance of designing more secure solutions.
Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement
Haodong LI · Hao LU · Yingcong Chen
Remote photoplethysmography (rPPG) is gaining prominence for its non-invasive approach to monitoring physiological signals using only cameras. Despite its promise, the adaptability of rPPG models to new, unseen domains is hindered by the environmental sensitivity of physiological signals. To address this issue, we pioneer Test-Time Adaptation (TTA) in rPPG, enabling the adaptation of pre-trained models to the target domain during inference and sidestepping the need for annotations or source data, which may be unavailable due to privacy considerations. In particular, using only the user's face video stream as the accessible target domain data, the rPPG model is adjusted by tuning on each single instance it encounters. However, 1) TTA algorithms are designed predominantly for classification tasks and are ill-suited to regression tasks such as rPPG due to inadequate supervision, and 2) tuning pre-trained models in a single-instance manner introduces variability and instability, posing challenges to effectively filtering domain-relevant from domain-irrelevant features while simultaneously preserving the learned information. To overcome these challenges, we present \textbf{Bi-TTA}, a novel domain-knowledge-based \textbf{Bi}directional \textbf{T}est-\textbf{T}ime \textbf{A}dapter framework. Specifically, leveraging two expert-knowledge priors for providing self-supervision, our Bi-TTA primarily comprises two modules: a prospective adaptation (PA) module using sharpness-aware minimization to eliminate domain-irrelevant noise, enhancing the stability and efficacy of the adaptation process, and a retrospective stabilization (RS) module to dynamically reinforce crucial learned model parameters, averting performance degradation caused by overfitting or catastrophic forgetting. Additionally, we establish a large-scale benchmark for rPPG tasks under the TTA protocol, promoting advancements in both the rPPG and TTA fields. The experimental results demonstrate the significant superiority of our approach over the state-of-the-art (SoTA).
Affine steerers for structured keypoint description
Georg Bökman · Johan Edstedt · Michael Felsberg · Fredrik Kahl
We propose a way to train deep learning based keypoint descriptors to make them approximately equivariant under transformations of the image plane that are locally affine. The main idea is to use the representation theory of GL(2) to generalize the recently introduced concept of steerers from rotations to affine transformations. Affine steerers give high control over how keypoint descriptions transform under image transformations. We demonstrate the potential of using this high control for image matching. Finally, we propose a way to finetune keypoint descriptors with a set of steerers on upright images and obtain state-of-the-art results on several standard benchmarks.
A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation
Riccardo Fogliato · Pratik Patil · Mathew Monfort · Pietro Perona
Model performance evaluation is a critical and expensive task in machine learning and computer vision. Without clear guidelines, practitioners often rely on a one-time random selection of data for model evaluation. However, by selecting the data strategically, costs can be reduced and estimation accuracy can be improved. In this paper, we propose a statistical framework for efficient model evaluation that includes stratification, sampling design, and estimation components. We examine the statistical properties of each component and evaluate their optimality. One key result of our work is that stratification via $k$-means clustering on accurate predictions of model performance leads to highly efficient estimators. Our experiments on computer vision datasets demonstrate that accuracy estimates obtained via stratified sampling designs consistently and significantly outperform those obtained through simple random sampling, with gains of up to 10x. Furthermore, we find that model-assisted estimators, which leverage predictions of model performance, are often more efficient than the commonly used naive empirical average of the errors.
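A minimal, hedged version of the recipe above is sketched below: stratify the pool by k-means on a proxy score for per-example correctness, sample within strata, and form the stratified accuracy estimate. The proxy scores, allocation, and synthetic data are illustrative assumptions, not the paper's exact design.

```python
# Stratified sampling estimator of model accuracy with k-means stratification.
import numpy as np
from sklearn.cluster import KMeans

def stratified_accuracy_estimate(pred_scores, true_correct, n_strata=5, per_stratum=20, seed=0):
    """pred_scores: (N,) predicted prob. of being correct; true_correct: (N,) 0/1 outcomes
    (in practice, only the sampled entries would actually be annotated)."""
    rng = np.random.default_rng(seed)
    strata = KMeans(n_clusters=n_strata, n_init=10, random_state=seed).fit_predict(
        pred_scores.reshape(-1, 1))
    estimate, N = 0.0, len(pred_scores)
    for s in range(n_strata):
        idx = np.flatnonzero(strata == s)
        sample = rng.choice(idx, size=min(per_stratum, len(idx)), replace=False)
        estimate += (len(idx) / N) * true_correct[sample].mean()   # weight by stratum size
    return estimate

# Usage with synthetic data:
scores = np.random.rand(10_000)
correct = (np.random.rand(10_000) < scores).astype(float)
print(stratified_accuracy_estimate(scores, correct))
```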
You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
Sheng Jin · Shuhuai Li · Tong Li · Wentao Liu · Chen Qian · Ping Luo
Human-centric perception (\eg detection, segmentation, pose estimation, and attribute analysis) is a long-standing problem for computer vision. This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). Our approach centers on learning a unified human query representation, denoted as Human Query, which captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios. Although different HCP tasks have been well-studied individually, single-stage multi-task learning of HCP tasks has not been fully exploited in the literature due to the absence of a comprehensive benchmark dataset. To address this gap, we propose COCO-UniHuman benchmark to enable model development and comprehensive evaluation. Experimental results demonstrate the proposed method's state-of-the-art performance among multi-task HCP models and its competitive performance compared to task-specific HCP models. Moreover, our experiments underscore Human Query's adaptability to new HCP tasks, thus demonstrating its robust generalization capability. Codes and data are available at \url{https://github.com/lishuhuai527/COCO-UniHuman}.
TAPTR: Tracking Any Point with Transformers as Detection
Hongyang Li · Hao Zhang · Shilong Liu · Zhaoyang Zeng · Tianhe Ren · Feng Li · Lei Zhang
In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformer (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, each tracking point is represented as a DETR query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is naturally updated layer by layer. Its visibility is predicted by its updated content feature. Queries belonging to the same tracking point can exchange information through temporal self-attention. As all such operations are well-designed in DETR-like algorithms, the model is conceptually very simple. We also adopt some useful designs such as cost volume from optical flow models and develop simple designs to mitigate the feature drifting issue. Our framework achieves state-of-the-art performance on various TAP datasets with faster inference speed.
SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow
Orcun Cetintas · Tim Meinhardt · Guillem Brasó · Laura Leal-Taixe
Increasing the annotation efficiency of trajectory annotations from videos has the potential to enable the next generation of data-hungry tracking algorithms to thrive on large-scale datasets. Despite the importance of this task, there are currently very few works exploring how to efficiently label tracking datasets in a comprehensive manner. We introduce SPAM, a tracking data engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights: i) most tracking scenarios can be easily resolved. To take advantage of this, we utilize a pre-trained model to generate high-quality pseudo-labels, reserving human involvement for a smaller subset of more difficult instances; ii) handling the spatiotemporal dependencies of track annotations across time can be elegantly and efficiently formulated through graphs. Hence, we use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations with a fraction of ground truth labeling cost. We demonstrate that trackers trained on SPAM labels achieve comparable performance to those trained on human annotations while requiring only 3-20% of the human labeling effort. Hence, SPAM paves the way towards highly annotation-efficient large-scale tracking datasets. Our code and models will be available upon acceptance.
Towards Physical World Backdoor Attacks against Skeleton Action Recognition
Qichen Zheng · Yi Yu · SIYUAN YANG · Jun Liu · Kwok-Yan Lam · Alex Kot
Skeleton Action Recognition (SAR) has attracted significant interest for its efficient representation of the human skeletal structure. Despite its advancements, recent studies have raised security concerns about SAR models, particularly their vulnerability to adversarial attacks. However, such strategies are limited to digital scenarios and are ineffective as physical attacks, limiting their real-world applicability. To investigate the vulnerabilities of SAR in the physical world, we introduce Physical Skeleton Backdoor Attacks (PSBA), the first exploration of physical backdoor attacks against SAR. Considering the practicalities of physical execution, we introduce a novel trigger implantation method that integrates infrequent and imperceivable actions as triggers into the original skeleton data. By incorporating a minimal amount of this manipulated data into the training set, PSBA enables the system to misclassify any skeleton sequence into the target class when the trigger action is present. We examine the resilience of PSBA in both poisoned- and clean-label scenarios, demonstrating its efficacy across a range of datasets, poisoning ratios, and model architectures. Additionally, we introduce a trigger-enhancing strategy to strengthen attack performance in the clean-label setting. The robustness of PSBA is tested against three distinct backdoor defenses, and the stealthiness of PSBA is evaluated using two quantitative metrics. Furthermore, by employing a Kinect V2 camera, we compile a dataset of human actions from the real world to mimic physical attack situations, with our findings confirming the effectiveness of our proposed attacks.
MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion
Lehong Wu · Lilang Lin · Jiahang Zhang · Yiyang Ma · Jiaying Liu
Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning, which suffers from the false negative problem, or are based on reconstruction, which learns too many unessential low-level cues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task for modeling the general underlying data distribution. However, the representation learning capacity of generative models is under-explored, especially for skeletons with spatial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to the encoder inputs to introduce an information bottleneck and remove the redundancy of skeletons. Furthermore, we theoretically demonstrate that our generative objective involves the contrastive learning objective, which aligns the masked and noisy views. Meanwhile, it also enforces the representation to complement the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while maintaining competence for generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing fine-tuning performance in scenarios with scarce labeled data.
Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph
Zhengcen Li · Xinle Chang · Yueran Li · Jingyong Su
Group Activity Recognition (GAR) aims to understand collective activities from videos. Existing solutions for GAR primarily rely on the RGB modality, which encounters challenges such as background variations, occlusions, motion blurs, and significant computational overhead, particularly in dynamic scenarios like sporting events. Meanwhile, current keypoint-based methods offer a lightweight and informative representation of human motions but necessitate accurate individual annotations and specialized interaction reasoning modules. To address these limitations, we design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity, offering an effective alternative to RGB video. This panoramic graph enables Graph Convolutional Network (GCN) to unify intra-person, inter-person, and person-object interactive modeling through spatial-temporal graph convolutions. In practice, we develop a novel pipeline that extracts skeleton coordinates using pose estimation and tracking algorithms and employ Multi-person Panoramic GCN (MP-GCN) to predict group activities. Extensive experiments on Volleyball and NBA datasets demonstrate that the MP-GCN achieves state-of-the-art performance in both accuracy and efficiency. Notably, our method outperforms video-based approaches by using only estimated 2D keypoints as input.
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection
Le Yang · Ziwei Zheng · Yizeng Han · Hao Cheng · Shiji Song · Gao Huang · Fan Li
Recent neural network-based Temporal Action Detection (TAD) models are inherently limited in extracting discriminative representations and modeling action instances of various lengths from complex scenes, owing to their shared-weight detection heads. Inspired by recent successes in dynamic neural networks, in this paper we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action timestamps and guarantees the discriminability of the extracted representations. Moreover, using DFA helps to develop a Dynamic TAD head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters and learned receptive fields to better detect action instances with diverse ranges in complex scenes. With the proposed encoder layers and DyHead, the new dynamic TAD model, DyFADet, achieves promising performance on a series of challenging TAD benchmarks, including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen~100, Ego4D-Moment QueriesV1.0, and FineAction. The code is released at \url{https://Anonymous}.
Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization
Feixiang Zhou · Bryan M. Williams · Hossein Rahmani
Alleviating noisy pseudo labels remains a key challenge in Semi-Supervised Temporal Action Localization (SS-TAL). Existing methods often filter pseudo labels based on strict conditions, but they typically assess classification and localization quality separately, leading to suboptimal pseudo-label ranking and selection. In particular, there might be inaccurate pseudo labels within selected positives, alongside reliable counterparts erroneously assigned to negatives. To tackle these problems, we propose a novel Adaptive Pseudo-label Learning (APL) framework to facilitate better pseudo-label selection. Specifically, to improve the ranking quality, Adaptive Label Quality Assessment (ALQA) is proposed to jointly learn classification confidence and localization reliability, followed by dynamically selecting pseudo labels based on the joint score. Additionally, we propose an Instance-level Consistency Discriminator (ICD) for eliminating ambiguous positives and mining potential positives simultaneously based on inter-instance intrinsic consistency, thereby leading to a more precise selection. We further introduce a general unsupervised Action-aware Contrastive Pre-training (ACP) to enhance the discrimination both within actions and between actions and backgrounds, which benefits SS-TAL. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate that our method achieves state-of-the-art performance under various semi-supervised settings. All source code will be made publicly available.
Two-Stage Active Learning for Efficient Temporal Action Segmentation
Yuhao Su · Ehsan Elhamifar
Training a temporal action segmentation (TAS) model on long and untrimmed videos requires gathering framewise video annotations, which is very costly. We propose a two-stage active learning framework to efficiently learn a TAS model using only a small amount of video annotations. Our framework consists of three components that work together in each active learning iteration. 1) Using current labeled frames, we learn a TAS model and action prototypes using a novel contrastive learning method. Leveraging prototypes not only enhances the model performance, but also increases the computational efficiency of both video and frame selection for labeling, which are the next components of our framework. 2) Using the currently learned TAS model and action prototypes, we select informative unlabeled videos for annotation. To do so, we find unlabeled videos that have low alignment scores to learned action prototype sequences in labeled videos. 3) To annotate a small subset of informative frames in each selected unlabeled video, we propose a video-aligned summary selection method and an efficient greedy search algorithm. By evaluation on four benchmark datasets (50Salads, GTEA, Breakfast, CrossTask), we show that our method significantly reduces the annotation costs, while consistently surpassing baselines. We further extend our framework to a semi-supervised active learning setting. We show that we can obtain the full supervision performance using only a small amount of labeled frames. To the best of our knowledge, this is the first work studying active learning for TAS.
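The video-selection step (component 2) can be sketched under simplifying assumptions: score each unlabeled video by how well its framewise features align to the learned action prototypes, here reduced to nearest-prototype cosine similarity rather than the paper's sequence alignment, and label the least-aligned videos first.

```python
# Illustrative prototype-alignment scoring for selecting informative unlabeled videos.
import torch
import torch.nn.functional as F

def select_videos(video_features, prototypes, budget=5):
    """video_features: list of (T_i, D) framewise feature tensors; prototypes: (P, D)."""
    protos = F.normalize(prototypes, dim=1)
    scores = []
    for feats in video_features:
        f = F.normalize(feats, dim=1)
        sim = f @ protos.T                                  # (T_i, P) cosine similarities
        scores.append(sim.max(dim=1).values.mean().item())  # mean best-prototype alignment
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:budget]                                   # least-aligned videos first
```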
Embodied agents must detect and localize objects of interest, e.g., traffic participants for self-driving cars. Supervision in the form of bounding boxes for this task is extremely expensive. As such, prior work has looked at unsupervised object segmentation but, in the absence of annotated boxes, it is unclear how pixels must be grouped into objects and which objects are of interest. This results in over-/under-segmentation and irrelevant objects. Inspired both by the human visual system and by practical applications, we posit that the key missing cue is motion: objects of interest are typically mobile objects. We propose a new approach that learns to detect Mobile Objects from Videos for Embodied agents (MOVE). We begin with pseudo-labels derived from motion segmentation, but introduce a novel training paradigm to progressively discover small objects and static-but-mobile objects that are missed by motion segmentation. As a result, though only learned from unlabeled videos, MOVE can detect and segment mobile objects from a single static image. Empirically, we achieve state-of-the-art performance in unsupervised mobile object detection on Waymo Open, nuScenes, and KITTI Dataset without using any external data or models.
PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation
Shilin Yan · Xiaohao Xu · Renrui Zhang · Lingyi Hong · wenchao chen · Wenqiang Zhang · Wei Zhang
Panoramic videos contain richer spatial information and have attracted tremendous amounts of attention due to their exceptional experience in some fields such as autonomous driving and virtual reality. However, existing datasets for video segmentation only focus on conventional planar images. To address the challenge, in this paper, we present a panoramic video dataset, \textit{i.e.}, PanoVOS. The dataset provides 150 videos with high video resolutions and diverse motions. To quantify the domain gap between 2D planar videos and panoramic videos, we evaluate 15 off-the-shelf video object segmentation (VOS) models on PanoVOS. Through error analysis, we found that all of them fail to tackle pixel-level content discontinuities of panoramic videos. Thus, we present a Panoramic Space Consistency Transformer (PSCFormer), which can effectively utilize the semantic boundary information of the previous frame for pixel-level matching with the current frame. Extensive experiments demonstrate that compared with the previous SOTA models, our PSCFormer network exhibits a great advantage in terms of segmentation results under the panoramic setting. Our dataset poses new challenges in panoramic VOS and we hope that our PanoVOS can advance the development of panoramic segmentation/tracking. The dataset, codes, and pre-trained models will be made publicly available.
VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-temporal Side Network
Zhixue Fang · Yuzhi Liu · Huisi Wu · Jing Qin
We propose a novel model (VP-SAM) adapted from segment anything model (SAM) for video polyp segmentation (VPS), which is a very challenging task due to (1) the low contrast between polyps and background and (2) the large frame-to-frame variations of polyp size, position, and shape. Our aim is to take advantage of the powerful representation capability of SAM while enabling SAM to effectively harness temporal information of colonoscopic videos and disentangle polyps from background with quite similar appearances. To achieve this, we propose two new techniques. First, we propose a new semantic disentanglement adapter (SDA) by exploiting amplitude information of the Fourier spectrum to facilitate SAM in more effectively differentiating polyps from background. Second, we propose an innovative spatio-temporal side network (STSN) to provide SAM with spatio-temporal information of videos, thus facilitating SAM in effectively tracking the motion status of polyps. Extensive experiments on SUN-SEG, CVC-612, and CVC-300 demonstrate that our method significantly outperforms state-of-the-art methods. While this work focuses on colonoscopic videos, the proposed method is general enough to be used to analyze other medical videos with similar challenges. Codes will be released upon publication.
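The SDA described above builds on amplitude information from the Fourier spectrum. As a minimal sketch of that basic cue only (not the adapter design itself), one can extract the amplitude of a feature map's 2D spectrum as follows; names and shapes are illustrative.

```python
import torch

def fourier_amplitude(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) feature map. Returns the amplitude (magnitude) of its 2D Fourier
    spectrum, a simple stand-in for the amplitude cue the SDA adapter builds on."""
    spec = torch.fft.fft2(feat, norm="ortho")
    return spec.abs()

amp = fourier_amplitude(torch.randn(1, 64, 32, 32))   # -> (1, 64, 32, 32)
```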
PALM: Predicting Actions through Language Models
Sanghwan Kim · Daoji Huang · Yongqin Xian · Otmar Hilliges · Luc Van Gool · Xi Wang
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. Traditional methods heavily rely on representation learning that is trained on a large amount of video data. However, a major challenge arises from the difficulty of obtaining effective video representation. This difficulty stems from the complex and variable nature of human activities, which contrasts with the limited availability of data. In this study, we introduce PALM, an approach that excels in tackling the complex challenges associated with long-term video understanding without the need for extensive training. Specifically, we focus on the task of long-term action anticipation, which aims to forecast forthcoming sequences of actions over an extended period. Our method PALM incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. By leveraging the context provided by these past events, we devise a prompting strategy for action anticipation using large language models (LLMs). Moreover, we implement maximal marginal relevance for example selection to facilitate in-context learning of the LLMs. Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation on the Ego4D benchmark. We further validate PALM on two additional benchmarks, affirming its capacity for generalization across intricate activities with different sets of taxonomies.
ZeroI2V: Zero-Cost Adaptation of Pre-Trained Transformers from Image to Video
Xinhao Li · Yuhan Zhu · Limin Wang
Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks. Due to the huge number of parameters and effective transferability of image models, performing full fine-tuning is less efficient and even unnecessary. Thus, recent research is shifting its focus toward parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational costs to deal with the domain gap and temporal modeling in videos. In this paper, we present a new adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks (i.e., introducing zero extra cost to the original models during inference). To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA). This approach efficiently endows the image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that utilizes lightweight, densely placed linear adapters to fully transfer the frozen image models to video recognition. Thanks to the customized linear design, all newly added adapters can be easily merged with the original modules through structural reparameterization after training, enabling zero extra cost during inference. Extensive experiments on five representative video recognition benchmarks showcase that ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
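The zero-inference-cost claim above rests on structural reparameterization: a linear adapter composed with a frozen linear layer can be folded into a single linear layer after training. The sketch below illustrates this folding for the simple case of an adapter applied after a frozen layer; it is a hedged illustration under that assumption, not the paper's released code.

```python
import torch
import torch.nn as nn

def merge_linear_adapter(frozen: nn.Linear, adapter: nn.Linear) -> nn.Linear:
    """Fold y = adapter(frozen(x)) into a single nn.Linear (zero extra inference cost)."""
    merged = nn.Linear(frozen.in_features, adapter.out_features, bias=True)
    with torch.no_grad():
        # A(Wx + b_w) + b_a  =  (A W) x + (A b_w + b_a)
        merged.weight.copy_(adapter.weight @ frozen.weight)
        bias = adapter.bias.clone()
        if frozen.bias is not None:
            bias += adapter.weight @ frozen.bias
        merged.bias.copy_(bias)
    return merged

# quick check that the merged layer matches the two-layer composition
frozen, adapter = nn.Linear(768, 768), nn.Linear(768, 768)
x = torch.randn(4, 768)
merged = merge_linear_adapter(frozen, adapter)
assert torch.allclose(merged(x), adapter(frozen(x)), atol=1e-4)
```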
Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
Shufan Li · Aditya Grover · Harkanwar Singh
In recent years, Transformers have become the de-facto architecture for sequence modeling on text and a variety of multi-dimensional data, such as images and video. However, the use of self-attention layers in a Transformer incurs prohibitive compute and memory complexity that scales quadratically w.r.t. the sequence length. A recent architecture, Mamba, based on state space models has been shown to achieve comparable performance for modeling text sequences, while scaling linearly with the sequence length. In this work, we present Mamba-ND, a generalized design extending the Mamba architecture to arbitrary multi-dimensional data. Our design alternately unravels the input data across different dimensions following row-major orderings. We provide a systematic comparison of Mamba-ND with several other alternatives, based on prior multi-dimensional extensions such as Bi-directional LSTMs and S4ND. Empirically, we show that Mamba-ND demonstrates performance competitive with the state-of-the-art on a variety of multi-dimensional benchmarks, including ImageNet-1K classification, HMDB-51 action recognition, and ERA5 weather forecasting.
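As an illustration of the alternating unraveling described above, the sketch below flattens a multi-dimensional token grid into 1D sequences along different axis orderings; successive layers would consume different orderings in turn. The helper name and toy shapes are illustrative, not taken from the paper.

```python
import torch

def unravel(x: torch.Tensor, order: tuple) -> torch.Tensor:
    """Flatten a (B, D1, ..., Dk, C) tensor to (B, L, C), scanning axes in `order` (row-major)."""
    b, *dims, c = x.shape
    # move the chosen axis order to the front (after batch), then flatten
    perm = (0, *[d + 1 for d in order], x.dim() - 1)
    return x.permute(*perm).reshape(b, -1, c)

# toy 2D example: a 4x3 grid of 8-dim tokens
x = torch.randn(2, 4, 3, 8)
seq_hw = unravel(x, (0, 1))  # scan rows first    -> (2, 12, 8)
seq_wh = unravel(x, (1, 0))  # scan columns first -> (2, 12, 8)
# alternating layers would consume seq_hw, seq_wh, seq_hw, ... in turn
```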
VideoMamba: Spatio-Temporal Selective State Space Model
Jinyoung Park · Hee-Seon Kim · Kangwook Ko · Minbeom Kim · Changick Kim
We introduce VideoMamba, a novel adaptation of the pure Mamba architecture, specifically designed for video recognition. Unlike transformers that rely on self-attention mechanisms, leading to high computational costs due to quadratic complexity, VideoMamba leverages Mamba's linear complexity and selective SSM mechanism for more efficient processing. The proposed Spatio-Temporal Forward and Backward SSM allows the model to effectively capture the complex relationship between non-sequential spatial and sequential temporal information in video. Consequently, VideoMamba is not only resource-efficient but also effective in capturing long-range dependency in videos, demonstrated by competitive performance and outstanding efficiency on a variety of video understanding benchmarks. Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
Text-Guided Video Masked Autoencoder
David Fan · Jue Wang · Shuai Liao · Zhikang Zhang · Vimal Bhat · Xinyu Li
Recent video masked autoencoder (MAE) works have designed improved masking algorithms focused on saliency. These works leverage visual cues such as motion to mask the most salient regions. However, the robustness of visual cues depends on how often input videos match underlying statistical assumptions. On the other hand, natural language description is an information-dense representation of video that implicitly captures saliency without requiring modality-specific assumptions, and has not been explored yet for video MAE. To this end, we introduce a novel text-guided masking strategy (TGM) that masks the video regions with highest correspondence to paired captions. Without leveraging any explicit visual cues for saliency, our text-guided masking is competitive with state-of-the-art masking algorithms such as motion-guided masking. To further benefit from the semantics of natural language for masked reconstruction, we next introduce a unified framework for joint MAE and masked video-text contrastive learning. We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE on a variety of video recognition tasks, especially for linear probe. When our TGM is combined within this unified framework, we achieve the best relative performance on five action recognition datasets and one egocentric dataset, highlighting the complementary nature of natural language captions for masked video modeling.
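A minimal sketch of the text-guided masking idea, assuming patch and caption embeddings are already computed by some video and text encoders: mask the patches whose features correspond most strongly to the paired caption. The function name, tensor shapes, and masking ratio are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def text_guided_mask(patch_feats: torch.Tensor, text_feat: torch.Tensor, mask_ratio: float = 0.75):
    """patch_feats: (B, N, D) video patch embeddings; text_feat: (B, D) caption embedding.
    Returns a boolean mask (B, N) that is True for the masked (most caption-aligned) patches."""
    sim = F.cosine_similarity(patch_feats, text_feat.unsqueeze(1), dim=-1)  # (B, N)
    num_mask = int(mask_ratio * patch_feats.size(1))
    idx = sim.topk(num_mask, dim=1).indices                                 # most caption-aligned patches
    return torch.zeros_like(sim, dtype=torch.bool).scatter_(1, idx, True)

patch_feats, text_feat = torch.randn(2, 196, 512), torch.randn(2, 512)
mask = text_guided_mask(patch_feats, text_feat)   # (2, 196), 147 masked entries per row
```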
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
Xuelu Feng · Dongdong Chen · Junsong Yuan · Chunming Qiao · Gang Hua · Zixin Zhu
In this paper, we explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned from a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed ``VD-IT'', tailored with dedicatedly designed components built upon a fixed pretrained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. Besides, instead of using the standard Gaussian noise, we propose to predict the video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and elevate segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pretrained with discriminative image/video pre-tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
VISA: Reasoning Video Object Segmentation via Large Language Model
Cilin Yan · haochen wang · Shilin Yan · Xiaolong Jiang · Yao Hu · Guoliang Kang · Weidi Xie · Efstratios Gavves
Existing Video Object Segmentation (VOS) methods rely on explicit user instructions, such as categories, masks, or short phrases, restricting their ability to perform complex video segmentation requiring reasoning with world knowledge. In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities based on world knowledge and video contexts, which is crucial for structured environment understanding and object-centric interactions, pivotal in the development of embodied AI. To tackle ReasonVOS, we introduce VISA (Video-based large language Instructed Segmentation Assistant) to leverage the world knowledge reasoning capabilities of multi-modal LLMs while possessing the ability to segment and track objects in videos with a mask decoder. Moreover, we establish a comprehensive benchmark consisting of 12,709 instruction-mask sequence pairs from 1,038 diverse videos, which incorporates complex world knowledge reasoning into segmentation tasks for instruction-tuning and evaluation purposes of ReasonVOS models. Experiments conducted on 8 datasets demonstrate the effectiveness of VISA in tackling complex reasoning segmentation and vanilla referring segmentation in both video and image domains. The code and dataset are available at https://anonymous.4open.science/r/VISA-36D6.
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li · Chengyao Wang · Jiaya Jia
In this work, we present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely a context token and a content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. Generally, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is demonstrated to surpass previous methods on most video- or image-based benchmarks. Code and models will be released to the public.
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos
Pilhyeon Lee · Hyeran Byun
Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches achieved notable progress by predicting the center and length of a target moment. However, they suffer from the issue of center misalignment raised by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we propose a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center but instead suffices to predict any anchor point within the interval, from which the boundaries are directly estimated. Based on this idea, we design a boundary-aligned moment detection transformer, equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization qualities are prioritized over incomplete ones. Experiments on three benchmarks validate the effectiveness of the proposed methods. The code will be publicly available.
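The boundary-oriented formulation above can be summarized in a few lines: given any anchor point inside a moment and predicted distances to the two boundaries, the moment is decoded directly, with no center estimate involved. The sketch below is a schematic rendering of that decoding step under assumed normalized coordinates, not the authors' implementation.

```python
import torch

def decode_moments(anchor: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """anchor: (N,) normalized anchor points inside each moment; offsets: (N, 2) predicted
    distances from the anchor to the start and end boundaries. Returns (N, 2) [start, end]."""
    start = (anchor - offsets[:, 0]).clamp(0, 1)
    end = (anchor + offsets[:, 1]).clamp(0, 1)
    return torch.stack([start, end], dim=-1)

moments = decode_moments(torch.tensor([0.3, 0.7]), torch.tensor([[0.1, 0.2], [0.25, 0.1]]))
# -> [[0.20, 0.50], [0.45, 0.80]]
```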
COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark
Atsushi Hashimoto · Koki Maeda · Tosho Hirasawa · Jun Harashima · Leszek Rybicki · Yusuke Fukasawa · Yoshitaka Ushiku
Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data. Consequently, existing works often use web videos as training resources, making it challenging to query contents from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants performed food preparation based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. With this dataset, we propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.
Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo · Pedro Morgado
Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes a single supervised text audio-visual contrastive loss to learn an alignment between audio-visual and textual modalities, moving away from the conventional approach of reconstructing cross-modal features and text embeddings. Our key insight is that while class name embeddings are well aligned with language-based audio-visual features, they don't provide sufficient class separation to be useful for zero-shot learning. To address this, our method leverages differential optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of language representations. We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks. Our results demonstrate that our EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning.
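One plausible reading of the single supervised contrastive objective above is a CLIP-style cross-entropy over cosine similarities between fused audio-visual embeddings and (transformed) class-text embeddings, sketched below. The exact loss, temperature, and embedding transforms used in the paper may differ; all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def av_text_contrastive_loss(av_emb, class_emb, labels, temperature=0.07):
    """av_emb: (B, D) fused audio-visual embeddings; class_emb: (C, D) transformed class-text
    embeddings; labels: (B,) class indices. A CLIP-style supervised contrastive objective."""
    av = F.normalize(av_emb, dim=-1)
    txt = F.normalize(class_emb, dim=-1)
    logits = av @ txt.t() / temperature        # (B, C) scaled cosine similarities
    return F.cross_entropy(logits, labels)

loss = av_text_contrastive_loss(torch.randn(8, 512), torch.randn(20, 512), torch.randint(0, 20, (8,)))
```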
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury · Sayan Nag · Subhrajyoti Dasgupta · jun chen · Mohamed Elhoseiny · Ruohan Gao · Dinesh Manocha
Leveraging Large Language Models’ remarkable proficiency in text-based tasks, recent works on Multimodal-LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred visual grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
SignGen: End-to-End Sign Language Video Generation with Latent Diffusion
Fan Qi · Yu Duan · Changsheng Xu · Huaiwen Zhang
The seamless transformation of textual input into natural and expressive sign language holds profound societal significance. Sign language is not solely about hand gestures. It encompasses vital facial expressions and mouth movements essential for nuanced communication. Achieving both semantic precision and emotional resonance in text-to-sign language translation is of paramount importance. Our work pioneers direct end-to-end translation of text into sign videos, encompassing a realistic representation of the entire body and facial expressions. We go beyond traditional diffusion models by tailoring the multimodal conditions for sign video generation. Additionally, our modified motion-aware sign generation framework enhances alignment between text and visual cues in sign language, further improving the quality of the generated sign video. Extensive experiments show that our approach significantly outperforms the state-of-the-art approaches in terms of semantic consistency, naturalness, and expressiveness, presenting benchmark quantitative results on the RWTH-2014, RWTH-2014-T, WLASL, CSL-Daily, and AUTSL datasets.
TrajPrompt: Aligning Color Trajectory with Vision-Language Representations
Li-Wu Tsao · Hao-Tang Tsui · Yu-Rou Tuan · Pei-Chi Chen · Kuan-Lin Wang · Jhih-Ciang Wu · Hong-Han Shuai · Wen-Huang Cheng
Cross-modal learning has shown promising potential to overcome the limitations of single-modality tasks. However, without a proper design of representation alignment between different data sources, the external modality has no way to exhibit its value. We find that recent trajectory prediction approaches use the Bird's-Eye-View (BEV) scene as an additional source, but do not significantly improve the performance compared to the single-source strategies. This indicates that the representations of the BEV scene and the trajectory are not effectively combined. To overcome this problem, we propose TrajPrompt, a prompt-based approach that seamlessly incorporates trajectory representation into the vision-language framework, i.e. CLIP, for BEV scene understanding and future forecasting. We discover that CLIP can attend to the local area of the BEV scene by utilizing our innovative design of text prompts and colored lines. Comprehensive results demonstrate TrajPrompt's effectiveness via outperforming the state-of-the-art trajectory predictors by a significant margin (over 35% improvement for ADE and FDE metrics on the SDD and DroneCrowd datasets), using fewer learnable parameters than the previous trajectory modeling approaches with scene information included.
Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification
Chenyue Li · Shuoyi Chen · Mang Ye
Wildlife ReID involves utilizing visual technology to identify specific individuals of wild animals in different scenarios, holding significant importance for wildlife conservation, ecological research, and environmental monitoring. Existing wildlife ReID methods are predominantly tailored to specific species, exhibiting limited applicability. Although some approaches leverage extensively studied person ReID techniques, they struggle to address the unique challenges posed by wildlife. Therefore, in this paper, we present a unified, multi-species general framework for wildlife ReID. Given that high-frequency information is a consistent representation of unique features in various species, significantly aiding in identifying contours and details such as fur textures, we propose the Adaptive High-Frequency Transformer model with the goal of enhancing high-frequency information learning. To mitigate the inevitable high-frequency interference in the wilderness environment, we introduce an object-aware high-frequency selection strategy to adaptively capture more valuable high-frequency components. Notably, we unify the experimental settings of multiple wildlife datasets for ReID and evaluate our model on diverse wildlife datasets, achieving performance superiority over state-of-the-art ReID methods. In domain generalization scenarios, our approach demonstrates robust generalization to unknown species.
OmniSat: Self-Supervised Modality Fusion for Earth Observation
Guillaume Astruc · Nicolas Gonthier · Clement Mallet · Loic Landrieu
Earth Observation (EO) presents a unique opportunity to explore self-supervised multimodal learning, given its access to vast and diverse data captured by various sensors. However, current multimodal EO datasets and models often consider modalities from a single data type, either mono-date images or time series, which limits their expressivity. We introduce OmniSat, a novel architecture that exploits the natural alignment between multiple EO modalities to learn expressive multimodal representations without labels. We augment an existing dataset with new modalities to demonstrate the advantages of combining modalities of different natures. We evaluate OmniSat and various state-of-the-art approaches on two relevant downstream tasks: forestry and land cover classification. Our results show that OmniSat can learn rich representations in an unsupervised manner, leading to performance improvements in the semi- and fully-supervised settings, even when only one modality is available at inference. Our code, weights, and dataset are available at https://github.com/gastruc/OmniSat.
Statewide Visual Geolocalization in the Wild
Florian Fervers · Sebastian Bullinger · Christoph Bodensteiner · Michael Arens · Rainer Stiefelhagen
This work presents a method that is able to predict the geolocation of a street-view photo taken in the wild within a state-sized search region by matching it against a database of aerial reference imagery. We partition the search region into geographical cells and train a model to map cells and corresponding photos into a joint embedding space that is used to perform retrieval at test time. The model utilizes aerial images for each cell at multiple levels-of-detail to provide sufficient context for photos with limited field of view. We propose a novel layout of the search region with consistent cell resolutions that allows scaling to large geographical regions. Experiments demonstrate that the method successfully localizes 60.6% of all non-panoramic street-view photos uploaded to the crowd-sourcing platform Mapillary in the state of Massachusetts to within 50m of their ground-truth location. Source code is available at \url{https://github.com/REMOVEDFORREVIEW}.
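Retrieval in the joint embedding space described above reduces, at test time, to a nearest-neighbor search over precomputed cell embeddings. The following sketch shows that step under generic assumptions; the embedding model, cell layout, and dimensions are placeholders, not the paper's specifics.

```python
import torch
import torch.nn.functional as F

def localize(photo_emb: torch.Tensor, cell_embs: torch.Tensor, k: int = 5):
    """photo_emb: (D,) embedding of the query street-view photo;
    cell_embs: (N, D) precomputed embeddings of all aerial cells.
    Returns indices of the k best-matching cells (their centers give candidate locations)."""
    sims = F.normalize(cell_embs, dim=-1) @ F.normalize(photo_emb, dim=0)
    return sims.topk(k).indices

cells = torch.randn(100_000, 256)            # one embedding per geographic cell
query = torch.randn(256)
top_cells = localize(query, cells)           # candidate cells, best first
```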
Pre-trained Visual Dynamics Representations for Efficient Policy Learning
Hao Luo · Bohan Zhou · Zongqing Lu
Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and contain a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder utilizing videos for RL pre-training. To address the challenge of pre-training with videos, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. By adopting video prediction as a pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations. The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective approach to pre-training with videos to promote policy learning.
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
Ming Nie · Renyuan Peng · Chunwei Wang · Xinyue Cai · Jianhua Han · Hang Xu · Li Zhang
Large vision-language models (VLMs) have garnered increasing interest in autonomous driving areas, due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior. Despite their potential, research in autonomous systems is hindered by the lack of datasets with annotated reasoning chains that explain the decision-making processes in driving. To bridge this gap, we present Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. We distinctly characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps, and the question-answer pairs are automatically collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo and ONCE. Moreover, we introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems, addressing the reasoning ambiguities of existing metrics such as BLEU and CIDEr. Based on the proposed benchmark, we conduct experiments to assess various existing VLMs, revealing insights into their reasoning capabilities. Additionally, we develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy. Extensive experiments demonstrate the supportive effect of Reason2Drive towards visual reasoning and downstream planning tasks. The code and dataset will be released.
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
Yanting Yang · Minghao Chen · Qibo Qiu · Jiahao WU · Wenxiao Wang · Binbin Lin · Ziyu Guan · Xiaofei He
For a general-purpose robot to operate in reality, executing a broad range of instructions across various environments is imperative. Central to the reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent advances in vision-language models, such as CLIP, have shown remarkable performance in the domain of deep learning, paving the way for open-domain visual recognition. However, collecting data on robots executing various language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, only utilizing robot video data from a minimal amount of tasks in a singular environment. Unlike common robotic datasets to train reward functions, human video-language datasets seldom include trivial failure videos. To enhance the model's ability to discriminate between successful and failed robot executions, we cluster failure data with the aspiration that the model identifies patterns in failure videos. For each cluster, we incorporate a newly trained failure prompt into the text encoder to bolster its performance in distinguishing failure from success in robot task executions. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
Taewoong Kim · Cheolhong Min · Byeonghwi Kim · Jinyeon Kim · Wonje Jeung · Jonghyun Choi
Simulated virtual environments have been widely used to train robotic agents that perform daily household tasks. These environments have substantially advanced research progress, but often provide limited object interactability, visual appearances that differ from real-world environments, or relatively small environment sizes. This prevents models learned in virtual scenes from being readily deployable. To bridge the gap between these learning environments and deployment (i.e., real) environments, we propose the ReALFRED benchmark that employs real-world scenes, objects, and room layouts to train agents to complete household tasks by understanding free-form language instructions and interacting with objects in large, multi-room, 3D-captured scenes. Specifically, we extend the ALFRED benchmark with larger environmental spaces and smaller visual domain gaps. With ReALFRED, we analyze previously crafted methods for the ALFRED benchmark and observe that they consistently yield lower performance in all metrics, encouraging the community to develop methods for more realistic environments. Our code and data are publicly available.
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu · Hao Cheng · Haotian Liu · Hao Zhang · Feng Li · Tianhe Ren · Xueyan Zou · Jianwei Yang · Hang Su · Jun Zhu · Lei Zhang · Jianfeng Gao · Chunyuan Li
In this paper, we introduce LLaVA-Plus, an end-to-end training approach for systematically expanding the capabilities of large multimodal models (LMMs), towards building general-purpose multimodal agents. It maintains a skill repository that contains a wide range of vision and vision-language pre-trained models as multimodal tools. Based on the user instruction and input image, the LMM is trained to activate the appropriate tools when needed, grasping skills on the fly and aggregating the tool execution results to complete real-world tasks in the wild. To facilitate the model's capability of learning to use skills, we make the first attempt to build multimodal instruction-following data for tool use, covering skills in visual understanding, generation, external knowledge, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities, and extends many new capabilities. Compared with large language model (LLM) based tool use methods, LLaVA-Plus is distinct in that the query image is considered throughout the entire interaction process, yielding higher multimodal tool use performance and enabling new scenarios.
R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations
Xiang Li · Kai Qiu · Jinglu Wang · Xiaohao Xu · Kashu Yamazaki · Hao Chen · Rita Singh · Xiaonan Huang · Bhiksha Raj
Referring perception, which aims at grounding visual objects with multimodal referring guidance, is essential for bridging the gap between humans, who provide instructions, and the environment where intelligent systems perceive. Despite progress in this field, the robustness of referring perception models (RPMs) against disruptive perturbations is not well explored. This work thoroughly assesses the resilience of RPMs against various perturbations in both general and specific contexts. Recognizing the complex nature of referring perception tasks, we present a comprehensive taxonomy of perturbations, and then develop a versatile toolbox for synthesizing and evaluating the effects of compositional disturbances. Employing this toolbox, we construct the R^2-Bench, a benchmark for assessing the Robustness of Referring perception models under noisy conditions across five key tasks. In addition, we propose the R^2-Agent, an LLM-based agent that simplifies and automates model evaluation via natural language instructions. Our investigation reveals the vulnerabilities of current RPMs to various perturbations and provides tools for assessing model robustness, potentially promoting the safe and resilient integration of intelligent systems into complex real-world scenarios.
Agent3D-Zero: An Agent for Zero-shot 3D Understanding
Sha Zhang · Di Huang · Jiajun Deng · Shixiang Tang · Wanli Ouyang · Tong He · Yanyong Zhang
The ability to understand and reason about the 3D real world is a crucial milestone towards artificial general intelligence. The current common practice is to finetune Large Language Models (LLMs) with 3D data and texts to enable 3D understanding. Despite their effectiveness, these approaches are inherently limited by the scale and diversity of the available 3D data. Alternatively, in this work, we introduce Agent3D-Zero, an innovative 3D-aware agent framework addressing 3D scene understanding in a zero-shot manner. The essence of our approach centers on reconceptualizing the challenge of 3D scene perception as a process of understanding and synthesizing insights from multiple images, inspired by how humans attempt to understand 3D scenes. By consolidating this idea, we propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding. Specifically, given an input 3D scene, Agent3D-Zero first processes a bird's-eye view image with custom-designed visual prompts, then iteratively chooses the next viewpoints to observe and summarize the underlying knowledge. A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints and thus facilitate observing 3D scenes. Extensive experiments demonstrate the effectiveness of the proposed framework in understanding diverse and previously unseen 3D environments.
PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts
Zewen Chen · Haina Qin · Juan Wang · Chunfeng Yuan · Bing Li · Weiming Hu · Leon Wang
Due to the diversity of assessment requirements in various application scenarios for the IQA task, existing IQA methods struggle to directly adapt to these varied requirements after training. Thus, when facing new requirements, a typical approach is fine-tuning these models on datasets specifically created for those requirements. However, it is time-consuming to establish IQA datasets. In this work, we propose a Prompt-based IQA (PromptIQA) that can directly adapt to new requirements without fine-tuning after training. On one hand, it utilizes a short sequence of Image-Score Pairs (ISP) as prompts for targeted predictions, which significantly reduces the dependency on the data requirements. On the other hand, PromptIQA is trained on a mixed dataset with two proposed data augmentation strategies to learn diverse requirements, thus enabling it to effectively adapt to new requirements. Experiments indicate that PromptIQA outperforms SOTA methods in both performance and generalization. The code will be available.
An Explainable Vision Question Answer Model via Diffusion Chain-of-Thought
Chunhao LU · Qiang Lu · Jake Luo
Explainable visual question-answering research focuses on generating explanations for answers. However, in complex VQA scenarios, there can be a significant semantic distance between the question and the answer. This means that generating explanations solely for the answer can lead to a semantic discrepancy between the content of the explanation and the question-answering content. To address this, we propose a step-by-step reasoning approach to reduce such semantic discrepancies. Additionally, the task of explaining VQA should include generating explanations for the reasoning steps to obtain explanations for the final answer. We introduce a diffusion chain-of-thought model to implement this step-by-step reasoning and explanation process. The model consists of two processes: the external diffusion and the internal diffusion. The external diffusion process generates explanations for each reasoning step, while the internal diffusion process describes the probability of the question transitioning to each step of the explanation. Through experiments on eight sub-tasks in the ScienceQA dataset, we demonstrate that our diffusion chain-of-thought model outperforms GPT-3.5 in terms of answer accuracy and explanation ability while using only 1% of GPT-3.5's parameters. Furthermore, the model approaches the performance of GPT-4, Llama, and other models on these eight sub-tasks.
Fully Authentic Visual Question Answering Dataset from Online Communities
Chongyan Chen · Mengchen Liu · Noel C Codella · Yunsheng Li · Lu Yuan · Danna Gurari
Visual Question Answering (VQA) entails answering questions about images. We introduce the first VQA dataset in which all contents originate from an authentic use case. Sourced from online question answering community forums, we call it VQAonline. We characterize this dataset and how it relates to eight mainstream VQA datasets. Observing that answers in our dataset tend to be much longer (i.e., a mean of 173 words) and so incompatible with standard VQA evaluation metrics, we instead utilize popular metrics for long-text evaluation to assess six state-of-the-art VQA models on VQAonline and report where they struggle most. Finally, we analyze which evaluation metrics align best with human judgments. To facilitate future extensions, we publicly share the dataset at: https://placeholder.com.
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
Guohao Sun · Can Qin · JIAMINAN WANG · Zeyuan Chen · Ran Xu · Zhiqiang Tao
Recent advancements in vision-language models have shown notable generalization in vision-language tasks after visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language models becomes the whole network's bottleneck. To improve cross-modality alignment, existing works usually consider more visual instruction data covering a broader range of vision tasks to fine-tune the model for question-answering, which is costly to obtain. However, the image contains rich contextual information that has been largely under-explored. This paper makes a first attempt to harness this overlooked context within visual instruction data, training the model in a self-supervised manner to learn how to ask high-quality questions. In this way, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing visual clues and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data shows a consistent performance improvement compared with traditional visual-instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts.
Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning
Yifeng Zhang · Ming Jiang · Qi Zhao
Despite the remarkable success of large vision-language models (LVLMs) on various tasks, their susceptibility to knowledge bias inherited from training data hinders their ability to generalize to new scenarios and limits their real-world applicability. To address this challenge, we propose the Counterfactual Bias-Robust Reasoning (CoBRa) dataset that tackles knowledge bias by offering a novel collection of VQA examples designed to evaluate and mitigate bias in LVLMs. These examples encourage counterfactual thinking by providing edited knowledge graphs and image contents, with detailed annotations of reasoning processes to facilitate a comprehensive understanding of the examples. Based on the dataset, we introduce a Chain of Counterfactual Thought (CoCT) method that learns the bias-robust reasoning processes and provides in-context examples demonstrating how existing reasoning generalizes to counterfactual scenarios. This enables LVLMs to explicitly reason step-by-step rather than relying on biased knowledge, leading to more generalizable solutions. Our extensive evaluation demonstrates that CoCT outperforms existing approaches on tasks requiring reasoning under knowledge bias. Our work is available at https://shorturl.at/GOR45.
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models
Ye-Bin Moon · Nam Hyeon-Woo · Wonseok Choi · Tae-Hyun Oh
Large vision language models (LVLMs) perceive the world through a combination of a visual encoder and large language models (LLMs). The visual encoder, pre-trained on large-scale vision-text datasets, provides zero-shot generalization to visual data, and the LLMs endow LVLMs with strong reasoning ability. This enables LVLMs to achieve high performance on a wide range of benchmarks without fine-tuning, known as the zero- or few-shot capability of LLMs. However, recent studies show that LVLMs are vulnerable to hallucination. This undesirable behavior degrades reliability and credibility, thereby making users unable to fully trust the output from LVLMs. To enhance trustworthiness and better tackle the hallucination of LVLMs, we curate a new evaluation dataset, called the BEfore-AFter hallucination dataset (BEAF), and introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID). Unlike prior works that focus only on constructing questions and answers, the key idea of our benchmark is that we manipulate visual scene information by image editing models and design the metrics based on scene changes. This allows us to clearly assess whether LVLMs correctly understand a given scene by observing the ability to perceive changes. We also visualize the correctness heatmap by virtue of our two-axis view: vision and text. Upon evaluating LVLMs with our dataset, we observed that our metrics can reveal different aspects of LVLM hallucination.
Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs
Shi Liu · Kecheng Zheng · Wei Chen
Large Vision-Language Models (LVLMs) align image features to the input of Large Language Models (LLMs), enhancing multi-modal reasoning and knowledge utilization capabilities. However, the disparity in scale between models of different modalities has resulted in LLMs assuming a predominant role in multimodal comprehension. This imbalance in model integration can lead to hallucinatory outputs. In particular, LVLMs may generate descriptions that persist in the absence of visual input, suggesting that these narratives are disproportionately influenced by the textual context. We refer to this phenomenon as ``text inertia.'' To counteract this issue, we introduce a training-free algorithm designed to find an equilibrium between image comprehension and language inference. Specifically, we first adjust and amplify the attention weights assigned to image tokens, granting greater prominence to visual elements. Meanwhile, we contrast the logits of the multimodal input with the logits of the pure-text input, which prevents the model from being biased toward the LLM alone. By enhancing image tokens and reducing the stubborn output of the LLM, we make the LVLM pay more attention to images, alleviating text inertia and reducing hallucination in LVLMs. Our extensive experiments show that this method substantially reduces the frequency of hallucinatory outputs in various LVLMs across different metrics.
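The logit-contrast step can be sketched as a contrastive-decoding-style adjustment: push the next-token distribution away from what the LLM would predict from the text alone. The direction and (1 + alpha)/alpha weighting below follow common contrastive-decoding practice and are assumptions, not necessarily the paper's exact formula.

```python
import torch

def contrast_logits(logits_mm: torch.Tensor, logits_text: torch.Tensor, alpha: float = 1.0):
    """logits_mm: next-token logits given image + text; logits_text: logits given the text only.
    Down-weights tokens the LLM would emit without looking at the image (weighting is an
    assumption borrowed from contrastive decoding, not the paper's exact formula)."""
    return (1 + alpha) * logits_mm - alpha * logits_text

next_token = contrast_logits(torch.randn(1, 32000), torch.randn(1, 32000)).argmax(dim=-1)
```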
TrojVLM: Backdoor Attack Against Vision Language Models
Weimin Lyu · Lu Pang · Tengfei Ma · Haibin Ling · Chao Chen
The emergence of Vision Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to produce detailed text descriptions based on visual inputs, yet it introduces new security vulnerabilities. Unlike prior work that centered on single modalities or classification tasks, this study introduces TrojVLM, the first exploration of backdoor attacks aimed at VLMs engaged in complex image-to-text generation. Specifically, TrojVLM inserts predetermined target text into output text when encountering poisoned images. Moreover, a novel semantic preserving loss is proposed to ensure the semantic integrity of the original image content. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of TrojVLM in maintaining original semantic content while triggering specific target text outputs. This study not only uncovers a critical security risk in VLMs and image-to-text generation but also sets a foundation for future research on securing multimodal models against such sophisticated threats.
Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks
Hunmin Yang · Jongoh Jeong · Kuk-Jin Yoon
Recent vision-language foundation models, such as CLIP, have demonstrated superior capabilities in learning representations that transfer across a diverse range of downstream tasks and domains. With the emergence of such powerful models, it has become crucial to effectively leverage their capabilities in tackling challenging vision tasks. On the other hand, only a few works have focused on devising adversarial examples that transfer well to both unknown domains and model architectures. In this paper, we propose a novel transfer attack method called PDCL-Attack, which leverages CLIP to enhance the transferability of adversarial perturbations generated within a generative model-based attack framework. Specifically, we exploit the joint vision-language space to formulate an effective prompt-driven feature guidance by harnessing the semantic representation power of text, particularly from the input ground truth. To the best of our knowledge, we are the first to introduce prompt learning to enhance transferable generative attacks. Extensive experiments conducted across various cross-domain and cross-model settings empirically validate our approach, demonstrating its superiority over state-of-the-art methods.
Attention Prompting on Image for Large Vision-Language Models
Runpeng Yu · Weihao Yu · Xinchao Wang
Compared with Large Language Models (LLMs), Large Vision-Language Models (LVLMs) can also accept images as input, thus showcasing more interesting emergent capabilities and demonstrating impressive performance on various vision-language tasks. Motivated by text prompting in LLMs, visual prompting has been explored to enhance LVLMs' capabilities of perceiving visual information. However, previous visual prompting techniques solely process visual inputs without considering text queries, limiting the models' ability to follow text instructions to complete tasks. To fill this gap, in this work, we propose a new prompting technique named Attention Prompting on Image (API), which simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLMs on various tasks. Specifically, we generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP. Then the heatmap simply multiplies the pixel values of the original image to obtain the actual input image for the LVLM. Extensive experiments on various vision-language benchmarks verify the effectiveness of our technique. For example, API improves LLaVA-1.5 by 3.8% and 2.9% on the MM-Vet and LLaVA-Wild benchmarks, respectively.
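A minimal sketch of the prompting step described above, assuming the text-query-to-patch similarity map from an auxiliary model such as CLIP is already computed: normalize the heatmap, upsample it to the image resolution, and multiply it into the pixel values. The function name and the normalization step are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def apply_attention_prompt(image: torch.Tensor, patch_sim: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) in [0, 1]; patch_sim: (h, w) text-query-to-patch similarities from an
    auxiliary model such as CLIP (assumed precomputed). Overlays the heatmap by multiplication."""
    heat = patch_sim.clamp(min=0)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)   # normalize to [0, 1]
    heat = F.interpolate(heat[None, None], size=image.shape[-2:], mode="bilinear", align_corners=False)
    return image * heat[0]                                          # (3, H, W) prompted image

prompted = apply_attention_prompt(torch.rand(3, 224, 224), torch.rand(14, 14))
```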
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
Dilxat Muhtar · Zhenshi Li · Feng Gu · Xueliang Zhang · Pengfeng Xiao
The revolutionary capabilities of large language models (LLMs) have paved the way for multimodal large language models (MLLMs) and fostered diverse applications across various specialized domains. In the remote sensing (RS) field, however, the diverse geographical landscapes and varied objects in RS imagery are not adequately considered in recent MLLM endeavors. To bridge this gap, we construct a large-scale RS image-text dataset, LHRS-Align, and an informative RS-specific instruction dataset, LHRS-Instruct, leveraging the extensive volunteered geographic information (VGI) and globally available RS images. Building on this foundation, we introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel multi-level vision-language alignment strategy and a curriculum learning method. Additionally, we introduce LHRS-Bench, a benchmark for thoroughly evaluating MLLMs’ abilities in RS image understanding. Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain.
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
KUNPENG SONG · Yizhe Zhu · Bingchen Liu · Qing Yan · Ahmed Elgammal · Xiao Yang
In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity-preservation and prompt faithfulness. We commit to making our work open-source, thereby providing universal access to these advancements.
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
Bu Jin · Yupeng Zheng · Pengfei Li · Weize Li · Yuhang Zheng · Sujie Hu · Xinyu Liu · Jinwei Zhu · Zhijie Yan · Haiyang Sun · Kun Zhan · Peng Jia · Xiaoxiao Long · Yilun Chen · HAO ZHAO
3D dense captioning stands as a cornerstone of comprehensive scene understanding through explicit natural language, and it has recently seen remarkable achievements in indoor scenes. However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1. the domain gap between indoor and outdoor scenes, such as sparse visual inputs and dynamics, making it difficult to directly transfer existing methods; 2. the lack of data with descriptive 3D-Language pair annotations specifically tailored for outdoor scenes. Hence, we introduce the new task of outdoor 3D dense captioning. As input, we assume a point cloud of a LiDAR-swept 3D scene along with a set of RGB images captured by the ego-camera. To address this task, we propose the TOD^3Cap network, leveraging the BEV representation to encode sparse outdoor scenes, and then combining Relation Q-Former with LLaMA-Adapter to capture spatial relationships and generate rich concept descriptions in the open-world outdoor environment. We also introduce the TOD^3Cap dataset, the first million-scale effort to jointly perform 3D object detection and captioning in outdoor scenes, containing 2.3M descriptions of 64.3k outdoor objects from 850 scenes in nuScenes. Notably, our TOD^3Cap network can effectively localize and describe 3D objects outdoors, outperforming indoor baseline methods by a significant margin (+9.76\% CiDEr@0.5IoU).
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
Young Kyun Jang · Dat B Huynh · Ashish Shah · Wen-Kai Chen · Ser-Nam Lim
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually-annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text-side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of the scale of the training dataset and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, thereby highlighting its wider potential. The integration of the Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
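To make the Slerp-based composition concrete, below is a minimal sketch of spherical linear interpolation between L2-normalized image and text embeddings, followed by cosine-similarity retrieval. The interpolation weight, feature dimensions, and gallery setup are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def slerp(z_img: torch.Tensor, z_txt: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Spherical linear interpolation between two (batches of) unit vectors.

    z_img, z_txt: (B, D) embeddings; t in [0, 1] controls how far the result
    moves from the image embedding (t=0) toward the text embedding (t=1).
    """
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    # Angle between the two embeddings, clamped for numerical stability.
    cos_omega = (z_img * z_txt).sum(dim=-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.acos(cos_omega)
    sin_omega = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / sin_omega) * z_img + \
           (torch.sin(t * omega) / sin_omega) * z_txt

# Illustrative usage: the composed query is matched against gallery embeddings
# by cosine similarity (both sides L2-normalized).
if __name__ == "__main__":
    B, D, N = 4, 512, 100
    z_img, z_txt = torch.randn(B, D), torch.randn(B, D)
    gallery = F.normalize(torch.randn(N, D), dim=-1)
    query = F.normalize(slerp(z_img, z_txt, t=0.5), dim=-1)
    scores = query @ gallery.T          # (B, N) retrieval scores
    print(scores.argmax(dim=-1))
```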
Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits
Ada-Astrid Balauca · Danda Paudel · Kristina Toutanova · Luc Van Gool
CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, it does not offer application-specific fine-grained and structured understanding, due to its generic nature. In this work, we aim to adapt CLIP for fine-grained and structured -- in the form of tabular data -- visual understanding of museum exhibits. To facilitate such understanding we (a) collect, curate, and benchmark a dataset of 200K+ image-table pairs, and (b) develop a method that allows predicting tabular outputs for input images. Our dataset is the first of its kind in the public domain. At the same time, the proposed method is novel in leveraging CLIP's powerful representations for fine-grained and tabular understanding. The proposed method (MUZE) learns to map CLIP's image embeddings to the tabular structure by means of a proposed transformer-based parsing network (parseNet). More specifically, parseNet enables prediction of missing attribute values while integrating context from known attribute-value pairs for an input image. We show that this leads to significant improvement in accuracy. Through exhaustive experiments, we show the effectiveness of the proposed method on fine-grained and structured understanding of museum exhibits, by achieving encouraging results in a newly established benchmark. Our dataset and source-code will be made publicly available.
Prompting Language-Informed Distribution for Compositional Zero-Shot Learning
Wentao Bao · Lichang Chen · Heng Huang · Yu Kong
The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts, e.g., sliced tomatoes, where the model is learned only from the seen compositions, e.g., sliced potatoes and red tomatoes. Thanks to prompt tuning on large pre-trained visual language models such as CLIP, recent literature shows substantially better CZSL performance than traditional vision-based methods. However, the key aspects that impact the generalization to unseen compositions, including the diversity and informativeness of class context, and the entanglement between visual primitives, i.e., state and object, are not properly addressed in existing CLIP-based CZSL literature. In this paper, we propose a model that prompts the language-informed distribution, a.k.a. PLID, for the CZSL task. Specifically, PLID leverages pre-trained large language models (LLMs) to 1) formulate language-informed class distributions that are diverse and informative, and 2) enhance the compositionality of the class embedding. Moreover, a visual-language primitive decomposition (VLPD) module and a stochastic logit mixup (SLM) strategy are proposed to dynamically fuse the decisions from the compositional and the primitive logit spaces. Orthogonal to the existing literature on soft, hard, or distributional prompts, our method advocates prompting the LLM-supported class distribution, which leads to better zero-shot generalization. Experimental results on MIT-States, UT-Zappos, and C-GQA datasets show the superior performance of PLID over the prior arts. Our code and models will be released after acceptance.
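The stochastic logit mixup idea can be illustrated with a minimal sketch that blends compositional-space and primitive-space logits with a randomly sampled coefficient during training. The Beta-distributed mixing and the fixed test-time average are assumptions for illustration; the paper's exact fusion rule may differ.

```python
import torch

def stochastic_logit_mixup(comp_logits, prim_logits, alpha=1.0, training=True):
    """Fuse compositional-space and primitive-space class logits.

    comp_logits, prim_logits: (B, C) logits over composition classes produced
    by the two decision spaces. During training a Beta-sampled coefficient
    randomizes the fusion; at test time a fixed 0.5 average is one simple
    deterministic choice.
    """
    if training:
        lam = torch.distributions.Beta(alpha, alpha).sample((comp_logits.size(0), 1))
        lam = lam.to(comp_logits.device)
    else:
        lam = torch.full((comp_logits.size(0), 1), 0.5, device=comp_logits.device)
    return lam * comp_logits + (1.0 - lam) * prim_logits
```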
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
Qiaomu Miao · Alexandros Graikos · Jingwei Zhang · Sounak Mondal · Minh Hoai · Dimitris Samaras
Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators, which is a laborious and inherently ambiguous process. We propose the first semi-supervised method for gaze following by introducing two novel priors to the task. We obtain the first prior using a large pretrained Visual Question Answering (VQA) model, where we compute Grad-CAM heatmaps by `prompting' the VQA model with a gaze following question. These heatmaps can be noisy and not suited for use in training. The need to refine these noisy annotations leads us to incorporate a second prior. We utilize a diffusion model trained on limited human annotations and modify the reverse sampling process to refine the Grad-CAM heatmaps. By tuning the diffusion process we achieve a trade-off between the human annotation prior and the VQA heatmap prior, which retains the useful VQA prior information while exhibiting similar properties to the training data distribution. Our method outperforms simple pseudo-annotation generation baselines on the GazeFollow image dataset. More importantly, our pseudo-annotation strategy, applied to a widely used supervised gaze following model (VAT), reduces the annotation need by 50%. Our method also performs the best on the VideoAttentionTarget dataset. Our source code will be made available upon publication.
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
Jiedong Zhuang · Jiaqi Hu · Lianrui Mu · Rui Hu · Xiaoyu Liang · Jiangnan Ye · Haoji Hu
CLIP has achieved impressive zero-shot performance after pretraining on a large-scale dataset consisting of paired image-text data. Previous works have utilized CLIP by incorporating manually designed visual prompts, such as colored circles and blur masks, into the images to guide the model's attention, showing enhanced zero-shot performance in downstream tasks. Although these methods have achieved promising results, they inevitably alter the original information of the images, which can lead to failure in specific tasks. We propose a training-free method, Foveal-Attention CLIP (FALIP), which adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module. We demonstrate that FALIP effectively boosts CLIP's zero-shot performance in tasks such as referring expression comprehension, image classification, and 3D point cloud recognition. Experimental results further show that FALIP outperforms existing methods on most metrics and can augment current methods to enhance their performance.
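One way to picture a foveal attention mask is as an additive bias on the self-attention logits that favors tokens inside the prompted region while leaving the image pixels untouched. The sketch below assumes a simple constant bias and standard scaled dot-product attention; the paper's mask design is likely more nuanced.

```python
import torch
import torch.nn.functional as F

def attention_with_foveal_bias(q, k, v, region_tokens, bias_value=2.0):
    """Multi-head self-attention with an additive bias toward region tokens.

    q, k, v: (B, H, N, d) projected queries/keys/values.
    region_tokens: (B, N) boolean mask, True for tokens inside the visual
    prompt region. A positive bias is added to attention logits that target
    those tokens, so the input image itself is never modified.
    """
    B, H, N, d = q.shape
    logits = q @ k.transpose(-2, -1) / d ** 0.5            # (B, H, N, N)
    bias = region_tokens.float()[:, None, None, :] * bias_value
    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v

if __name__ == "__main__":
    B, H, N, d = 1, 8, 197, 64
    q, k, v = (torch.randn(B, H, N, d) for _ in range(3))
    region = torch.zeros(B, N, dtype=torch.bool)
    region[:, 50:80] = True                                 # tokens covered by the prompt
    print(attention_with_foveal_bias(q, k, v, region).shape)
```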
Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild
Donggyun Kim · Seongwoong Cho · Semin Kim · Chong Luo · Seunghoon Hong
Large language models have evolved into data-efficient generalists, benefiting from the universal language interface and large-scale pre-training. However, constructing a data-efficient generalist for dense visual prediction presents a distinct challenge due to the variation in label structures across different tasks. Consequently, generalization to unseen dense prediction tasks in the low-data regime is not straightforward and has received less attention from previous vision generalists. In this study, we explore a universal model that can flexibly adapt to unseen dense label structures with a few examples, enabling it to serve as a data-efficient vision generalist in diverse real-world scenarios. To this end, we base our method on a powerful meta-learning framework and explore several axes to improve its performance and versatility for real-world problems, such as flexible adaptation mechanisms and scalability. We evaluate our model across a spectrum of unseen real-world scenarios where low-shot learning is desirable, including video, 3D, medical, biological, and user-interactive tasks. Equipped with a generic architecture and an effective adaptation mechanism, our model flexibly adapts to all of these tasks with at most 50 labeled images, showcasing a significant advancement over existing data-efficient generalist approaches.
Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection
Zihan Zhang · Zhuo Xu · Xiang Xiang
Recent vision-language models (VLMs) such as CLIP have shown promise in Out-of-distribution (OOD) detection through their generalizable multimodal representations. Existing CLIP-based OOD detection methods only utilize a single modality of in-distribution (ID) information (e.g., textual cues). However, we find that ID visual information helps to leverage CLIP's full potential for OOD detection. In this paper, we pursue a different approach and explore leveraging both the visual and textual ID information. Specifically, we propose Dual-Pattern Matching (DPM), efficiently adapting CLIP for OOD detection by leveraging both textual and visual ID patterns. DPM stores ID class-wise text features as the textual pattern and the aggregated ID visual information as the visual pattern. At test time, the similarity to both patterns is computed to detect OOD inputs. We further extend DPM with lightweight adaptation for enhanced OOD detection. Experiments demonstrate DPM's advantages, outperforming existing methods on common benchmarks. The dual-pattern approach provides a simple yet effective way to exploit multi-modality for OOD detection with vision-language representations.
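A minimal sketch of test-time dual-pattern scoring is given below: a test feature is compared against both the stored class-wise text pattern and the aggregated visual pattern, and the two confidence scores are fused. The temperature and the averaged max-probability fusion are illustrative assumptions, not the paper's exact scoring rule.

```python
import torch
import torch.nn.functional as F

def dual_pattern_ood_score(img_feat, text_pattern, visual_pattern, tau=0.01):
    """Score a test image against textual and visual ID patterns.

    img_feat:        (B, D) CLIP image features of test inputs.
    text_pattern:    (C, D) per-class text embeddings (e.g., prompt features).
    visual_pattern:  (C, D) per-class aggregated ID visual features.
    Returns a confidence score per input; lower scores suggest OOD.
    """
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(text_pattern, dim=-1)
    vis = F.normalize(visual_pattern, dim=-1)
    p_txt = F.softmax(img @ txt.T / tau, dim=-1)   # (B, C) text-pattern probabilities
    p_vis = F.softmax(img @ vis.T / tau, dim=-1)   # (B, C) visual-pattern probabilities
    # One simple fusion: average the two maximum class probabilities.
    return 0.5 * (p_txt.max(dim=-1).values + p_vis.max(dim=-1).values)
```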
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
Qing Jiang · Feng Li · Zhaoyang Zeng · Shilong Liu · Tianhe Ren · Lei Zhang
We present T-Rex2, a highly practical model for open-set object detection. Previous open-set object detection methods relying on text prompts effectively encapsulate the abstract concept of common objects, but struggle with rare or complex object representation due to data scarcity and descriptive limitations. Conversely, visual prompts excel in depicting novel objects through concrete visual examples, but fall short in conveying the abstract concept of objects as effectively as text prompts. Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning. T-Rex2 accepts inputs in diverse formats, including text prompts, visual prompts, and the combination of both, so that it can handle different scenarios by switching between the two prompt modalities. Comprehensive experiments demonstrate that T-Rex2 exhibits remarkable zero-shot object detection capabilities across a wide spectrum of scenarios. We show that text prompts and visual prompts can benefit from each other within the synergy, which is essential to cover massive and complicated real-world scenarios and pave the way towards generic object detection.
Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation
Jaehyeong Jeon · Kibum Kim · Kanghoon Yoon · Chanyoung Park
The scene graph generation (SGG) task involves detecting objects within an image and predicting predicates that represent the relationships between the objects. However, as each subject-object pair in SGG benchmark datasets is annotated with a single predicate even though a single predicate may exhibit diverse semantics (i.e., semantic diversity), existing SGG models are trained to predict the one and only predicate annotated between each subject-object pair. This in turn causes the SGG models to overlook the semantic diversity that may exist in a predicate, thus leading to biased predictions. In this paper, we propose a novel model-agnostic Semantic Diversity-aware Prototype-based Learning (DPL) framework that enables unbiased predictions based on the understanding of the semantic diversity of predicates. Specifically, DPL learns the regions in the semantic space covered by each predicate to distinguish among the various different semantics that a single predicate can represent. Extensive experiments demonstrate that our proposed model-agnostic DPL framework brings significant performance improvement on existing SGG models, and also effectively understands the semantic diversity of predicates.
OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection
Hu Zhang · xu jianhua · Tao Tang · Haiyang Sun · Xin Yu · Zi Helen Huang · Kaicheng Yu
Traditional LiDAR-based object detection research primarily focuses on closed-set scenarios, which falls short in complex real-world applications. Directly transferring existing 2D open-vocabulary models with some known LiDAR classes for open-vocabulary ability, however, tends to suffer from over-fitting problems: the obtained model will detect the known objects, even when presented with a novel category. In this paper, we propose OpenSight, a more advanced 2D-3D modeling framework for LiDAR-based open-vocabulary detection. OpenSight utilizes 2D-3D geometric priors for the initial discernment and localization of generic objects, followed by a more specific semantic interpretation of the detected objects. The process begins by generating 2D boxes for generic objects from the accompanying camera images of LiDAR. These 2D boxes, together with LiDAR points, are then lifted back into the LiDAR space to estimate corresponding 3D boxes. For better generic object perception, our framework integrates both temporal and spatial-aware constraints. Temporal awareness correlates the predicted 3D boxes across consecutive timestamps, recalibrating the missed or inaccurate boxes. The spatial awareness randomly places some ``precisely'' estimated 3D boxes at varying distances, increasing the visibility of generic objects. To interpret the specific semantics of detected objects, we develop a cross-modal alignment and fusion module to first align 3D features with 2D image embeddings and then fuse the aligned 3D-2D features for semantic decoding. Our experiments indicate that our method establishes state-of-the-art open-vocabulary performance on widely used 3D detection benchmarks and effectively identifies objects for new categories of interest.
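The 2D-to-3D lifting step can be pictured with a toy frustum-style sketch: project LiDAR points into the image, keep those falling inside a 2D box, and estimate a coarse axis-aligned box from them. The pinhole camera model and min/max box estimation are deliberate simplifications of the paper's procedure.

```python
import numpy as np

def lift_box_to_3d(points_lidar, box2d, K, T_cam_from_lidar):
    """Estimate a coarse axis-aligned 3D box from a 2D box and LiDAR points.

    points_lidar:     (N, 3) LiDAR points.
    box2d:            (x1, y1, x2, y2) 2D box in pixel coordinates.
    K:                (3, 3) camera intrinsics.
    T_cam_from_lidar: (4, 4) LiDAR-to-camera extrinsics.
    """
    # Transform to the camera frame and keep points in front of the camera.
    pts_h = np.c_[points_lidar, np.ones(len(points_lidar))]
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Pinhole projection into pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Keep points whose projection lies inside the 2D box (the frustum crop).
    x1, y1, x2, y2 = box2d
    inside = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    pts_in = pts_cam[inside]
    if len(pts_in) == 0:
        return None

    # Coarse axis-aligned 3D box: center and extent of the cropped points.
    lo, hi = pts_in.min(axis=0), pts_in.max(axis=0)
    return {"center": (lo + hi) / 2, "size": hi - lo}
```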
O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation
Muer Tie · Julong Wei · Zhengjun Wang · Ke Wu · Shanshuai Yuan · Kaizhao Zhang · Jie Jia · Jieru Zhao · Zhongxue Gan · Wenchao Ding
Online construction of open-ended language scenes is crucial for robotic applications, where open-vocabulary interactive scene understanding is required. Recently, neural implicit representation has provided a promising direction for online interactive mapping. However, implementing open-vocabulary scene understanding capability into online neural implicit mapping still faces three challenges: lack of local scene updating ability, blurry spatial hierarchical semantic segmentation, and difficulty in maintaining multi-view consistency. To this end, we propose O2V-Mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field, thus allowing for local updates during the online training process. Additionally, we leverage a foundational model for image segmentation to extract language features on object-level entities, achieving clear segmentation boundaries and hierarchical semantic features. For the purpose of preserving consistency in 3D object properties across different viewpoints, we propose a spatial adaptive voxel adjustment mechanism and a multi-view weight selection method. Extensive experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-Mapping achieves online construction of language scenes while enhancing accuracy, outperforming the previous SOTA method.
APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension
Yaxin Luo · Jiayi Ji · Xiaofu Chen · Yuxin Zhang · Tianhe Ren · Luo
Referring Expression Comprehension (REC) aims to ground the target object based on a given referring expression, which requires expensive instance-level annotations for training. To address this issue, recent advances explore an efficient one-stage weakly supervised REC model called RefCLIP. Particularly, RefCLIP utilizes anchor features of pre-trained one-stage detection networks to represent candidate objects and conducts anchor-text ranking to locate the referent. Despite the effectiveness, we identify that visual semantics of RefCLIP are ambiguous and insufficient for weakly supervised REC modeling. To address this issue, we propose a novel method that enriches visual semantics with various prompt information, called anchor-based prompt learning (APL). Specifically, APL contains an innovative anchor-based prompt encoder (APE) to produce discriminative prompts covering three aspects of REC modeling, e.g., position, color and category. These prompts are dynamically fused into anchor features to improve the visual description power. In addition, we propose two novel auxiliary objectives to achieve accurate vision-language alignment in APL, namely text reconstruction loss and visual alignment loss. To validate APL, we conduct extensive experiments on four REC benchmarks, namely RefCOCO, RefCOCO+, RefCOCOg and ReferIt. Experimental results not only show the state-of-the-art performance of APL against existing methods on four benchmarks, e.g., +6.44% over RefCLIP on RefCOCO, but also confirm its strong generalization ability on weakly supervised referring expression segmentation. Source codes are anonymously released at: https://anonymous.4open.science/r/APL-B297.
GTMS: A Gradient-driven Tree-guided Mask-free Referring Image Segmentation Method
Haoxin Lyu · Tianxiong Zhong · Sanyuan Zhao
Referring image segmentation (RIS) aims to segment an object of interest given a natural language expression. As fully-supervised methods require expensive pixel-wise labeling, mask-free solutions supervised by low-cost labels are largely desired. However, existing mask-free methods suffer from complicated architectures or unsatisfying performance. In this paper, we propose a gradient-driven tree-guided mask-free referring image segmentation method, GTMS, which utilizes both low-level structural information and high-level semantic information, while only using a bounding box as the supervision signal. Specifically, we first mine the structural information using a tree filter on the low-level features. Meanwhile, we explore semantic attention via GradCAM on the high-level features. Finally, the tree structure and attention information are used to refine the output of the segmentation model to generate pseudo labels, which in turn are used to optimize the model. To verify the effectiveness of our model, experiments are conducted on three benchmarks, \textit{i.e.}, RefCOCO/+/g. Notably, it achieves 66.54\%, 69.98\%, and 63.41\% IoU on RefCOCO Val-Test, TestA, and TestB, outperforming most of the fully-supervised models.
MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
Baijiong Lin · Weisen Jiang · Pengguang Chen · Yu Zhang · Shu Liu · Yingcong Chen
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. Source code will be made available upon acceptance.
ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation
Mengcheng Lan · Chaofeng Chen · Yiping Ke · Xinjiang Wang · Litong Feng · Wayne Zhang
Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer capacity. We propose an adaptive normalization and masking strategy to get the proxy attention from VFMs, allowing for adaptation across different VFMs. Remarkably, as a \emph{training-free} approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks \textbf{from 40.3 to 44.4}, showcasing its exceptional efficacy in bridging the gap between spatial precision and semantic richness for the open-vocabulary segmentation task.
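The core of proxy attention can be sketched in a few lines: patch features from a vision foundation model define a normalized affinity matrix, which is then used to re-aggregate CLIP's dense value features. The cosine affinity, thresholded mask, and temperature below are simplified stand-ins for the paper's adaptive normalization and masking strategy.

```python
import torch
import torch.nn.functional as F

def proxy_attention(clip_values, vfm_feats, mask_threshold=0.0):
    """Re-aggregate CLIP dense features with a VFM-derived affinity matrix.

    clip_values: (B, N, D) value-path features from CLIP's last attention layer.
    vfm_feats:   (B, N, Dv) patch features from a vision foundation model
                 (e.g., DINO) aligned to the same N patch locations.
    The VFM affinity acts as a proxy attention: spatially consistent patches
    exchange information, while low-affinity pairs are masked out.
    """
    f = F.normalize(vfm_feats, dim=-1)
    affinity = f @ f.transpose(-2, -1)                        # (B, N, N) cosine affinity
    affinity = affinity.masked_fill(affinity < mask_threshold, float("-inf"))
    attn = F.softmax(affinity / 0.1, dim=-1)                  # temperature is illustrative
    return attn @ clip_values                                 # (B, N, D) refined features
```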
MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
Anurag Das · Xinting Hu · Li Jiang · Bernt Schiele
Recent approaches have shown that large-scale vision-language models such as CLIP can improve semantic segmentation performance. These methods typically aim for pixel-level vision language alignment, but often rely on low-resolution image features from CLIP, resulting in class ambiguities along boundaries. Moreover, the global scene representations in CLIP text embeddings do not directly correlate with the local and detailed pixel-level features, making meaningful alignment more difficult. To address these limitations, we introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment. Specifically, we first propose a Mask-Text Decoder that enhances the mask representations using rich textual data with the CLIP language model. Subsequently, it aligns mask representations with text embeddings using Mask-to-Text Contrastive Learning. Furthermore, we introduce Mask-Text Prompt Learning, utilizing multiple context-specific prompts for text embeddings to capture diverse class representations across masks. Overall, MTA-CLIP achieves state-of-the-art performance, surpassing prior works by an average of 2.8% and 1.3% on the standard benchmark datasets ADE20k and Cityscapes, respectively.
Think before Placement: Common Sense Enhanced Transformer for Object Placement
Yaxuan Qin · Jiayu Xu · Ruiping Wang · Xilin CHEN
Object placement is the task of inserting a foreground object into a background scene at a suitable position and size. Existing methods mainly focus on extracting better visual features, while neglecting common sense about the objects and background, which leads to semantically unrealistic object positions. In this paper, we introduce Think Before Placement, a novel framework that effectively combines implicit and explicit knowledge to generate placements that are both visually coherent and contextually appropriate. Specifically, we first adopt a large multi-modal model to generate a descriptive caption of the background (Think), then output the proper position and size of the object (Place). The caption serves as explicit semantic guidance for the subsequent placement of objects. Using this framework, we implement our model named CSENet, which outperforms baseline methods on the OPA dataset in extensive experiments. Further, we establish the OPAZ dataset to evaluate the zero-shot transfer capabilities of CSENet, where it also shows impressive zero-shot performance across different foreground objects and scenes.
Eliminating Feature Ambiguity for Few-Shot Segmentation
Qianxiong Xu · Guosheng Lin · Chen Change Loy · Cheng Long · Ziyue Li · Rui Zhao
Recent advancements in few-shot segmentation (FSS) have exploited pixel-by-pixel matching between query and support features, typically based on cross attention, which selectively activates query foreground (FG) features that correspond to same-class support FG features. However, due to the large receptive fields in deep layers of the backbone, the extracted query and support FG features are inevitably mingled with different background (BG) features, impeding the FG-FG matching in cross attention. Hence, the query FG features are fused with fewer support FG features, i.e., the support information is not well utilized. This paper presents a novel plug-in termed ambiguity elimination network (AENet), which can be plugged into any existing cross attention-based FSS method. The main idea is to mine discriminative query FG regions to rectify the ambiguous FG features, increasing the proportion of FG information, so as to suppress the negative impacts of the doped BG features. In this way, the FG-FG matching is naturally enhanced. We plug AENet into two baselines, CyCTR and SCCAN, for evaluation, and their scores are improved by large margins, e.g., the 1-shot performance of SCCAN can be improved by 3.0%+ on both PASCAL-5i and COCO-20i. The source code will be released upon paper acceptance.
Diffusion-Guided Weakly Supervised Semantic Segmentation
Sung-Hoon Yoon · Hoyong Kwon · Jaeseok Jeong · Daehee Park · Kuk-Jin Yoon
Weakly Supervised Semantic Segmentation (WSSS) with image-level supervision typically uses Class Activation Maps (CAMs) to localize objects based on Convolutional Neural Networks (CNNs). With limited receptive fields, CNN-based CAMs often fail to localize the whole object. The emergence of the Vision Transformer (ViT) alleviates the problem with superior performance, but the lack of locality in ViT introduces a new challenge. Inspired by the ability of Denoising Diffusion Probabilistic Models (DDPM) to capture high-level semantic information, we bring diffusion models to WSSS to resolve the problem. Firstly, to fuse and semantically align the information between DDPM and ViT, we design the Locality Fusion Cross Attention (LFCA) module. Using the aggregated features from the denoising process of the pretrained DDPM, LFCA generates CAMs (Diffusion-CAMs) that provide locality information to CAMs from ViT (ViT-CAMs). Secondly, by adding noise to the original image and denoising it with DDPM, we obtain a denoised image that can be leveraged as an augmented sample. To effectively guide ViT in excavating the relation between the patches, we devise Patch Affinity Consistency (PAC) between the outputs of the original image and the denoised image. Extensive ablation studies support the superiority of the proposed method. Our method achieves new state-of-the-art performance on two widely used datasets in WSSS: PASCAL VOC 2012 and MS-COCO 2014. The code will soon be released.
Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs
Jeongkee Lim · Yusung Kim
The challenge of semantic segmentation in Unsupervised Domain Adaptation (UDA) emerges not only from domain shifts between source and target images but also from discrepancies in class taxonomies across domains. Traditional UDA research assumes consistent taxonomy between the source and target domains, thereby limiting the ability to recognize and adapt to the taxonomy of the target domain. This paper introduces a novel approach, Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using Vision Language Models (CSI), which effectively performs domain-adaptive semantic segmentation even in situations of source-target class mismatches. CSI leverages the semantic generalization potential of Vision Language Models (VLMs) to create synergy with previous UDA methods. It utilizes segment reasoning obtained through traditional UDA methods, alongside the rich semantic knowledge embedded in VLMs, to perform relabeling to classes of the target domain. This approach allows for effective adaptation to changed taxonomies without requiring any ground truth label for the target domain. Our method proves effective across various benchmarks with inconsistent taxonomy settings, such as coarse-to-fine taxonomy and open taxonomy, and demonstrates consistent synergy effects when integrated with previous state-of-the-art UDA methods.
Better Call SAL: Towards Learning to Segment Anything in Lidar
Aljoša Ošep · Tim Meinhardt · Francesco Ferroni · Neehar Peri · Deva Ramanan · Laura Leal-Taixe
We propose SAL (Segment Anything in Lidar), a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervision. While the established paradigm for Lidar Panoptic Segmentation (LPS) relies on manual supervision for a handful of object classes defined a priori, we lean on Vision Foundation Models to generate supervision ``for free'' in the form of instance masks and corresponding localized text embeddings, which we distill to Lidar using calibrated multi-modal data. Even though our model is solely trained on self-generated pseudo-labels, SAL achieves 91% of the supervised model performance in terms of class-agnostic segmentation and 44% in terms of zero-shot LPS on standard LPS datasets, and outperforms baselines that directly lift image features to 3D. More importantly, we show that SAL supports arbitrary class prompts, can be easily extended to new datasets, and shows significant potential to improve with an increased amount of self-labeled data.
MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation
Linyan Yang · Lukas Hoyer · Mark Weber · Tobias Fischer · Dengxin Dai · Laura Leal-Taixe · Daniel Cremers · Marc Pollefeys · Luc Van Gool
Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain, e.g., synthetic data and an unlabeled target domain. We observe that current UDA methods show inferior results on fine structures and tend to oversegment objects with ambiguous appearance. To address these shortcomings, we propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries. We show that naively incorporating depth into current UDA methods does not fully exploit the potential of this complementary information. To this end, we present MICDrop which learns a joint feature representation by masking image encoder features while inversely masking depth encoder features. With this simple yet effective complementary masking strategy we enforce the use of both modalities when learning the joint feature representation. To aid this process, we propose a feature fusion module to improve both global as well as local information sharing while being robust to errors in the depth predictions. We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks, obtaining new state-of-the-art performances. The source code will be released with the paper.
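The complementary masking idea is simple enough to sketch directly: a shared random spatial mask zeroes part of the image-encoder features while its complement zeroes the depth-encoder features, so every location must be covered by exactly one modality during fusion. The patch-level Bernoulli mask and drop ratio below are illustrative assumptions.

```python
import torch

def complementary_feature_dropout(img_feat, depth_feat, drop_ratio=0.5):
    """Mask image features and inversely mask depth features.

    img_feat, depth_feat: (B, C, H, W) encoder features of the two modalities,
    assumed spatially aligned. A shared random location mask keeps part of the
    image features, and its complement keeps the corresponding depth features.
    """
    B, _, H, W = img_feat.shape
    keep_image = (torch.rand(B, 1, H, W, device=img_feat.device) > drop_ratio).float()
    return img_feat * keep_image, depth_feat * (1.0 - keep_image)
```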
DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation
Sanghyun Jo · Fei Pan · In-Jae Yu · Kyungsu Kim
Weakly-supervised semantic segmentation (WSS) ensures high-quality segmentation with limited data and excels when employed to produce input seed masks for large-scale vision models such as Segment Anything. However, WSS faces challenges related to minor classes, since these are overlooked in images containing multiple adjacent classes, a limitation originating from the overfitting of traditional expansion methods such as Random Walk. We first address this by employing unsupervised and weakly-supervised feature maps instead of conventional methodologies, allowing for hierarchical mask enhancement. This method distinctly categorizes higher-level classes and subsequently separates their associated lower-level classes, ensuring all classes are correctly restored in the mask without losing minor ones. Our approach, validated through extensive experimentation, significantly improves WSS across five benchmarks (VOC: 79.8%, COCO: 53.9%, Context: 49.0%, ADE: 32.9%, Stuff: 37.4%), reducing the gap with fully supervised methods by over 84% on the VOC validation set. Code will be available at to-be-updated.
Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation
Anqi Zhang · Guangyu Gao
Class Incremental Semantic Segmentation (CISS), within incremental learning for semantic segmentation, targets segmenting new categories while reducing catastrophic forgetting of the old categories. Besides, background shifting, where the background category changes constantly at each step, is a special challenge for CISS. Current methods with a shared background classifier struggle to keep up with these changes, leading to decreased stability in background predictions and reduced segmentation accuracy. For this special challenge, we design a novel background adaptation mechanism, which explicitly models the background residual rather than the background itself in each step, and aggregates these residuals to represent the evolving background. Therefore, the background adaptation mechanism ensures the stability of previous background classifiers, while enabling the model to concentrate on the easily learned residuals from the additional channel, which enhances background discernment for better prediction of novel categories. To precisely optimize the background adaptation mechanism, we propose a Pseudo Background Binary Cross-Entropy loss and Background Adaptation losses, which amplify the adaptation effect. Group Knowledge Distillation and Background Feature Distillation strategies are designed to prevent forgetting of old categories. Our approach, evaluated across various incremental scenarios on the Pascal VOC 2012 and ADE20K datasets, outperforms prior exemplar-free state-of-the-art methods by mIoU margins of 3.0% in VOC 10-1 and 2.0% in ADE 100-5, notably enhancing the accuracy of new classes while mitigating catastrophic forgetting. Code is available in the supplementary material.
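A minimal sketch of residual background modeling is given below: a frozen base background classifier is kept fixed, and each incremental step adds a small residual head whose output is summed into the background logit. The 1x1-conv heads and the plain additive aggregation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ResidualBackgroundHead(nn.Module):
    """Background logit = frozen base background score + sum of per-step residuals."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.base_bg = nn.Conv2d(feat_dim, 1, kernel_size=1)   # learned at step 0, then frozen
        self.residuals = nn.ModuleList()                        # one residual head per new step

    def add_step(self, feat_dim: int):
        # Freeze everything learned so far; only the new residual head is trainable.
        self.base_bg.requires_grad_(False)
        for r in self.residuals:
            r.requires_grad_(False)
        self.residuals.append(nn.Conv2d(feat_dim, 1, kernel_size=1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:     # feats: (B, C, H, W)
        bg = self.base_bg(feats)
        for r in self.residuals:
            bg = bg + r(feats)
        return bg                                                # (B, 1, H, W) background logit
```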
Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models
Francesco Croce · Naman D. Singh · Matthias Hein
Adversarial robustness has been studied extensively in image classification, especially for the $\ell_\infty$-threat model, but significantly less so for related tasks such as object detection and semantic segmentation. Attacks on semantic segmentation models turn out to be harder than for image classification. We propose novel attacks and, motivated by their complementary properties, combine them into an attack ensemble called SEA. We use SEA to show that existing attacks can severely overestimate the robustness of semantic segmentation models. Perhaps surprisingly, existing attempts at adversarial training for semantic segmentation turn out to yield only weakly robust models or are even completely non-robust. We investigate why previous adaptations of adversarial training to semantic segmentation failed and identify insufficient training time and number of attack steps as key elements. In turn, we show how recently proposed robust ImageNet backbones can be used to obtain adversarially robust semantic segmentation models with up to six times less training time for Pascal VOC and the more challenging ADE20k.
ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition
Tianhao Wu · Chuanxia Zheng · Qianyi Wu · Tat-Jen Cham
3D decomposition/segmentation still remains a challenge as large-scale 3D annotated data is not readily available. Contemporary approaches typically leverage 2D machine-generated segments, integrating them for 3D consistency. While the majority of these methods are based on NeRFs, they face a potential weakness in that the instance/semantic embedding features derive from independent MLPs, thus preventing the segmentation network from learning the geometric details of the objects directly through radiance and density. In this paper, we propose ClusteringSDF, a novel approach to achieve both segmentation and reconstruction in 3D via the neural implicit surface representation, specifically the Signed Distance Function (SDF), where the segmentation rendering is directly integrated with the volume rendering of neural implicit surfaces. Although based on ObjectSDF++, ClusteringSDF no longer requires ground-truth segments for supervision while maintaining the capability of reconstructing individual object surfaces, working purely with the noisy and inconsistent labels from pre-trained models. As the core of ClusteringSDF, we introduce a highly efficient clustering mechanism for lifting the 2D labels to 3D, and the experimental results on the challenging scenes from the ScanNet and Replica datasets show that ClusteringSDF can achieve competitive performance compared against the state-of-the-art with significantly reduced training time.
Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts
Jianhao Li · Tianyu Sun · Zhongdao Wang · Enze Xie · Bailan Feng · Hongbo Zhang · Ze Yuan · Ke Xu · Jiaheng Liu · Ping Luo
This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, focusing especially on applications in autonomous driving. Unlike prior art, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a \emph{Segment, Lift, and Fit} (SLF) paradigm to achieve this goal. Firstly, we \emph{segment} high-quality instance masks from the prompts using the Segment Anything Model ({SAM}) and transform the remaining problem into predicting 3D shapes from given 2D masks. Due to the ill-posed nature of this problem, it presents a significant challenge as multiple 3D shapes can project into an identical mask. To tackle this issue, we then \emph{lift} 2D masks to 3D forms and employ gradient descent to adjust their poses and shapes until the projections \emph{fit} the masks and the surfaces conform to surrounding LiDAR points. Notably, since we do not train on a specific dataset, the SLF auto-labeler does not overfit to biased annotation patterns in the training set as other methods do. Thus, the generalization ability across different datasets improves. Experimental results on the KITTI dataset demonstrate that the SLF auto-labeler produces high-quality bounding box annotations, achieving an AP@0.5 IoU of nearly 90\%. Detectors trained with the generated pseudo-labels perform nearly as well as those trained with actual ground-truth annotations. Furthermore, the SLF auto-labeler shows promising results in detailed shape predictions, providing a potential alternative for the occupancy annotation of dynamic objects.
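The "fit" stage can be pictured as a plain gradient-descent loop over pose/shape parameters. The sketch below uses hypothetical callables `render_silhouette` (a differentiable projection of the shape into the image) and `surface_distance` (a differentiable point-to-surface distance); both are placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fit_shape_to_mask(init_params, mask, lidar_pts, render_silhouette, surface_distance,
                      steps=200, lr=1e-2, lam=1.0):
    """Adjust 3D pose/shape parameters so the projection fits a 2D mask.

    init_params:       (P,) initial pose/shape parameters (e.g., a lifted template).
    mask:              (H, W) target instance mask with values in [0, 1].
    lidar_pts:         (N, 3) LiDAR points near the object.
    render_silhouette: hypothetical differentiable renderer, params -> (H, W) soft silhouette.
    surface_distance:  hypothetical differentiable distance, (params, pts) -> (N,).
    """
    params = init_params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        sil = render_silhouette(params)                         # soft projection of the shape
        mask_loss = F.binary_cross_entropy(sil.clamp(1e-6, 1 - 1e-6), mask)
        surf_loss = surface_distance(params, lidar_pts).mean()  # conform to the LiDAR surface
        loss = mask_loss + lam * surf_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params.detach()
```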
EcoMatcher: Efficient Clustering Oriented Matcher for Detector-free Image Matching
Peiqi Chen · Lei Yu · Yi Wan · Yongjun Zhang · Jian Wang · Liheng Zhong · Jingdong Chen · Ming Yang
Detector-free local feature matching methods have demonstrated significant performance improvements since leveraging the power of the Transformer architecture. The global receptive field provided by Transformers allows for simultaneous interaction among all elements, proving particularly beneficial in regions with low texture or repetitive patterns. However, Transformer-based methods encounter a bottleneck in balancing computational cost and expressive efficacy when dealing with numerous patch-level features. In this work, we revisit the existing detector-free methods and propose EcoMatcher, a universal matcher based on implicit clustering, called Context Clusters. By introducing coarser-grained features as clustering centers, similar patch-level features are allocated to the same center, forming different clustering patterns. Features within the same cluster are then dispatched with identical messages from their center but at varying scales depending on similarity. This process defines a novel feature extraction paradigm for both self-understanding and cross-interaction of image pairs, aiding in fusing multi-level features and reducing the overall complexity. EcoMatcher is a competitive detector-free method in terms of memory demand and runtime speed, and also achieves strong performance on both indoor and outdoor mainstream benchmarks.
Class-Agnostic Object Counting with Text-to-Image Diffusion Model
Xiaofei Hui · Qian Wu · Hossein Rahmani · Jun Liu
Class-agnostic object counting aims to count objects of arbitrary classes with limited information (e.g., a few exemplars or the class names) provided. It requires the model to effectively acquire the characteristics of the target objects and accurately perform counting, which can be challenging. In this work, inspired by the fact that text-to-image diffusion models hold rich knowledge and a comprehensive understanding of real-world objects, we propose to leverage the pre-trained text-to-image diffusion model to facilitate class-agnostic object counting. Specifically, we propose a novel framework named CountDiff with careful designs, leveraging the pre-trained diffusion model's comprehensive understanding of image contents to perform class-agnostic object counting. The experiments show the effectiveness of CountDiff in both the few-shot setting with exemplars provided and the zero-shot setting with class names provided.
Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector
Yuqian Fu · Yu Wang · Yixuan Pan · Xingyu Qiu · Lian Huai · Zeyu Shangguan · Tong Liu · Yanwei Fu · Luc Van Gool · Xingqun Jiang
This paper studies the challenging cross-domain few-shot object detection (CD-FSOD), aiming to develop an accurate object detector for novel domains with minimal labeled examples. While transformer-based open-set detectors, such as DE-ViT, show promise in traditional few-shot object detection, their generalization to CD-FSOD remains unclear: 1) can such open-set detection methods easily generalize to CD-FSOD? 2) If not, how can models be enhanced when facing huge domain gaps? To answer the first question, we employ measures including style, inter-class variance (ICV), and indefinable boundaries (IB) to understand the domain gap. Based on these measures, we establish a new benchmark named CD-FSOD to evaluate object detection methods, revealing that most of the current approaches fail to generalize across domains. Technically, we observe that the performance decline is associated with our proposed measures: style, ICV, and IB. Consequently, we propose several novel modules to address these issues. First, the learnable instance features align initial fixed instances with target categories, enhancing feature distinctiveness. Second, the instance reweighting module assigns higher importance to high-quality instances with slight IB. Third, the domain prompter encourages features resilient to different styles by synthesizing imaginary domains without altering semantic contents. These techniques collectively contribute to the development of the Cross-Domain Vision Transformer for CD-FSOD (CD-ViTO), significantly improving upon the base DE-ViT. Experimental results validate the efficacy of our model. All datasets, codes, and models will be released to the community.
Co-Student: Collaborating Strong and Weak Students for Sparsely Annotated Object Detection
Lianjun Wu · Jiangxiao Han · Zengqiang Zheng · Xinggang Wang
The Sparsely Annotated Object Detection (SAOD) tackles the issue of incomplete labeling in object detection. Compared with Fully Annotated Object Detection (FAOD), SAOD is more complicated and challenging. Unlabeled objects tend to provide wrong supervision to the detectors during training, resulting in inferior performance for prevalent object detectors. Shrinking the performance gap between SAOD and FAOD does contribute to reducing the labeling cost. Existing methods tend to exploit pseudo-labeling for unlabeled objects while suffering from two issues: (1) they fail to make full use of unlabeled objects mined from the student detector and (2) the pseudo-labels contain much noise. To tackle those two issues, we introduce Co-Student, a novel framework aiming to bridge the gap between SAOD and FAOD via fully exploiting the pseudo-labels from both teacher and student detectors. The proposed Co-Student comprises a sophisticated teacher to denoise the pseudo-labels for unlabeled objects and two collaborative students that leverage strong and weak augmentations to excavate pseudo-labels. The students exchange the denoised pseudo-labels and learn from each other with consistency regularization brought by strong-weak augmentations. Without bells and whistles, the proposed Co-Student framework with the one-stage detector, i.e. FCOS, can achieve state-of-the-art performance on the COCO dataset with sparse annotations under diverse settings. Compared to previous works, it obtains 1.0%~3.0% AP improvements under five settings of sparse annotations and achieves 95.1% performance compared to FCOS trained on fully annotated COCO dataset. Code and models will be made publicly available for further research on SAOD.
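A schematic training step conveys how the pieces interact: a teacher denoises pseudo-labels for unlabeled objects, while two students trained on strong and weak augmentations exchange their own denoised pseudo-labels under a consistency term. All interfaces below (`denoise`, `detection_loss`, `consistency_loss`, the detector callables) are placeholders for illustration, not the authors' code.

```python
def co_student_step(teacher, student_a, student_b, images, sparse_gt,
                    weak_aug, strong_aug, denoise, detection_loss, consistency_loss):
    """One Co-Student-style step under sparse annotations (schematic)."""
    weak, strong = weak_aug(images), strong_aug(images)

    # Teacher mines and denoises pseudo-labels for objects missing from sparse_gt.
    pseudo_t = denoise(teacher(weak), sparse_gt)

    # Each student also mines pseudo-labels, which the *other* student consumes.
    preds_a, preds_b = student_a(strong), student_b(weak)
    pseudo_a, pseudo_b = denoise(preds_a, sparse_gt), denoise(preds_b, sparse_gt)

    # Supervise each student with sparse GT, teacher pseudo-labels, and the
    # exchanged student pseudo-labels, plus strong-weak consistency.
    loss = (detection_loss(preds_a, sparse_gt, pseudo_t, pseudo_b) +
            detection_loss(preds_b, sparse_gt, pseudo_t, pseudo_a) +
            consistency_loss(preds_a, preds_b))
    return loss
```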
Plain-Det: A Plain Multi-Dataset Object Detector
cheng Shi · yuchen zhu · Sibei Yang
Recent advancements in large-scale foundational models have sparked widespread interest in training highly proficient large vision models. A common consensus revolves around the necessity of aggregating extensive, high-quality annotated data. However, given the inherent challenges in annotating dense tasks in computer vision, such as object detection and segmentation, a practical strategy is to combine and leverage all available data for training purposes. In this work, we propose Plain-Det, which offers flexibility to accommodate new datasets, robustness in performance across diverse datasets, training efficiency, and compatibility with various detection architectures. We utilize Def-DETR, with the assistance of Plain-Det, to achieve a mAP of 51.9 on COCO, matching the current state-of-the-art detectors. We conduct extensive experiments on 13 downstream datasets and Plain-Det demonstrates strong generalization capability. Code will be made publicly available.
Multi-scale Cross Distillation for Object Detection in Aerial Images
Kun Wang · Zi Wang · Zhang Li · Xichao Teng · Yang Li
Object detection in aerial images is a longstanding yet challenging task. Despite the significant advancements in recent years, most works still show unsatisfactory performance due to the scale variation of objects. A standard strategy to address this problem is multi-scale training, aiming at learning scale-invariant feature representations. Albeit achieving inspiring improvements, such a multi-scale strategy is impractical for real applications, as inference time increases considerably. Besides, the original images are resized to different scales and subsequently trained separately, lacking information interaction across scales. In this paper, we present a novel method called multi-scale cross distillation (MSCD) to address the above-mentioned issues. MSCD combines the merits of multi-scale training and knowledge distillation, enabling single-scale inference to achieve comparable or superior performance to multi-scale inference. Specifically, we first construct a parallel multi-branch architecture, in which each branch shares the same parameters yet takes images of different scales as input. Furthermore, we design an adaptive cross-scale distillation module that adaptively integrates the knowledge of different branches into a single one. Thus, detectors trained with MSCD only require single-scale inference. Extensive experiments demonstrate the effectiveness of MSCD. Without bells and whistles, MSCD enables prevalent two-stage detectors to outperform corresponding single-scale models by ~5 mAP and ~7 mAP on the DOTA and DIOR-R datasets, respectively.
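The shared-weight multi-branch setup with cross-scale distillation can be sketched as follows: the same backbone runs on several resized copies of the input, and a distillation term pulls the target branch toward the (detached) fused features of the other branches. The mean fusion and MSE loss are simple stand-ins for the paper's adaptive cross-scale distillation module.

```python
import torch
import torch.nn.functional as F

def cross_scale_distillation(backbone, images, scales=(0.5, 1.0, 1.5), target_scale=1.0):
    """Distill multi-scale branch features into the single-scale branch.

    backbone: shared-weight feature extractor, images -> (B, C, H, W).
    All branches run the same backbone on resized inputs; the other branches'
    features are resized to the target branch's resolution and averaged to
    form a detached teacher signal.
    """
    feats = {}
    for s in scales:
        x = F.interpolate(images, scale_factor=s, mode="bilinear", align_corners=False)
        feats[s] = backbone(x)

    target = feats[target_scale]
    others = [F.interpolate(feats[s], size=target.shape[-2:], mode="bilinear",
                            align_corners=False)
              for s in scales if s != target_scale]
    teacher = torch.stack(others).mean(dim=0).detach()
    return F.mse_loss(target, teacher)
```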
PDT UAV Target Detection Dataset for Pests and Diseases Tree
Mingle Zhou · Rui Xing · Delong Han · Zhiyong Qi · Gang Li
UAVs emerge as the optimal carriers for visual weed identification and integrated pest and disease management in crops. However, the absence of specialized datasets impedes the advancement of model development in this domain. To address this, we have developed the Pests and Diseases Tree dataset (PDT dataset). The PDT dataset represents the first high-precision UAV-based dataset for targeted detection of tree pests and diseases; it is collected in real-world operational environments and aims to fill the gap in available datasets for this field. Moreover, by aggregating public datasets and network data, we further introduce the Common Weed and Crop dataset (CWC dataset) to address the inadequate classification capabilities of test models on datasets in this field. Finally, we propose the YOLO-Dense Pest (YOLO-DP) model for high-precision object detection of weed, pest, and disease crop images. We re-evaluate the state-of-the-art detection methods with our proposed PDT dataset and CWC dataset, showing the completeness of the datasets and the effectiveness of YOLO-DP. The proposed PDT dataset, CWC dataset, and YOLO-DP method are presented at https://github.com/eccv-Anonymity/PDTCWCYOLO-DP. (Now it's an anonymous URL for review, and the datasets will be republished on the project home page upon acceptance.)
Region-Adaptive Transform with Segmentation Prior for Image Compression
Yuxi Liu · Wenhan Yang · Huihui Bai · Yunchao Wei · Yao Zhao
Learned Image Compression (LIC) has shown remarkable progress in recent years. Existing works commonly employ CNN-based or self-attention-based modules as transform methods for compression. However, there is no prior research on neural transforms that focus on specific regions. In response, we introduce class-agnostic segmentation masks (i.e., semantic masks without category labels) for extracting region-adaptive contextual information. Our proposed module, Region-Adaptive Transform, applies adaptive convolutions on different regions guided by the masks. Additionally, we introduce a plug-and-play module named Scale Affine Layer to incorporate rich contexts from various regions. While there have been prior image compression efforts that involve segmentation masks as additional intermediate inputs, our approach differs significantly from them. Our advantage lies in that, to avoid extra bitrate overhead, we treat these masks as privileged information, which is accessible during the model training stage but not required during the inference phase. To the best of our knowledge, we are the first to employ class-agnostic masks as privileged information and achieve superior performance in pixel-fidelity metrics, such as Peak Signal to Noise Ratio (PSNR). The experimental results demonstrate our improvement compared to previously well-performing methods, with about 8.2% bitrate saving compared to VTM-17.0. The code will be released.
FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification
Yu Tian · Congcong Wen · Min Shi · Muhammad Muneeb Afzal · Hao Huang · Muhammad Osama Khan · Yan Luo · Yi Fang · Mengyu Wang
Addressing fairness in artificial intelligence (AI), particularly in medical AI, is crucial for ensuring equitable healthcare outcomes. Recent efforts to enhance fairness have introduced new methodologies and datasets in medical AI. However, the fairness issue under the setting of domain transfer is almost unexplored, even though it is common for clinics to rely on different imaging technologies (e.g., different retinal imaging modalities) for patient diagnosis. This paper presents FairDomain, a pioneering systemic study into algorithmic fairness under domain shifts, employing state-of-the-art domain adaptation (DA) and generalization (DG) algorithms for both medical segmentation and classification tasks to understand how biases are transferred between different domains. We also introduce a novel plug-and-play fair identity attention (FIA) module that adapts to various DA and DG algorithms to improve fairness by using self-attention to adjust feature importance based on demographic attributes. Additionally, we curate the first fairness-focused dataset with two paired imaging modalities for the same patient cohort on medical segmentation and classification tasks, to rigorously assess fairness in domain-shift scenarios. Excluding the confounding impact of demographic distribution variation between source and target domains allows clearer quantification of the performance of domain transfer models. Our extensive evaluations reveal that the proposed FIA significantly enhances fairness-accounted model performance across all domain shift settings (i.e., DA and DG) with respect to different demographics, outperforming existing methods on both segmentation and classification. The code and data for this paper can be accessed at https://github.com/anonymous4science/FairDomain.
CC-SAM: Enhancing SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation
Shreyank Narayana Gowda · David A Clifton
The Segment Anything Model (SAM) has achieved remarkable successes in the realm of natural image segmentation, but its deployment in the medical imaging sphere has encountered challenges. Specifically, the model struggles with medical images that feature low contrast, faint boundaries, intricate morphologies, and small-sized objects. To address these challenges and enhance SAM's performance in the medical domain, we introduce a comprehensive modification. Firstly, we incorporate a frozen Convolutional Neural Network (CNN) branch as an image encoder, which synergizes with SAM's original Vision Transformer (ViT) encoder through a novel cross-branch attention module. This integration bolsters the model's capability to capture local spatial information, which is often paramount in medical imagery. Moreover, to further optimize SAM for medical imaging, we introduce feature and position adapters within the ViT branch, refining the encoder's representations. We find that, compared to current prompting strategies for fine-tuning SAM for medical segmentation, using text descriptions as text prompts for SAM significantly improves performance. Leveraging ChatGPT's natural language understanding capabilities, we generate prompts that offer contextual information and guidance to SAM, enabling it to better understand the nuances of medical images and improve its segmentation accuracy. Our method, in its entirety, represents a significant stride towards making universal image segmentation models more adaptable and efficient in the medical domain.
Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model
Seonghui Min · Hyun-Jic Oh · Won-Ki Jeong
In multi-class histopathology nuclei analysis tasks, the lack of training data becomes a main bottleneck for the performance of learning-based methods. To tackle this challenge, previous methods have utilized generative models to increase data by generating synthetic samples. However, existing methods often overlook the importance of considering the context of biological tissues (e.g., shape, spatial layout, and tissue type) in the synthetic data. Moreover, while generative models have shown superior performance in synthesizing realistic histopathology images, none of the existing methods are capable of producing image-label pairs at the same time. In this paper, we introduce a novel framework for co-synthesizing histopathology nuclei images and paired semantic labels using a context-conditioned joint diffusion model. We propose conditioning of a diffusion model using nucleus centroid layouts with structure-related text prompts to incorporate spatial and structural context information into the generation targets. Moreover, we enhance the granularity of our synthesized semantic labels by generating instance-wise nuclei labels using distance maps synthesized concurrently in conjunction with the images and semantic labels. We demonstrate the effectiveness of our framework in generating high-quality samples on multi-institutional, multi-organ, and multi-modality datasets. Our synthetic data consistently outperforms existing augmentation methods in the downstream tasks of nuclei segmentation and classification.
DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification
Wenhui Zhu · Xiwen Chen · Peijie Qiu · Aristeidis Sotiras · Abolfazl Razi · Yalin Wang
Multiple instance learning (MIL) stands as a powerful approach in weakly supervised learning, regularly employed in histological whole slide image (WSI) classification for detecting lesions. However, existing MIL mainstream methods focus on modeling correlation between instances while overlooking the inherent diversity among instances. Meanwhile, few MIL methods aimed at diversity modeling have emerged, showing a performance gap with mainstream MIL methods and facing diversity limits due to computational constraints. To bridge this gap, we propose a novel MIL aggregation method based on diverse global representation (DGR-MIL), by modeling diversity among instances through a set of global vectors. As a result, the global vectors serve as a summary of all instances. First, we turn the instance correlation into the similarity between instance embeddings and the predefined global vectors through a cross-attention mechanism. This stems from the fact that similar instance embeddings typically would result in a higher correlation with a certain global vector. Second, we propose two mechanisms to enforce the diversity among the global vectors to be more descriptive of the entire bag: (i) positive instance alignment and (ii) a novel, efficient, and theoretically guaranteed diversification learning. Specifically, the positive instance alignment module encourages the global vectors to align with instances of interest center (e.g., tumor WSI/bag). To further diversify the global representations, we propose a novel diversity loss. The proposed model outperforms the state-of-the-art MIL aggregation models by a substantial margin on the CAMELYON-16 and the TCGA-lung cancer datasets. The source code will be released upon acceptance.
Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network
Sukwon Yun · Jie Peng · Alexandro E Trevino · Chanyoung Park · Tianlong Chen
Recent advancements in graph-based approaches for multiplexed immunofluorescence (mIF) images have significantly propelled the field forward, offering deeper insights into patient-level phenotype prediction. However, current graph-based methodologies encounter two primary challenges: ① Cellular Heterogeneity, where existing approaches fail to adequately address the inductive biases inherent in graphs, particularly the homophily characteristic observed in cellular connectivity; and ② Scalability, where handling cellular graphs from high-dimensional images faces difficulties in managing a high number of cells. To overcome these limitations, we introduce m^2IF, a novel multiplex network framework designed to efficiently process mIF images. m^2IF innovatively constructs a multiplex network comprising two distinct layers: a Voronoi network for geometric information and a Cell-type network for capturing cell-wise homogeneity. This framework equips a scalable and efficient Graph Neural Network (GNN), capable of processing the entire graph during training. Furthermore, m^2IF integrates an interpretable attention module that autonomously identifies relevant layers for image classification. Extensive experiments on a real-world patient dataset from various institutions highlight m^2IF’s remarkable efficacy and efficiency, marking a significant advancement in mIF analysis. m^2IF not only addresses the prevalent challenges in graph-based ML for mIF images but also establishes a new benchmark for accuracy and scalability in the domain.
An Incremental Unified Framework for Small Defect Inspection
Jiaqi Tang · Hao Lu · Xiaogang Xu · Ruizheng Wu · Sixing Hu · Tong Zhang · Tsz Wa Cheng · Ming Ge · Ying-Cong Chen · Fugee Tsung
Artificial Intelligence (AI)-driven defect inspection is pivotal in industrial manufacturing. However, existing inspection systems are typically designed for specific industrial products and struggle with diverse product portfolios and evolving processes. Although some previous studies attempt to address object dynamics by storing embeddings in the reserved memory bank, these methods suffer from memory capacity limitations and object distribution conflicts. To tackle these issues, we propose the Incremental Unified Framework (IUF), which integrates incremental learning into a unified reconstruction-based detection method, thus eliminating the need for feature storage in the memory. Based on IUF, we introduce Object-Aware Self-Attention (OASA) to delineate distinct semantic boundaries. We also integrate Semantic Compression Loss (SCL) to optimize non-primary semantic space, enhancing network adaptability for new objects. Additionally, we prioritize retaining the features of established objects during weight updates. Demonstrating prowess in both image and pixel-level defect inspection, our approach achieves state-of-the-art performance, supporting dynamic and scalable industrial inspections. Our code will be released.
Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection
Jian Shi · Pengyi Zhang · Ni Zhang · Hakim Ghazzai · Peter Wonka
Medical imaging often contains critical fine-grained features, such as tumors or hemorrhages, crucial for diagnosis yet potentially too subtle for detection with conventional methods. In this paper, we introduce \textit{DIA}, dissolving is amplifying. DIA is a fine-grained anomaly detection framework for medical images. First, we introduce \textit{dissolving transformations}. We employ diffusion with a generative diffusion model as a dedicated feature-aware denoiser. Applying diffusion to medical images in a certain manner can remove or diminish fine-grained discriminative features. Second, we introduce an \textit{amplifying framework} based on contrastive learning to learn a semantically meaningful representation of medical images in a self-supervised manner, with a focus on fine-grained features. The amplifying framework contrasts additional pairs of images with and without dissolving transformations applied and thereby emphasizes the dissolved fine-grained features. DIA significantly improves the medical anomaly detection performance with around 18.40\% AUC boost against the baseline method and achieves an overall SOTA against other benchmark methods. Our code is available at \url{http://}.
GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features
Luc Sträter · Mohammadreza Salehi · Efstratios Gavves · Cees Snoek · Yuki M Asano
In the domain of anomaly detection, methods often excel in either semantic or industrial benchmarks, rarely achieving cross-domain proficiency. In this paper, we present GeneralAD, an anomaly detection framework designed to operate in semantic, near-distribution, and industrial settings with minimal per-task adjustments. In our approach, we capitalize on the inherent design of Vision Transformers, which are trained on image patches, thereby ensuring that the last hidden states retain a patch-based structure. We propose a novel self-supervised anomaly generation module that employs straightforward operations like noise addition and shuffling to patch features to construct pseudo-abnormal samples. These features are fed to an attention-based discriminator, which is trained to score every patch in the image. With this, our method can both accurately identify anomalies at the image level and also generate interpretable anomaly maps. We extensively evaluated our approach on 10 benchmarks, achieving state-of-the-art results in 6 datasets and on-par performance in the remaining for both localization and detection tasks.
MoEAD: A Parameter-efficient Model for Multi-class Anomaly Detection
Shiyuan Meng · Wenchao Meng · Qihang Zhou · Shizhong Li · Weiye Hou · Shibo He
Utilizing a unified model to detect multi-class anomalies is a promising solution to real-world anomaly detection. Despite their appeal, such models typically suffer from large model parameters and thus pose a challenge to their deployment on memory-constrained embedding devices. To address this challenge, this paper proposes a novel ViT-style multi-class detection approach named MoEAD, which can reduce the model size while simultaneously maintaining its detection performance. Our key insight is that the FFN layers within each stacked block (i.e., transformer blocks in ViT) mainly characterize the unique representations in these blocks, while the remaining components exhibit similar behaviors across different blocks. The finding motivates us to squeeze traditional stacked transformed blocks from N to a single block, and then incorporate Mixture of Experts (MoE) technology to adaptively select the FFN layer from an expert pool in every recursive round. This allows MoEAD to capture anomaly semantics step-by-step like ViT and choose the optimal representations for distinct class anomaly semantics, even though it shares parameters in all blocks with only one. Extensive experiments show that, compared to the state-of-the-art (SOTA) anomaly detection methods, MoEAD achieves a desirable trade-off between performance and memory consumption. It not only employs the smallest model parameters, has the fastest inference speed, but also obtains competitive detection performance.
PQ-SAM: Post-training Quantization for Segment Anything Model
Xiaoyu Liu · Xin Ding · Lei Yu · Yuanyuan Xi · WEI LI · Zhijun Tu · jie hu · Hanting Chen · Baoqun YIN · Zhiwei Xiong
Segment anything model (SAM) is a promising prompt-guided vision foundation model to segment objects of interest. However, the extensive computational requirements of SAM have limited its applicability in resource-constraint edge devices. Post-training quantization (PTQ) is an effective potential for fast-deploying SAM. Nevertheless, SAM's billion-scale pretraining creates a highly asymmetric activation distribution with detrimental outliers in excessive channels, resulting in significant performance degradation of the low-bit PTQ. In this paper, we propose PQ-SAM, the first PTQ method customized for SAM. To achieve a quantization-friendly tensor-wise distribution, PQ-SAM incorporates a novel grouped activation distribution transformation (GADT) based on a two-stage outlier hierarchical clustering (OHC) scheme to scale and shift each channel. Firstly, OHC identifies and truncates extreme outliers to reduce the scale variance of different channels. Secondly, OHC iteratively allocates learnable shifting and scaling sizes to each group of channels with similar distributions, reducing the number of learnable parameters and easing the optimization difficulty. These shifting and scaling sizes are used to adjust activation channels, and jointly optimized with quantization step sizes for optimal results. Extensive experiments demonstrate that PQ-SAM outperforms existing PTQ methods on nine zero-shot datasets, and pushes the 4-bit PTQ of SAM to a usable level.
BKDSNN: Enhancing the Performance of Learning-based Spiking Neural Networks Training with Blurred Knowledge Distillation
Zekai Xu · Kang You · Qinghai Guo · Xiang Wang · Zhezhi He
Spiking neural networks (SNNs), which mimic biological neural system to convey information via discrete spikes, are well known as brain-inspired models with excellent computing efficiency. By utilizing the surrogate gradient estimation for discrete spikes, learning-based SNN training methods that can achieve ultra-low inference latency (number of time-step) emerge recently. Nevertheless, due to the difficulty in deriving precise gradient estimation for discrete spikes using learning-based method, a distinct accuracy gap persists between SNN and its artificial neural networks (ANNs) counterpart. To address the aforementioned issue, we propose a blurred knowledge distillation (BKD) technique, which leverages random blurred SNN feature to restore and imitate the ANN feature. Note that, our BKD is applied upon the feature map right before the last layer of SNN, which can also mix with prior logits-based knowledge distillation for maximized accuracy boost. To our best knowledge, in the category of learning-based methods, our work achieves state-of-the-art performance for training SNNs on both static and neuromorphic datasets. On ImageNet dataset, BKDSNN outperforms prior best results by 4.51% and 0.93% with the network topology of CNN and Transformer respectively.
ELSE: Efficient Deep Neural Network Inference through Line-based Sparsity Exploration
Zeqi Zhu · Alberto Garcia-Ortiz · Luc Waeijen · Egor Bondarev · Arash Pourtaherian · Orlando Moreira
Brain-inspired computer architecture facilitates low-power and low-latency deep neural network inference for edge AI applications. The hardware performance crucially hinges on the quantity of non-zero activations (referred to as events) during DNN inference. Thus, we propose a novel event suppression method, dubbed ELSE, which enhances DNN Efficiency via Line-based Sparsity Exploration. Specifically, it exploits spatial correlation between adjacent lines in activation maps to reduce network events. Our method achieves a reduction in event-triggered computation ranging from 2.43x to 5.75x for object detection and from 3.7x to 6.49x for pose estimation across various networks compared to conventional processing. Moreover, we empirically demonstrate that a layerwise mixed approach incorporating ELSE with other prominent event suppression methods enables a substantial enhancement in computation savings by up to 8.83x in spatial suppression, or effectively reduces memory consumption by 2~4x in temporal suppression. The results highlight ELSE's significant event suppression ability and its capacity to deliver complementary performance enhancements for state-of-the-art (SOTA) approaches.
FairViT: Fair Vision Transformer via Adaptive Masking
Bowei Tian · Ruijie Du · Yanning Shen
Vision Transformer (ViT) has achieved excellent performance and demonstrated its promising potential in various computer vision tasks. The wide deployment of ViT in real-world tasks requires a thorough understanding of the societal impact of the model. However, most ViT-based works do not take fairness into account and it is unclear whether directly applying CNN-oriented debiased algorithm to ViT is feasible. Moreover, previous works typically sacrifice accuracy for fairness. Therefore, we aim to develop an algorithm that improves accuracy without sacrificing fairness. In this paper, we propose FairViT, a novel fair ViT framework. To this end, we introduce a novel distance loss and deploy adaptive fairness-aware masks on attention layers updating with model parameters. Experimental results show FairViT can achieve accuracy better than other alternatives, even with competitive computational efficiency. Furthermore, FairViT achieves appreciable fairness results.
LPViT: Low-Power Semi-structured Pruning for Vision Transformers
KAIXIN Xu · Zhe Wang · Chunyun Chen · Xue Geng · Jie Lin · Xulei Yang · Min Wu · Xiaoli Li · Weisi Lin
Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks (CNNs) for various image analysis tasks, offering comparable or superior performance. However, one significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, computation complexity, and power consumption. To democratize this high-performance technology and make it more environmentally friendly, it is essential to compress ViT models, reducing their resource requirements while maintaining high performance. In this paper, we introduce a new block-structured pruning to address the resource-intensive issue for ViTs, offering a balanced trade-off between accuracy and hardware acceleration. Unlike unstructured pruning or channel-wise structured pruning, block pruning leverages the block-wise structure of linear layers, resulting in more efficient matrix multiplications. To optimize this pruning scheme, our paper proposes a novel hardware-aware learning objective that simultaneously maximizes speedup and minimizes power consumption during inference, tailored to the block sparsity structure. This objective eliminates the need for empirical look-up tables and focuses solely on reducing parametrized layer connections. Moreover, our paper provides a lightweight algorithm to achieve post-training pruning for ViTs, utilizing second-order Taylor approximation and empirical optimization to solve the proposed hardware-aware objective. Extensive experiments on ImageNet are conducted across various ViT architectures, including DeiT-B and DeiT-S, demonstrating competitive performance with other pruning methods and achieving a remarkable balance between accuracy preservation and power savings. Especially, we achieve up to 3.93x and 1.79x speedups on dedicated hardware and GPUs respectively for DeiT-B, and also observe an inference power reduction by 1.4x on real-world GPUs.
PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference
Tanvir Mahmud · Burhaneddin Yaman · Chun-Hao Liu · Diana Marculescu
As deep neural networks evolve from convolutional neural networks (ConvNets) to advanced vision transformers (ViTs), there is an increased need to eliminate redundant data for faster processing without compromising accuracy. Previous methods are often architecture-specific or necessitate re-training, restricting their applicability with frequent model updates. To solve this, we first introduce a novel property of lightweight ConvNets: their ability to identify key discriminative patch regions in images, irrespective of model's final accuracy or size. We demonstrate that fully-connected layers are the primary bottleneck for ConvNets performance, and their suppression with simple weight recalibration markedly enhances discriminative patch localization performance. Using this insight, we introduce PaPr, a method for substantially pruning redundant patches with minimal accuracy loss using lightweight ConvNets across a variety of deep learning architectures, including ViTs, ConvNets, and hybrid transformers, without any re-training. Moreover, the simple early-stage one-step patch pruning with PaPr enhances existing patch reduction methods. Through extensive testing on diverse architectures, PaPr achieves significantly higher accuracy over state-of-the-art patch reduction methods with similar FLOP count reduction. More specifically, PaPr reduces about 70% of redundant patches in videos with less than 0.8% drop in accuracy, and up to 3.7x FLOPs reduction, which is a 15% more reduction with 2.5% higher accuracy. Code will be released.
CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs
Akshat Ramachandran · Souvik Kundu · Tushar Krishna
We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs). We identify the limitations of recent data-free quantization techniques, notably their inability to leverage meaningful inter-patch relationships, leading to the generation of simplistic and semantically vague data, impacting quantization accuracy. CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization. Specifically, we incorporate a patch-level contrastive learning scheme to generate richer, semantically meaningful data. Furthermore, we leverage contrastive learning in layer-wise evolutionary search for fixed- and mixed-precision quantization to identify optimal quantization parameters while mitigating the effects of a non-smooth loss landscape. Extensive evaluations across various vision tasks demonstrate the superiority of CLAMP-ViT, with performance improvements of up to 3% in top-1 accuracy for classification, 0.6 mAP for object detection, and 1.5 mIoU for segmentation at similar or better compression ratio over existing alternatives
Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach
Taolin Zhang · Jiawang Bai · Zhihe Lu · Dongze Lian · genping wang · Xinchao Wang · Shu-Tao Xia
Recent works on parameter-efficient transfer learning (PETL) show the potential to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters. However, since they usually insert new structures into the pre-trained model, entire intermediate features of that model are changed and thus need to be stored to be involved in back-propagation, resulting in memory-heavy training. We solve this problem from a novel disentangled perspective, i.e., dividing PETL into two aspects: task-specific learning and pre-trained knowledge utilization. Specifically, we synthesize the task-specific query with a learnable and lightweight module, which is independent of the pre-trained model. The synthesized query equipped with task-specific knowledge serves to extract the useful features for downstream tasks from the intermediate representations of the pre-trained model in a query-only manner. Built upon these features, a customized classification head is proposed to make the prediction for the input sample. Given that our method employs a extremely lightweight architecture and avoids the use of heavy intermediate features for running gradient descent, it demonstrates limited memory usage in training. Notably, extensive experiments manifest that our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
Characterizing Model Robustness via Natural Input Gradients
Adrian Rodriguez-Munoz · Tongzhou Wang · Antonio Torralba
Adversarially robust models are locally smooth around each data sample so that small perturbations cannot drastically change model outputs. In modern systems, such smoothness is usually obtained via Adversarial Training, which explicitly enforces models to perform well on perturbed examples. In this work, we show the surprising effectiveness of instead regularizing the gradient with respect to model inputs on natural examples only. Penalizing input Gradient Norm is commonly believed to be a much inferior approach. Our analyses identify that the performance of Gradient Norm regularization critically depends on the smoothness of activation functions, and are in fact extremely effective on modern vision transformers that adopt smooth activations over piecewise linear ones (eg, ReLU). On ImageNet-1k, Gradient Norm training achieves > 90% performance of state-of-the-art PGD-3 Adversarial Training (52% vs. 56%), while using only 60% computation cost of the state-of-the-art without complex adversarial optimization. Our analyses further highlight the relationship between model robustness and properties of natural input gradients, such as asymmetric channel statistics. Surprisingly, we also find model robustness can be significantly improved by simply regularizing its gradients to focus on image edges without explicit conditioning on the norm.
Dropout Mixture Low-Rank Adaptation for Visual Parameters-Efficient Fine-Tuning
Zhengyi Fang · Yue Wang · Ran Yi · Lizhuang Ma
Parameter-efficient fine-tuning methods adjust a small subset of parameters in large models, achieving performance comparable to or even surpassing that of models fine-tuned with the full parameter set, and significantly reducing the time and computational costs associated with the fine-tuning process. Despite the developments of parameter-efficient fine-tuning methods for large models, we observe significant performance disparities across different vision tasks. We attribute this pronounced performance variability to the insufficient robustness of current parameter-efficient fine-tuning methods. In this paper, we propose a robust reparameterization framework for parameter-efficient fine-tuning. This framework has a dynamic training structure and introduces no additional computational overhead during the inference stage. Specifically, we propose Dropout-Mixture Low-Rank Adaptation (DMLoRA), which incorporates multiple up and down branches, to provide the model with a more robust gradient descent path. As training proceeds, DMLoRA gradually drops out branches to achieve a balance between accuracy and regularization. Additionally, we employ a 2-Stage Learning Scalar (LS) strategy to optimize the scale factor for each layer's DMLoRA module. Experimental results demonstrate that our method achieves state-of-the-art performance on the benchmark VTAB-1k and FGVC datasets for parameter-efficient fine-tuning.
FreeAugment: Data Augmentation Search Across All Degrees of Freedom
Tom Bekor · Niv Nayman · Lihi Zelnik-Manor
Data augmentation has become an integral part of deep learning, as it is known to improve the generalization capabilities of neural networks. Since the most effective set of image transformations differs between tasks and domains, automatic data augmentation search aims to alleviate the extreme burden of manually finding the optimal image transformations. However, current methods are not able to jointly optimize all degrees of freedom: (1) the number of transformations to be applied, their (2) types, (3) order, and (4) magnitudes. Many existing methods risk picking the same transformation more than once, limit the search to two transformations only, or search for the number of transformations exhaustively or iteratively in a myopic manner. Our approach, FreeAugment, is the first to achieve global optimization of all four degrees of freedom simultaneously, using a fully differentiable method. It efficiently learns the number of transformations and a probability distribution over their permutations, inherently refraining from redundant repetition while sampling. Our experiments demonstrate that this joint learning of all degrees of freedom significantly improves performance, achieving state-of-the-art results on various natural image benchmarks and beyond across other domains.
Towards Multi-modal Transformers in Federated Learning
Guangyu Sun · Matias Mendieta · Aritra Dutta · Xin Li · Chen Chen
Multi-modal transformers mark significant progress in different domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite its potential, a considerable research direction regarding the unpaired uni-modal clients and the transformer architecture in FL remains unexplored. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is utilized and introduce a novel framework called Federated modality complementary and collaboration (FedCola) by addressing the in-modality and cross-modality gaps among clients. Through extensive experiments across various FL settings, FedCola demonstrates superior performance over previous approaches, offering new perspectives on future federated training of multi-modal transformers.
Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception
TIANYOU LUO · Quan Yuan · Yuchen Xia · Guiyang Luo · Yujia Yang · Jinglin Li
Sharing intermediate neural features enables agents to effectively see through occlusions. Due to agent diversity, some pioneering works have studied domain adaption for heterogeneous neural features. Nevertheless, these works all partially replace agents’ private neural network with newly trained components, which breaks the model integrity and bidirectional compatibility of agents. In this paper, we consider an open challenge: how to learn non-destructive domain adapters for heterogeneous legacy models to achieve collaborative percepingg while compatible with continually emerging new agent models? To overcome this challenge, we propose the first plug-and-play domain adapter (PnPDA) for heterogeneous collaborative perception. PnPDA builds a semantic calibrator based on contrastive learning to supervise domain gap bridging without destructing the original models. Semantic converter is learned to transform the semantic space of features, while semantic enhancer is utilized to enhance the representation of features. By specifying standard semantics, new models with PnPDA can easily join existing collaborations. Extensive experiments on OPV2V dataset show that PnPDA non-destructively bridges the domain gap and outperforms SOTA by 9.13%.
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
Xiaojie Li · Yibo Yang · Xiangtai Li · Jianlong Wu · Yue Yu · Bernard Ghanem · Min Zhang
Self-supervised learning has achieved remarkable success in acquiring high-quality representations from unlabeled data. The widely adopted contrastive learning framework aims to learn invariant representations by minimizing the distance between positive views originating from the same image. However, existing techniques to construct positive views highly rely on manual transformations, resulting in limited diversity and potentially false positive pairs. To tackle these challenges, we present GenView, a controllable framework that augments the diversity of positive views leveraging the power of pretrained generative models while preserving semantics. %These models have the capability to generate highly diversified images while preserving the semantics of conditional embeddings. We develop an adaptive view generation method that dynamically adjusts the noise level in sampling to ensure the preservation of essential semantic meaning while introducing variability. Additionally, we introduce a quality-driven contrastive loss, which assesses the quality of positive pairs by considering both foreground similarity and background diversity. This loss prioritizes the high-quality positive pairs we construct while reducing the influence of low-quality pairs, thereby mitigating potential semantic inconsistencies introduced by generative models and aggressive data augmentation. Thanks to the improved positive view quality and the quality-driven contrastive loss, GenView significantly improves self-supervised learning across various tasks. For instance, GenView improves MoCov2 performance by 2.5%/2.2% on ImageNet linear/semi-supervised classification. Moreover, GenView even performs much better than naively augmenting the ImageNet dataset with Laion400M or ImageNet21K. Our code is included in the supplementary material and will be released publicly.
Soft Prompt Generation for Domain Generalization
Shuanghao Bai · Yuedi Zhang · Wanqi Zhou · Zhirong Luan · Badong Chen
Large pre-trained vision language models (VLMs) have shown impressive zero-shot ability on downstream tasks with manually designed prompt, which are not optimal for specific domains. To further adapt VLMs to downstream tasks, soft prompt is proposed to replace manually designed prompt, which acts as a learning vector that undergoes fine-tuning based on specific domain data. Prior prompt learning methods primarily learn a fixed prompt and residuled prompt from training samples. However, the learned prompts lack diversity and ignore information about unseen domains, potentially compromising the transferability of the prompts. In this paper, we reframe the prompt learning framework from a generative perspective and propose a simple yet efficient method for the Domain Generalization (DG) task, namely Soft Prompt Generation (SPG). To the best of our knowledge, we are the first to introduce the generative model into prompt learning in VLMs and explore its potential for producing soft prompts by relying solely on the generative model, ensuring the diversity of prompts. Specifically, SPG consists of a two-stage training phase and an inference phase. During the training phase, we introduce soft prompt labels for each domain, aiming to incorporate the generative model domain knowledge. During the inference phase, the generator of the generative model is employed to obtain instance-specific soft prompts for the unseen target domain. Extensive experiments on five domain generalization benchmarks of three DG tasks demonstrate that our proposed SPG achieves state-of-the-art performance. The code is available in supplementary materials.
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Ankit Vani · Bac Nguyen · Samuel Lavoie · Ranjay Krishna · Aaron Courville
Selective attention is an intrinsic property of human cognition, allowing us to focus on relevant stimuli while filtering out distractions. The information bottleneck principle suggests a similar adaptation for generalizing in machine learning, but representation learning frameworks typically do not have prior knowledge of downstream tasks. In this work, we present SPARO, a read-out mechanism for transformers that provides an inductive bias for learning representations comprised of different ways of performing selective attention. Concretely, it structures representations as a concatenation of outputs from separate single-head attention operations with embedded queries. SPARO improves generalization of CLIP on zero-shot recognition, robustness, retrieval, and compositionality benchmarks, and improves linear probe accuracy on ImageNet. It also improves the ImageNet linear probe and k-nearest neighbor accuracies of DINO. We showcase the ability to post-hoc intervene and select concepts for downstream tasks from the SPARO representation, which can offer further improvements. We provide insights behind the design of SPARO, including ablation experiments, analysis of its representational robustness, and visualization of the attended concepts.
Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
Sukrut Rao · Sweta Mahajan · Moritz Böhle · Bernt Schiele
Concept Bottleneck Models (CBMs) have recently been proposed to address the `black-box' problem of deep neural networks, by first mapping images to a human-understandable concept space and then linearly combining concepts for classification. Such models typically require first coming up with a set of concepts relevant to the task and then aligning the representations of a feature extractor to map to these concepts. However, even with powerful foundational feature extractors like CLIP, there are no guarantees that the specified concepts are detectable, calling into question the faithfulness of using them as explanations. In this work, we leverage recent advances in mechanistic interpretability and propose a novel CBM approach --- called Discover-then-Name-CBM (DN-CBM) --- that inverts the typical paradigm: instead of pre-selecting concepts based on the downstream classification task, we use sparse autoencoders to first discover concepts learnt by the model, and then name them and train linear probes for classification. Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model. We perform a comprehensive evaluation across multiple datasets and CLIP architectures and show that our method yields semantically meaningful concepts, assigns appropriate names to them that make them easy to interpret, and yields performant and interpretable CBMs.
Deep Online Probability Aggregation Clustering
Yuxuan Yan · Na Lu · Ruofan Yan
Combining machine clustering with deep models has shown remarkable superiority in deep clustering, which modifies the data processing pipeline into two alternating phases: feature clustering and model training. However, such alternating schedule may lead to instability and computational burden issues. To tackle these challenges, we propose a centerless clustering algorithm called Probability Aggregation Clustering (PAC) to proactively adapt deep learning technologies, enabling easy deployment in online deep clustering. PAC circumvents the cluster center and aligns the probability space and distribution space by formulating clustering as an optimization problem with a novel objective function. Based on the computation mechanism of the PAC, we propose a general online probability aggregation mperform stable feature clustering overodule to mini-batch data and further construct a deep visual clustering framework deep PAC (DPAC). Extensive experiments demonstrate that PAC has superior clustering robustness and performance and DPAC remarkably outperforms the state-of-the-art deep clustering methods.
Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection
Harsh Shah · Kashish Mittal · Ajit Rajwade
This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search. Like other methods for large scale retrieval, our approach exploits the assumption that most of the items in the database are unrelated to the query. Unlike other methods, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following a multi-stage adaptive group testing algorithm based on binary splitting, we divide the set of items to be searched into half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We experimentally show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy, on a variety of large datasets. Based on empirically verified models for the distribution of cosine distances, we present a theoretical analysis of the expected number of distance computations per query and the probability that a pool with a certain number of members will be pruned. Our method has the following features: (i) It exploits useful distributional properties of cosine distances unlike other methods; (ii) All required data structures are created purely offline; (iii) It does not impose any strong assumptions on the number of true near neighbors; (iv) It is adaptable to streaming settings where new vectors are dynamically added to the database; and (v) It does not require any parameter tuning. The high recall of our technique makes it particularly suited to plagiarism detection scenarios where it is important to report every database item that is sufficiently similar item to the query.
An accurate detection is not all you need to combat label noise in web-noisy datasets
Paul Albert · Kevin McGuinness · Eric Arazo · Tarun Krishna · Noel O Connor · Jack Valmadre
Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed offer an accurate detection of OOD samples, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean examples that are valuable for supervised learning. These examples often represent visually simple images, which are relatively easy to identify as clean examples using standard loss- or distance-based methods despite being poorly separated from the OOD distribution using unsupervised learning. This urges us to propose a hybrid solution that alternates between noise detection using linear separation and a state-of-the-art (SOTA) small-loss approach. When combined with the SOTA algorithm PLS, we substantially improve SOTA results for real-world image classification in the presence of web noise. The code for reproducing the experiments in this paper will be publicly released upon acceptance.
Flexible Distribution Alignment: Towards Long-tailed Semi-supervised Learning with Proper Calibration
Emanuel Sanchez Aimar · Nathaniel D Helgesen · Yonghao Xu · Marco Kuhlmann · Michael Felsberg
Long-tailed semi-supervised learning (LTSSL) represents a practical scenario for semi-supervised applications, challenged by skewed labeled distributions that bias classifiers. This problem is often aggravated by discrepancies between labeled and unlabeled class distributions, leading to biased pseudo-labels, neglect of rare classes, and poorly calibrated probabilities. To address these issues, we introduce Flexible Distribution Alignment (FlexDA), a novel adaptive logit-adjusted loss framework designed to dynamically estimate and align predictions with the actual distribution of unlabeled data and achieve a balanced classifier by the end of training. FlexDA is further enhanced by a distillation-based consistency loss, promoting fair data usage across classes and effectively leveraging underconfident samples. This method, encapsulated in ADELLO (Align and Distill Everything All at Once), proves robust against label shift, significantly improves model calibration in LTSSL contexts, and surpasses previous state-of-of-art approaches across multiple benchmarks, including CIFAR100-LT, STL10-LT, and ImageNet127, addressing class imbalance challenges in semi-supervised learning. Our code will be made available upon paper acceptance.
ExMatch: Self-guided Exploitation for Semi-Supervised Learning with Scarce Labeled Samples
Noo-ri Kim · Jin-Seop Lee · Jee-Hyong LEE
Semi-supervised learning is a learning method that uses both labeled and unlabeled samples to improve the performance of the model while reducing labeling costs. When there were tens to hundreds of labeled samples, semi-supervised learning methods showed good performance, but most of them showed poor performance when only a small number of labeled samples were given. In this paper, we focus on challenging label-scarce environments, where there are only a few labeled samples per class. Our proposed model, ExMatch, is designed to obtain reliable information from unlabeled samples using self-supervised models and utilize it for semi-supervised learning. In the training process, ExMatch guides the model to maintain an appropriate distribution and resist learning from incorrect pseudo-labels based on the information from self-supervised models and its own model. ExMatch shows very stable training progress and the state-of-the-art performances on multiple benchmark datasets. In extremely label-scare situations, performances are improved by about 5% to 21% for CIFAR-10/100 and SVHN. ExMatch also demonstrates significant performance improvements in high-resolution and large-scale dataset such as STL-10, Tiny-ImageNet, and ImageNet.
Dynamic Data Selection for Efficient SSL via Coarse-to-Fine Refinement
Aditay Tripathi · Pradeep Shenoy · Anirban Chakraborty
Self-supervised learning (SSL) is critical for learning high-quality representations from unlabeled images at scale. Earlier efforts at reducing the compute requirements of SSL have focused on identifying subsets of training data that are sufficient for training. In addition to using a static representative subset, these methods also require small amounts of labeled data for scoring instances. In this work, we design a new family of algorithms that exploits the training dynamics of SSL methods and adjusts the selected subset throughout the training process. Our proposal has two key components: a) a \textit{coarse-to-fine refinement} schedule for training data, where initial training rounds are performed on larger subsets of data, and the selected subset shrinks throughout the training process, and b) the use of an \textit{unsupervised proxy model} that dynamically selects training instances based on their informativeness for the model’s current state. We also use the proxy model to speed up initial learning by aligning the representations of the primary and proxy models using an additional regularization loss. We validate our method on public benchmarks (CIFAR100, CIFAR10, TinyImagenet, and STL10) and document significant gains in our compute-accuracy tradeoff compared to previous approaches. Notably, we show a 31.6\% reduction in computational load on TinyImagenet while maintaining classification accuracy.
SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery
Sarah Rastegar · Mohammadreza Salehi · Yuki M Asano · Hazel Doughty · Cees Snoek
In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called self-expertise', which enhances the model's ability to recognize subtle differences and uncover unknown categories. Our approach combines unsupervised and supervised self-expertise strategies to refine the model's discernment and generalization. Initially, hierarchical pseudo-labeling is used to provide
soft supervision', improving the effectiveness of self-expertise. Our supervised technique differs from traditional methods by utilizing more abstract positive and negative samples, aiding in the formation of clusters that can generalize to novel categories. Meanwhile, our unsupervised strategy encourages the model to sharpen its category distinctions by considering within-category examples as `hard' negatives. Supported by theoretical insights, our empirical results showcase that our method outperforms existing state-of-the-art techniques in Generalized Category Discovery across several fine-grained datasets.
Dynamic Retraining-Updating Mean Teacher for Source-Free Object Detection
BA KHANH TRINH LE · Huy-Hung Nguyen · Long Hoang Pham · Duong Nguyen-Ngoc Tran · Jae Wook Jeon
In object detection, unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. However, UDA's reliance on labeled source data restricts its adaptability in privacy-related scenarios. This study focuses on source-free object detection (SFOD), which adapts a source-trained detector to an unlabeled target domain without using labeled source data. Recent advancements in self-training, particularly with the Mean Teacher (MT) framework, show promise for SFOD deployment. However, the absence of source supervision significantly compromises the stability of these approaches. We identify two primary issues, (1) uncontrollable degradation of the teacher model due to inopportune updates from the student model, and (2) the student model's tendency to replicate errors from incorrect pseudo labels, leading to being trapped in a local optimum. Both factors contribute to a detrimental circular dependency, resulting in rapid performance degradation in recent self-training frameworks. To tackle these challenges, we propose the Dynamic Retraining-Updating (DRU) mechanism, which actively manages the student training and teacher updating processes to achieve co-evolutionary training. Additionally, we introduce Historical Student Loss to mitigate the influence of incorrect pseudo labels. Our method achieves state-of-the-art performance in the SFOD setting on multiple domain adaptation benchmarks, comparable to or even surpassing advanced UDA methods. The code will be released soon.
Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization
Naiyu Yin · Hanjing Wang · Yue Yu · Tian Gao · Amit Dhurandhar · Qiang Ji
The pursuit of generalizable representations remains to be a dynamic field in the realm of machine learning and computer vision. Existing methods aim to secure invariant representations by either harnessing domain expertise or leveraging data from multiple domains. In this paper, we propose a novel approach that identifies the Causal Markov Blanket (CMB) representations and improves the Out-of-distribution prediction performance. We establish a framework guided by a structural causal model (SCM) describing the data generation process, allowing for the causal Markov Blanket discovery in the latent space. We then construct an invariant prediction mechanism using CMB features, suitable for performing prediction across domains. In comparison to state-of-the-art domain generalization methods, our approach exhibits robustness and adaptability under distribution shifts.
Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence
Mengyao Lyu · Tianxiang Hao · Xinhao Xu · Hui Chen · Zijia Lin · Jungong Han · Guiguang Ding
Domain Adaptation (DA) facilitates knowledge transfer from a source domain to a related target domain. In this paper, we investigate a practical DA paradigm, namely Source data-Free Active Domain Adaptation (SFADA), where source data becomes inaccessible during adaptation, and a minimum amount of annotation budget is available in the target domain. Without referencing the source data, new challenge emerges in identifying the most informative target samples for labeling, establishing cross-domain alignment during adaptation, and ensuring continuous performance improvements through the iterative query-and-adaptation process. We present learn from the learnt (LFTL), a novel paradigm for SFADA to leverage the learnt knowledge from the source pretrained model and actively iterated models without extra overhead. We propose Contrastive Active Sampling to learn from the hypotheses of the preceding model, thereby querying target samples that are both informative to the current model and persistently challenging throughout active learning. During adaptation, we learn from features of actively selected anchors obtained from previous intermediate models, so that the Visual Persistence-guided Adaptation can facilitate feature distribution alignment and active sample exploitation. Extensive experiments on three widely-used benchmarks show that our LFTL achieves state-of-the-art performance, superior computational efficiency and continuous improvements as the annotation budget increases. Our code will be available at https://github.com/xxx.
On the Approximation Risk of Few-Shot Class-Incremental Learning
Xuan Wang · Zhong Ji · Xiyao Liu · Yanwei Pang · Jungong Han
Few-Shot Class-Incremental Learning (FSCIL) aims to learn new concepts with few training samples while preserving previously acquired knowledge. Although promising performance has been achieved, there remains an underexplored aspect regarding the basic statistical principles underlying FSCIL. Therefore, we thoroughly explore the approximation risk of FSCIL, encompassing both transfer and consistency risks. By tightening the upper bounds of these risks, we derive practical guidelines for designing and training FSCIL models. These guidelines include (1) expanding training datasets for base classes, (2) preventing excessive focus on specific features, (3) optimizing classification margin discrepancy, and (4) ensuring unbiased classification across both base and novel classes. Leveraging these insights, we conduct comprehensive experiments to validate our principles, achieving state-of-the-art performance on three FSCIL benchmark datasets.
STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay
Yu Yongcan · Lijun Sheng · Ran He · Jian Liang
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time. Existing TTA methods often focus on improving recognition performance specifically for test data associated with classes in the training set. However, during the open-world inference process, there are inevitably test data instances from unknown classes, commonly referred to as outliers. This paper pays attention to the problem that conducts both sample recognition and outlier rejection during inference while outliers exist. To address this problem, we propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch. In particular, the memory bank is dynamically updated by selecting low-entropy and label-consistent samples in a class-balanced manner. In addition, we develop a self-weighted entropy minimization strategy that assigns higher weight to low-entropy samples. Extensive results demonstrate that STAMP outperforms existing TTA methods in terms of both recognition and outlier detection performance. Code is attached to the supplementary materials.
RCS-Prompt: Learning Prompt to Rearrange Class Space for Prompt-based Continual Learning
Longrong Yang · Hanbin Zhao · Yunlong Yu · Xiaodong Zeng · Xi Li
Prompt-based Continual Learning is an emerging direction in leveraging pre-trained knowledge for downstream continual learning. While arriving at a new session, existing prompt-based continual learning methods usually adapt features from pre-trained models to new data by introducing prompts. However, these prompts lack an optimization objective explicitly modeling inter-session class relationships, thus failing to construct clear inter-session class margins. Moreover, some old samples use new prompts during inference, resulting in the prompt-ambiguity overlap space - a special situation where old and new class spaces overlap. To address these issues, we propose an innovative approach called RCS-Prompt to rearrange class space by bidirectionally optimizing prompts. RCS-Prompt optimizes prompts to signify discriminative regions across different sessions in the class space. Additionally, it mitigates the prompt-ambiguity overlap space by altering the labels of a small subset of new samples to old classes and training them with a customized symmetric loss. The proposed method effectively reduces the overlap between old and new class spaces, thereby establishing clear inter-session class margins. We extensively evaluate RCS-Prompt on public datasets, demonstrating its effectiveness in prompt-based continual learning.
CLEO: Continual Learning of Evolving Ontologies
Shishir Muralidhara · Saqib Bukhari · Georg Dr. Schneider · Didier Stricker · René Schuster
Continual learning (CL) addresses the problem of catastrophic forgetting in neural networks, which occurs when a trained model tends to overwrite previously learned information, when presented with a new task. CL aims to instill the lifelong learning characteristic of humans in intelligent systems, making them capable of learning continuously while retaining what was already learned. Current CL problems involve either learning new domains (domain-incremental) or new and previously unseen classes (class-incremental). However, general learning processes are not just limited to learning information, but also refinement of existing information. In this paper, we define CLEO -- Continual Learning of Evolving Ontologies, as a new incremental learning setting under CL to tackle evolving classes. CLEO is motivated by the need for intelligent systems to adapt to real-world ontologies that change over time, such as those in autonomous driving. We use Cityscapes, PASCAL VOC, and Mapillary Vistas to define the task settings and demonstrate the applicability of CLEO. We highlight the shortcomings of existing CIL methods in adapting to CLEO and propose a baseline solution, called Modelling Ontologies (MoOn). CLEO is a promising new approach to CL that addresses the challenge of evolving ontologies in real-world applications. MoOn surpasses previous CL approaches in the context of CLEO.
Learning Representation for Multitask Learning through Self-Supervised Auxiliary Learning
Seokwon Shin · Hyungrok Do · Youngdoo Son
Multi-task learning is a popular machine learning approach that enables simultaneous learning of multiple related tasks, improving algorithmic efficiency and effectiveness. In the hard parameter sharing approach, an encoder shared through multiple tasks generates data representations passed to task-specific predictors. Therefore, it is crucial to have a shared encoder that provides decent representations for every and each task. However, despite recent advances in multi-task learning, the question of how to improve the quality of representations generated by the shared encoder remains open. To address this gap, we propose a novel approach called Dummy Gradient Norm Regularization (DGR) that aims to improve the universality of the representations generated by the shared encoder. Specifically, the method decreases the norm of the gradient of the loss function with respect to dummy task-specific predictors to improve the universality of the shared encoder’s representations. Through experiments on multiple multi-task learning benchmark datasets, we demonstrate that DGR effectively improves the quality of the shared representations, leading to better multi-task prediction performances. Applied to various classifiers, the shared representations generated by DGR also show superior performance compared to existing multi-task learning methods. Moreover, our approach takes advantage of computational efficiency due to its simplicity. The simplicity also allows us to integrate DGR with the existing multi-task learning algorithms seamlessly.
Improving Knowledge Distillation via Regularizing Feature Direction and Norm
Yuzhu Wang · Lechao Cheng · Manni Duan · Yongheng Wang · Zunlei Feng · Shu Kong
Knowledge distillation (KD) is a model compression technique that exploits a large, well-trained {\tt teacher} neural network to train a small {\tt student} network. Treating the {\tt teacher}'s features as knowledge, prevailing methods train the {\tt student} by aligning its features with the {\tt teacher}'s, e.g., by minimizing the KL-divergence or L2-distance between their (logit) features. While it is natural to assume that better feature alignment helps distill the {\tt teacher}'s knowledge, simply forcing this alignment does not directly contribute to the {\tt student}'s performance, e.g., classification accuracy. For example, minimizing the L2 distance between the penultimate-layer features (used to compute logits for classification) does not necessarily help learn a better {\tt student} classifier. This motivates us to regularize the {\tt student}'s penultimate-layer features using the {\tt teacher} so as to train a better {\tt student} classifier. Specifically, we present a rather simple method that uses the {\tt teacher}'s class-mean features to align {\tt student} features w.r.t. their {\em direction}. Experiments show that this significantly improves KD performance. Moreover, we empirically find that the {\tt student} produces features with notably smaller norms than the {\tt teacher}'s, motivating us to regularize the {\tt student} to produce large-norm features. Experiments show that doing so also yields better performance. Finally, as our main technical contribution, we present a simple loss that regularizes the {\tt student} by simultaneously (1) aligning the \emph{direction} of its features with the {\tt teacher}'s class-mean features, and (2) encouraging it to produce large-\emph{norm} features. Experiments on standard benchmarks demonstrate that adopting our technique remarkably improves existing KD methods, achieving state-of-the-art KD performance on image classification (ImageNet and CIFAR100) and object detection (COCO).
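A hedged sketch of a direction-and-norm feature regularizer for KD in PyTorch. Here `teacher_class_means` would be precomputed from the frozen teacher, and the cosine-based direction term, the simple negative-norm term, and the loss weights are illustrative placeholders rather than the paper's exact formulation.

```python
# Illustrative loss; weights and the norm term are assumptions.
import torch
import torch.nn.functional as F

def direction_norm_loss(student_feats, labels, teacher_class_means,
                        w_dir=1.0, w_norm=0.1):
    """student_feats: (B, D) penultimate-layer features.
    teacher_class_means: (C, D) per-class mean features of the teacher."""
    target = teacher_class_means[labels]                      # (B, D)
    # (1) Direction alignment: maximize cosine similarity with the
    #     teacher's class-mean feature of the ground-truth class.
    dir_loss = 1.0 - F.cosine_similarity(student_feats, target, dim=1).mean()
    # (2) Norm term: encourage large-norm student features by penalizing
    #     the negative mean L2 norm.
    norm_loss = -student_feats.norm(dim=1).mean()
    return w_dir * dir_loss + w_norm * norm_loss

feats = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 100, (16,))
class_means = torch.randn(100, 128)
direction_norm_loss(feats, labels, class_means).backward()
```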
MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution
Yuxuan Jiang · Chen Feng · Fan Zhang · David Bull
Knowledge distillation (KD) has emerged as a promising technique in deep learning, typically employed to enhance a compact student network by learning from its high-performance but more complex teacher variant. When applied to image super-resolution, most KD approaches are modified versions of methods developed for other computer vision tasks, based on training strategies with a single teacher and simple loss functions. In this paper, we propose a novel Multi-Teacher Knowledge Distillation (MTKD) framework specifically for image super-resolution. It exploits the advantages of multiple teachers by combining and enhancing their outputs, which then guide the learning process of the compact student network. To achieve more effective learning, we also develop a new wavelet-based loss function for MTKD, which better optimizes the training process by observing differences in both the spatial and frequency domains. We fully evaluate the effectiveness of the proposed method by comparing it to five commonly used KD methods for image super-resolution on three popular network architectures. The results show that the proposed MTKD method achieves evident improvements in super-resolution performance, up to 0.46 dB (PSNR), over state-of-the-art KD approaches across different network structures. The source code of MTKD will be made available here for public evaluation.
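A minimal sketch of a wavelet-domain distillation loss in PyTorch. A single-level Haar transform stands in for the paper's wavelet-based loss, and the teacher ensemble is aggregated by simple averaging; both choices are illustrative assumptions, not the authors' exact design.

```python
# Sketch: Haar subbands + L1 in spatial and frequency domains; aggregation by mean is an assumption.
import torch
import torch.nn.functional as F

def haar_subbands(x):
    """Single-level Haar decomposition of an NCHW tensor into LL/LH/HL/HH bands."""
    c = x.shape[1]
    base = torch.tensor([[[0.5, 0.5], [0.5, 0.5]],     # LL
                         [[0.5, 0.5], [-0.5, -0.5]],   # LH
                         [[0.5, -0.5], [0.5, -0.5]],   # HL
                         [[0.5, -0.5], [-0.5, 0.5]]],  # HH
                        device=x.device, dtype=x.dtype)
    weight = base.repeat(c, 1, 1).unsqueeze(1)          # (4*c, 1, 2, 2)
    return F.conv2d(x, weight, stride=2, groups=c)

def mtkd_loss(student_sr, teacher_srs, alpha=1.0, beta=1.0):
    """student_sr: (N, C, H, W); teacher_srs: list of (N, C, H, W) teacher outputs."""
    target = torch.stack(teacher_srs, dim=0).mean(dim=0)   # aggregated teacher output
    spatial = F.l1_loss(student_sr, target)                # spatial-domain term
    freq = F.l1_loss(haar_subbands(student_sr), haar_subbands(target))
    return alpha * spatial + beta * freq

student_out = torch.randn(2, 3, 64, 64, requires_grad=True)
teacher_outs = [torch.randn(2, 3, 64, 64) for _ in range(3)]
mtkd_loss(student_out, teacher_outs).backward()
```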
Federated Learning with Local Openset Noisy Labels
Zonglin Di · Zhaowei Zhu · Xiaoxiao Li · Yang Liu
Federated learning (FL) is a learning paradigm that allows the central server to learn from different data sources while keeping the data private locally. Without control of or visibility into the local data collection process, the locally available training labels are likely noisy, i.e., the collected training labels differ from the unobservable ground truth. Additionally, in heterogeneous FL, each local client may only have access to a subset of the label space (referred to as openset label learning) that does not overlap with other clients'. In this work, we study the challenge of FL with local openset noisy labels. We observe that many existing solutions in the noisy-label literature, e.g., loss correction, are ineffective during local training because they overfit to noisy labels and do not generalize to openset labels, while existing FL methods instead rely on sharing different estimated metrics. To address these problems, we design a label communication mechanism that shares "contrastive labels", randomly selected from clients, with the server. The privacy of the shared contrastive labels is protected by label differential privacy (DP). Both the DP guarantee and the effectiveness of our approach are theoretically established. Compared with several baseline methods, our solution demonstrates its effectiveness on several public benchmarks and real-world datasets under different noise ratios and noise models.
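A hedged illustration of protecting shared labels with label differential privacy. Randomized response is a standard label-DP mechanism and is used here only as an example; whether it matches the paper's exact mechanism, and the choice of epsilon and subset size, are assumptions.

```python
# Randomized response as a generic label-DP example; not claimed to be the paper's mechanism.
import math
import random

def randomized_response(label, num_classes, epsilon):
    """Keep `label` with prob e^eps / (e^eps + K - 1); otherwise return a uniform other class."""
    keep_prob = math.exp(epsilon) / (math.exp(epsilon) + num_classes - 1)
    if random.random() < keep_prob:
        return label
    other = random.randrange(num_classes - 1)
    return other if other < label else other + 1

# A client randomly selects a few local labels and privatizes them before sending to the server.
local_labels = [3, 7, 7, 1, 9]
shared = [randomized_response(y, num_classes=10, epsilon=2.0)
          for y in random.sample(local_labels, k=3)]
print(shared)
```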
Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents
Yuqi Jia · Saeed Vahidian · Jingwei Sun · Jianyi Zhang · Vyacheslav Kungurtsev · Neil Zhenqiang Gong · Yiran Chen
Data heterogeneity presents significant challenges for federated learning (FL). Recently, dataset distillation techniques have been introduced and performed at the client level to mitigate some of these challenges. In this paper, we propose a highly efficient FL dataset distillation framework on the \emph{server} side, significantly reducing both the computational and communication demands on local devices while enhancing the clients' privacy. Unlike previous strategies that perform dataset distillation on local devices and upload synthetic data to the server, our technique enables the server to leverage prior knowledge from pre-trained deep generative models to synthesize essential data representations from a heterogeneous model architecture. This process allows local devices to train smaller surrogate models while enabling the training of a larger global model on the server, effectively minimizing resource utilization. We substantiate our claim with a theoretical analysis, demonstrating the asymptotic resemblance of the process to the hypothetical ideal of completely centralized training on a heterogeneous dataset. Empirical evidence from our comprehensive experiments indicates our method's superiority, delivering an accuracy enhancement of up to 40\% over non-dataset-distillation techniques in highly heterogeneous FL contexts, and surpassing existing dataset-distillation methods by 18\%. In addition to the high accuracy, our framework converges faster than the baselines because the server trains on a single multi-modal distribution rather than on several sets of heterogeneous data distributions.
FedHARM: Harmonizing Model Architectural Diversity in Federated Learning
Anestis Kastellos · Athanasios Psaltis · Charalampos Z Patrikakis · Petros Daras
In the domain of Federated Learning (FL), managing variability in model architectures is more than a mere technical barrier; it represents a crucial aspect of the field's evolution, especially considering the ever-increasing number of model architectures emerging in the literature. This focus on architectural variability stems from the unique nature of FL, where diverse devices or participants, each with their own data and computational constraints, collaboratively train a shared model. The proposed FL system architecture facilitates the deployment of diverse convolutional neural network (CNN) architectures across distinct clients while outperforming state-of-the-art FL methodologies. FedHARM capitalizes on the strengths of different architectures while limiting their weaknesses by converging each local client on a shared dataset to achieve superior performance on the test set.
Causal Subgraphs and Information Bottlenecks: Redefining OOD Robustness in Graph Neural Networks
Weizhi An · Wenliang Zhong · Feng Jiang · Hehuan Ma · Junzhou Huang
Graph Neural Networks (GNNs) are increasingly popular for processing graph-structured data, yet they face significant challenges when training and testing distributions diverge, as is common in real-world scenarios. This divergence often leads to substantial performance drops in GNN models. To address this, our study introduces a novel approach that effectively enhances GNN performance in Out-of-Distribution (OOD) scenarios. We propose CSIB, a method guided by causal modeling principles that generates causal subgraphs while concurrently considering both Fully Informative Invariant Features (FIIF) and Partially Informative Invariant Features (PIIF) situations. Our approach uniquely combines the principles of invariant risk minimization and the graph information bottleneck. This integration not only guides the generation of causal subgraphs but also underscores the necessity of balancing invariance principles with information compression in the face of various distribution shifts. We validate our model through extensive experiments across diverse shift types, demonstrating its effectiveness in maintaining robust performance under OOD conditions.
Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks
Jing Wu · Mehrtash Harandi
Machine unlearning has become a pivotal task for erasing the influence of data from a trained model. It adheres to recent data regulation standards and enhances the privacy and security of machine learning applications. In this work, we present a new machine unlearning approach, Scissorhands. Initially, Scissorhands identifies the most pertinent parameters in the given model relative to the forgetting data via connection sensitivity. By reinitializing the most influential top-k percent of these parameters, a trimmed model for erasing the influence of the forgetting data is obtained. Subsequently, Scissorhands fine-tunes the trimmed model with a gradient projection-based approach, seeking parameters that preserve information on the remaining data while discarding information related to the forgetting data. Our experimental results, conducted across image classification and image generation tasks, demonstrate that Scissorhands showcases competitive performance compared to existing methods. Source code is available at https://github.com/AnonymousUser-hi/Scissorhands.
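A minimal sketch of the trimming step via connection sensitivity, assuming PyTorch and a classification model. Sensitivity is approximated SNIP-style as |weight x gradient| on the forget set; the top-k fraction and the small-Gaussian reinitialization are illustrative assumptions, and the subsequent gradient-projection fine-tuning is omitted.

```python
# Sketch of connection-sensitivity trimming; scoring rule and reinit scheme are assumptions.
import torch
import torch.nn.functional as F

def trim_by_connection_sensitivity(model, forget_loader, topk=0.05):
    model.zero_grad()
    for x, y in forget_loader:                       # accumulate gradients on the forget set
        F.cross_entropy(model(x), y).backward()
    scores, params = [], []
    for p in model.parameters():
        if p.grad is not None:
            scores.append((p.detach() * p.grad).abs().flatten())
            params.append(p)
    all_scores = torch.cat(scores)
    k = max(1, int(topk * all_scores.numel()))
    threshold = all_scores.topk(k).values.min()      # k-th largest sensitivity
    with torch.no_grad():                            # reinitialize the most influential weights
        for p in params:
            mask = (p.detach() * p.grad).abs() >= threshold
            p[mask] = torch.randn_like(p)[mask] * 0.01
    model.zero_grad()
    return model
```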
Shedding More Light on Robust Classifiers under the lens of Energy-based Models
Mujtaba Hussain Mirza · Maria Rosaria Briglia · Senad Beadini · Iacopo Masi
By reinterpreting a robust discriminative classifier as an Energy-based Model (EBM), we offer a new take on the dynamics of adversarial training (AT). By analyzing the energy landscape of AT, we show that untargeted attacks generate adversarial images that are much more in-distribution (lower energy) than the original data; we observe the opposite for targeted attacks. On the ground of our thorough analysis, we present new theoretical and practical results that show how interpreting AT energy dynamics unlocks a better understanding: (1) the AT dynamic is governed by three phases, and robust overfitting occurs in the third phase with a drastic divergence between natural and adversarial energies; (2) by rewriting TRADES as an EBM, we show that TRADES implicitly alleviates overfitting by aligning the natural energy with the adversarial one; (3) we empirically show that all recent state-of-the-art robust classifiers smooth the energy landscape, and we reconcile a variety of studies on understanding AT and weighting the loss function under the umbrella of EBMs. Motivated by this rigorous evidence, we propose Weighted Energy Adversarial Training (WEAT), a novel sample-weighting scheme that yields robust accuracy matching the state of the art on benchmarks such as CIFAR-10 and SVHN and surpassing it on CIFAR-100 and Tiny-ImageNet. We further show that robust classifiers vary in the intensity and quality of their generative capabilities, and offer a simple method to push this capability further, reaching a remarkable Inception Score (IS) and FID using a robust classifier without training for generative modeling. Our models will be released on RobustBench, and the code for reproducing our work can be found at hidden link.
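A sketch of the JEM-style energy of a discriminative classifier, E(x) = -logsumexp_y f(x)[y], which underlies the EBM reinterpretation above. The energy-based sample weighting shown afterwards is only an illustration of the general idea behind a weighted AT loss; its direction, temperature, and normalization are assumptions and do not reproduce WEAT's exact weighting function.

```python
# Illustrative energy computation and a hypothetical energy-based sample weighting.
import torch
import torch.nn.functional as F

def energy(logits):
    """EBM energy of a classifier: E(x) = -logsumexp_y f(x)[y]."""
    return -torch.logsumexp(logits, dim=1)                 # (B,)

def energy_weighted_loss(logits_adv, labels, temperature=1.0):
    e = energy(logits_adv).detach()
    # Assumption for illustration: lower-energy (more in-distribution) samples get larger weight.
    w = torch.softmax(-e / temperature, dim=0) * len(e)
    per_sample = F.cross_entropy(logits_adv, labels, reduction="none")
    return (w * per_sample).mean()

logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
energy_weighted_loss(logits, labels).backward()
```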
Inter-Class Topology Alignment for Efficient Black-Box Substitute Attacks
Lingzhuang Meng · Mingwen Shao · Yuanjian Qiao · Wenjie Liu
In black-box attacks based on substitute training, the similarity of the substitute model to the target model is critical for successful attacks. However, existing schemes merely train the substitute model to mimic the outputs of the target model without fully simulating its decision space, resulting in adversarial samples generated by the substitute model being classified into a non-target class by the target model. To alleviate this issue, we propose a novel Inter-Class Topology Alignment (ICTA) scheme that more comprehensively simulates the target model by aligning the inter-class positional relationships of the two models in the decision space. Specifically, we first design Position Exploration Samples (PES) to more thoroughly explore the relative positional relationships between classes in the decision space of the target model. Subsequently, we align the inter-class topology of the two models by utilizing the PES to constrain the inter-class relative positions of the substitute model in different directions. In this way, the substitute model is more consistent with the target model in the decision space, so that the generated adversarial samples are more likely to mislead the target model into classifying them into the target class. The experimental results demonstrate that our ICTA significantly improves the attack success rate in various scenarios compared to existing substitute training methods, performing particularly well in targeted attacks.
AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models
Xuelong Dai · Kaisheng Liang · Bin Xiao
Unrestricted adversarial attacks present a serious threat to deep learning models and adversarial defense techniques. They pose severe security problems for deep learning applications because they can effectively bypass defense mechanisms. However, previous attack methods often directly inject Projected Gradient Descent (PGD) gradients into the sampling of generative models, which is not theoretically grounded and thus generates unrealistic examples when incorporating adversarial objectives, especially for GAN-based methods on large-scale datasets like ImageNet. In this paper, we propose a new method, called AdvDiff, to generate unrestricted adversarial examples with diffusion models. We design two novel adversarial guidance techniques to conduct adversarial sampling in the reverse generation process of diffusion models. These two techniques are effective and stable in generating high-quality, realistic adversarial examples by interpretably integrating gradients of the target classifier. Experimental results on MNIST and ImageNet demonstrate that AdvDiff is effective at generating unrestricted adversarial examples and outperforms state-of-the-art unrestricted adversarial attack methods in terms of attack performance and generation quality.
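A classifier-guidance-style sketch of adversarial guidance in one reverse diffusion step, assuming PyTorch. Here `denoise_step` stands for any DDPM/DDIM reverse update supplied by the caller, and the guidance form and scale are illustrative; they do not reproduce the paper's two exact guidance rules.

```python
# Illustrative single-step adversarial guidance; the update rule is an assumption.
import torch
import torch.nn.functional as F

def adversarial_guided_step(x_t, t, denoise_step, target_classifier,
                            target_class, scale=1.0):
    x_prev = denoise_step(x_t, t)                         # ordinary reverse diffusion step
    x_prev = x_prev.detach().requires_grad_(True)
    log_p = F.log_softmax(target_classifier(x_prev), dim=1)
    sel = log_p[torch.arange(len(x_prev)), target_class].sum()
    grad = torch.autograd.grad(sel, x_prev)[0]            # d log p(y_target | x) / dx
    return (x_prev + scale * grad).detach()               # nudge the sample toward the target class
```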
FedHide: Federated Learning by Hiding in the Neighbors
Hyunsin Park · Sungrack Yun
We propose a prototype-based federated learning method tailored for embedding networks in classification or verification tasks. Our focus lies in scenarios where each client possesses data from only one class. The central challenge arises from the need to learn an embedding network capable of discriminating between different classes while respecting privacy constraints, since sharing true class prototypes with the server or other clients could compromise sensitive information. To address this, we introduce a proxy class prototype that can be safely shared among clients. Our approach generates the proxy class prototype by linearly combining the true class prototype with its nearest neighbors. This technique conceals the true class prototype while enabling clients to learn discriminative embedding networks. We compare our method against alternative techniques, including random Gaussian noise addition and random selection with cosine similarity constraints. Additionally, we evaluate the robustness of our approach against gradient inversion attacks and introduce a prototype leakage measure to quantify the extent of private information revealed when sharing the proposed proxy class prototype. Furthermore, we provide a theoretical convergence analysis of our approach. Empirical results on three benchmark datasets (CIFAR-100, VoxCeleb1, and VGGFace2) demonstrate the effectiveness of our proposed method for federated learning from scratch.
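A minimal sketch of building a proxy prototype by mixing the true class prototype with its nearest neighbors, assuming PyTorch. The neighbor pool, the number of neighbors k, and the mixing weight alpha are assumptions made for illustration.

```python
# Illustrative proxy-prototype construction; neighbor pool and weights are assumptions.
import torch
import torch.nn.functional as F

def proxy_prototype(true_proto, candidate_protos, k=3, alpha=0.5):
    """true_proto: (D,); candidate_protos: (N, D) prototypes to hide among."""
    sims = F.cosine_similarity(true_proto.unsqueeze(0), candidate_protos, dim=1)
    nn_idx = sims.topk(k).indices
    neighbor_mean = candidate_protos[nn_idx].mean(dim=0)
    proxy = alpha * true_proto + (1.0 - alpha) * neighbor_mean
    return F.normalize(proxy, dim=0)                      # share this instead of true_proto

true_p = F.normalize(torch.randn(128), dim=0)
pool = F.normalize(torch.randn(50, 128), dim=1)
shared = proxy_prototype(true_p, pool)
```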
SIMBA: Split Inference - Mechanisms, Benchmarks and Attacks
Abhishek Singh · Vivek Sharma · Rohan Sukumaran · John J Mose · Jeffrey K Chiu · Justin Yu · Ramesh Raskar
In this work, we tackle the question of how to benchmark the reconstruction of inputs from deep neural network (DNN) representations. This inverse problem is of great importance in the privacy community, where obfuscation of features has been proposed as a technique for privacy-preserving machine learning (ML) inference. In this benchmark, we characterize different obfuscation techniques and design different attack models. We propose multiple reconstruction techniques based upon distinct background knowledge of the adversary. We develop a modular platform that integrates different obfuscation techniques, reconstruction algorithms, and evaluation metrics under a common framework. Using our platform, we benchmark various obfuscation and reconstruction techniques to evaluate their privacy-utility trade-off. Finally, we release a dataset of obfuscated representations to foster research in this area. We have open-sourced code, dataset, hyper-parameters, and trained models, which can be found at \url{https://tiny.cc/simba}.
Data Poisoning Quantization Backdoor Attack
Tran Huynh · Anh Tran · Khoa Doan · Tung Pham
Deep learning (DL) models are often large and require significant computing power. Hence, model quantization is frequently used to reduce their size and complexity, making them more suitable for deployment on edge devices or for achieving real-time performance. It has previously been shown that standard quantization frameworks can be exploited to activate a backdoor in a DL model: an attacker can create a hijacked model that appears normal and free from backdoors (even when examined by state-of-the-art defenses), but once it is quantized, the backdoor is activated and the attacker can control the model's output. Existing backdoor attacks on quantized models require full access to the victim model, which might not hold in practice. In this work, we focus on designing a novel quantization backdoor based on data poisoning, which requires zero knowledge of the target model. The key component is a trigger pattern generator, which is trained together with a surrogate model in an alternating manner. The attack's effectiveness is tested on multiple benchmark datasets, including CIFAR10, CelebA, and ImageNet10, as well as against state-of-the-art backdoor defenses.
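A hedged sketch of the poisoning objective: triggered inputs should behave normally on the full-precision surrogate but flip to the target class after (simulated) quantization. The simple per-tensor weight fake-quantization, the loss composition, and the toy surrogate/trigger-generator networks are assumptions for illustration; the alternating training schedule is omitted.

```python
# Illustrative objective only; quantization simulation and losses are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantized(model, bits=8):
    """Return a copy of `model` with per-tensor uniformly quantized weights."""
    q = copy.deepcopy(model)
    with torch.no_grad():
        for p in q.parameters():
            scale = p.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
            p.copy_((p / scale).round() * scale)
    return q

def poison_loss(surrogate, trigger_gen, x, y, target_class):
    x_trig = (x + trigger_gen(x)).clamp(0, 1)
    clean_logits = surrogate(x_trig)                  # full precision: stay benign
    quant_logits = fake_quantized(surrogate)(x_trig)  # after quantization: backdoor fires
    target = torch.full_like(y, target_class)
    return F.cross_entropy(clean_logits, y) + F.cross_entropy(quant_logits, target)

surrogate = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
trigger_gen = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
poison_loss(surrogate, trigger_gen, x, y, target_class=0).backward()
```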
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
Shishira R Maiya · Anubhav Anubhav · Matthew Gwilliam · Max Ehrlich · Abhinav Shrivastava
Implicit Neural Representations (INRs) have emerged as powerful representations for encoding all forms of data, including images, videos, audio, and scenes. Many video INRs have been proposed for the compression task, and recent methods feature significant improvements in encoding time, storage, and reconstruction quality. However, these encoded representations lack semantic meaning, so they cannot be used for downstream tasks that require such properties, such as retrieval. This acts as a barrier to the adoption of video INRs over traditional codecs, as they offer no significant edge apart from compression. To alleviate this, we propose a flexible framework that decouples the spatial and temporal aspects of the video INR. We accomplish this with a dictionary of per-frame latents that are learned jointly with a set of video-specific hypernetworks, such that given a latent, these hypernetworks can predict the INR weights to reconstruct the corresponding frame. This framework not only retains the compression efficiency, but the learned latents can also be aligned with features from large vision models, which grants them discriminative properties. We align these latents with CLIP and show good performance on both compression and video retrieval tasks. By aligning with VideoLlama, we are able to perform open-ended chat with our learned latents as the visual inputs. Additionally, the learned latents serve as a proxy for the underlying weights, allowing us to perform tasks like video interpolation. These semantic properties and applications, coexisting with the ability to perform compression, interpolation, and super-resolution, are a first in this field of work.
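A toy sketch of the per-frame latent plus hypernetwork idea, assuming PyTorch: a learned dictionary of frame latents, a hypernetwork that maps a latent to the weights of a tiny coordinate MLP, and a cosine alignment term against a (precomputed) CLIP feature. All sizes, the single-hidden-layer INR, and the alignment loss are illustrative assumptions, not the paper's architecture.

```python
# Illustrative latent-dictionary + hypernetwork sketch; shapes and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, hidden = 256, 64
num_frames = 32
frame_latents = nn.Embedding(num_frames, latent_dim)      # learned per-frame latent dictionary

class HyperINR(nn.Module):
    """Maps a frame latent to the weights of a tiny coordinate MLP (x, y) -> RGB."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(latent_dim, 2 * hidden + hidden + hidden * 3 + 3)
    def forward(self, z, coords):                          # coords: (P, 2) in [-1, 1]
        w = self.head(z)
        w1, b1, w2, b2 = torch.split(w, [2 * hidden, hidden, hidden * 3, 3])
        h = torch.sin(coords @ w1.view(2, hidden) + b1)    # SIREN-style activation
        return h @ w2.view(hidden, 3) + b2                 # predicted RGB values

hyper = HyperINR()
z = frame_latents(torch.tensor(0))
coords = torch.rand(1024, 2) * 2 - 1
rgb_pred = hyper(z, coords)                                # reconstruct frame 0 at sampled coords
recon_loss = F.mse_loss(rgb_pred, torch.rand(1024, 3))     # placeholder frame colors
clip_feat = torch.randn(latent_dim)                        # placeholder CLIP feature of the frame
align_loss = 1 - F.cosine_similarity(z, clip_feat, dim=0)
(recon_loss + 0.1 * align_loss).backward()
```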
Generalizing to Unseen Domains via Text-guided Augmentation
Daiqing Qi · Handong Zhao · Aidong Zhang · Sheng Li
To avoid the high cost of collecting visual data from all test domains for domain adaptation, recent work takes advantage of pre-trained large-scale vision-language models, such as CLIP, and augments training data with only text descriptions (e.g., ``a photo/painting/sketch...'') of each test domain. However, in many real-world applications, such text information about test domains is not available in advance. Moreover, even if we could verbalize all test domains, existing work (Dunlap et al., 2023) must laboriously train a different augmentation network for each possible unseen domain, which is time-inefficient. To overcome these challenges, we exploit the multimodal embedding space of a pre-trained vision-language model and propose to acquire training-free, domain-invariant augmentations from text descriptions of arbitrarily crafted unseen domains, which need not match the test domains. Beyond achieving state-of-the-art results, our approach is notably more time-efficient than existing works that require trainable augmentation networks and enjoys more solid theoretical support. Code will be publicly available.
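A training-free sketch of text-guided feature augmentation in CLIP's embedding space, assuming the `clip` package (https://github.com/openai/CLIP) is installed. The specific prompts and the additive direction-shift form are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative CLIP-space augmentation; prompts and shift rule are assumptions.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def text_direction(src_prompt, tgt_prompt):
    toks = clip.tokenize([src_prompt, tgt_prompt]).to(device)
    feats = model.encode_text(toks).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    d = feats[1] - feats[0]
    return d / d.norm()

@torch.no_grad()
def augment_image_features(img_feats, direction, strength=0.5):
    """Shift normalized image features toward a crafted unseen domain."""
    aug = img_feats + strength * direction
    return aug / aug.norm(dim=-1, keepdim=True)

# Example: shift features from "a photo" toward an arbitrarily crafted domain.
d = text_direction("a photo", "a charcoal sketch")
img_feats = torch.randn(4, 512, device=device)
img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
aug_feats = augment_image_features(img_feats, d)
```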
Event Trojan: Asynchronous Event-based Backdoor Attacks
Ruofei Wang · Qing Guo · Haoliang Li · Renjie Wan
This paper uncovers the possibility of directly poisoning event data streams by proposing the Event Trojan framework, which includes two kinds of triggers, i.e., immutable and mutable triggers. Specifically, both types of event triggers are based on sequences of simulated event spikes, which can be easily incorporated into any event stream to initiate backdoor attacks. Additionally, for the mutable trigger, we design an adaptive learning mechanism to maximize its aggressiveness. To improve stealthiness, we introduce a novel loss function that constrains the generated content of mutable triggers, minimizing the difference between triggers and original events while maintaining effectiveness. Extensive experiments on public event datasets show the effectiveness of the proposed backdoor triggers. We hope that this paper can draw greater attention to the potential threats posed by backdoor attacks on event-based tasks.
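A minimal sketch of injecting an immutable event-spike trigger into an asynchronous event stream, assuming events are (x, y, t, polarity) rows in a NumPy array. The trigger layout (a small pixel patch firing periodically), its timing, and the sensor resolution are illustrative assumptions.

```python
# Illustrative event-stream poisoning; trigger pattern and parameters are assumptions.
import numpy as np

def inject_trigger(events, x0=5, y0=5, size=3, t_start=0.0, t_end=0.05, rate=1e-4):
    """events: (N, 4) array with columns x, y, t, polarity (0/1)."""
    ts = np.arange(t_start, t_end, rate)
    xs, ys = np.meshgrid(np.arange(x0, x0 + size), np.arange(y0, y0 + size))
    trig = np.array([[x, y, t, 1.0] for t in ts
                     for x, y in zip(xs.ravel(), ys.ravel())])
    poisoned = np.concatenate([events, trig], axis=0)
    return poisoned[np.argsort(poisoned[:, 2])]          # keep events in temporal order

clean = np.random.rand(1000, 4)
clean[:, 0] *= 128; clean[:, 1] *= 128                   # x, y on a 128x128 sensor
clean[:, 3] = (clean[:, 3] > 0.5).astype(float)          # binary polarity
poisoned = inject_trigger(clean)
```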