

Poster Session

Poster Session 6

Exhibition Area
Thu 3 Oct 7:30 a.m. PDT — 9:30 a.m. PDT


# 161
Exact Diffusion Inversion via Bidirectional Integration Approximation

Guoqiang Zhang · j.p. lewis · W. Bastiaan Kleijn

Recently, various methods have been proposed to address the inconsistency issue of DDIM inversion to enable image editing, such as EDICT [39] and Null-text inversion [23]. However, the above methods introduce considerable computational overhead. In this paper, we propose a new technique, named bi-directional integration approximation (BDIA), to perform exact diffusion inversion with negligible computational overhead. Suppose we would like to estimate the next diffusion state $z_{i-1}$ at timestep $t_i$ with the historical information $(i, z_i)$ and $(i+1, z_{i+1})$. We first obtain the estimated Gaussian noise $\epsilon(z_i, i)$, and then apply the DDIM update procedure twice to approximate the ODE integration over the next time-slot $[t_i, t_{i-1}]$ in the forward manner and the previous time-slot $[t_i, t_{i+1}]$ in the backward manner. The DDIM step for the previous time-slot is used to refine the integration approximation made earlier when computing $z_i$. A nice property of BDIA-DDIM is that the update expression for $z_{i-1}$ is a linear combination of $(z_{i+1}, z_i, \epsilon(z_i, i))$. This allows for exact backward computation of $z_{i+1}$ given $(z_i, z_{i-1})$, thus leading to exact diffusion inversion. We perform a convergence analysis for BDIA-DDIM that includes the analysis for DDIM as a special case. Experiments demonstrate that BDIA-DDIM is effective for (round-trip) image editing and produces markedly better image sampling quality than DDIM and EDICT for text-to-image generation and conventional image sampling.
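
The key property claimed above can be written schematically. The coefficients below are placeholders standing in for the schedule-dependent weights derived in the paper, not the authors' exact notation; the point is only that the update is linear in its three inputs and therefore exactly invertible:

```latex
% Schematic BDIA-DDIM update with placeholder coefficients a_i, b_i, c_i:
z_{i-1} = a_i\, z_{i+1} + b_i\, z_i + c_i\, \epsilon_\theta(z_i, i)
% Exact inversion follows by solving the same linear relation for z_{i+1}:
z_{i+1} = \frac{1}{a_i}\left( z_{i-1} - b_i\, z_i - c_i\, \epsilon_\theta(z_i, i) \right)
```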


# 217
Strong Double Blind
ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

Shaozhe Hao · Kai Han · Zhengyao Lv · Shihao Zhao · Kwan-Yee K. Wong

While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts. Given an image that contains multiple concepts, the task aims to extract and recreate individual concepts solely relying on the existing knowledge from pretrained diffusion models. To achieve this, we present ConceptExpress that tackles UCE by unleashing the inherent capabilities of pretrained diffusion models in two aspects. Specifically, a concept localization approach automatically locates and disentangles salient concepts by leveraging spatial correspondence from diffusion self-attention; and based on the lookup association between a concept and a conceptual token, a concept-wise optimization process learns discriminative tokens that represent each individual concept. Finally, we establish an evaluation protocol tailored for the UCE task. Extensive experiments demonstrate that ConceptExpress is a promising solution to the UCE task.


# 160
Tackling Structural Hallucination in Image Translation with Local Diffusion

Seunghoi Kim · Chen Jin · Tom Diethe · Matteo Figini · Henry FJ Tregidgo · Asher Mullokandov · Philip A Teare · Daniel Alexander

Recent developments in diffusion models have advanced conditioned image generation, yet they struggle with reconstructing out-of-distribution (OOD) images, such as unseen tumors, causing 'image hallucination' and risking misdiagnosis. We hypothesize such hallucinations result from local OOD regions in the conditional images. By partitioning the OOD region and conducting separate generations, hallucinations can be alleviated, and we verify this with motivational studies in several applications. From this, we propose a training-free diffusion framework that reduces hallucination by performing multiple \textit{Local Diffusion} processes. Our approach involves OOD estimation followed by two diffusion modules: a 'branching' module for local image generations from OOD estimations, and a 'fusion' module to integrate these predictions into a full image cohesively. These modules adapt to each testing dataset by updating an auxiliary classifier. Our evaluation shows our method improves baseline models quantitatively and qualitatively across different datasets. It also works well with various pre-trained diffusion models as a plug-and-play option.


# 155
Adversarial Diffusion Distillation

Axel Sauer · Dominik Lorenz · Andreas Blattmann · Robin Rombach

We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1--4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models.


# 158
Pyramid Diffusion for Fine 3D Large Scene Generation

Yuheng Liu · Xinke Li · Xueting Li · Lu Qi · Chongshou Li · Ming-Hsuan Yang

Diffusion models have shown remarkable results in generating 2D images and small-scale 3D objects. However, their application to the synthesis of large-scale 3D scenes has been rarely explored. This is mainly due to the inherent complexity and bulky size of 3D scenery data, particularly outdoor scenes, and the limited availability of comprehensive real-world datasets, which makes training a stable scene diffusion model challenging. In this work, we explore how to effectively generate large-scale 3D scenes using the coarse-to-fine paradigm. We introduce a framework, the Pyramid Discrete Diffusion model (PDD), which employs scale-varied diffusion models to progressively generate high-quality outdoor scenes. Experimental results of PDD demonstrate our successful exploration in generating 3D scenes both unconditionally and conditionally. We further showcase the data compatibility of the PDD model, due to its multi-scale architecture: a PDD model trained on one dataset can be easily fine-tuned with another dataset. The source codes and trained models will be made available to the public.


# 119
Strong Double Blind
Controlling the World by Sleight of Hand

Sruthi Sudhakar · Ruoshi Liu · Basile Van Hoorick · Carl Vondrick · Richard Zemel

Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform fine-grained object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose learning to model interactions through a novel form of visual conditioning: hands. Hands are a natural way to specify control through actions such as grasping, pulling, pushing, etc. Given an input image and a representation of a hand interacting with the scene, our approach, CoSHAND, synthesizes a depiction of what the scene would look like after the interaction has occurred. We show that CoSHAND is able to recover the dynamics of manipulation by learning from large amounts of unlabeled videos of human hands interacting with objects, and leveraging internet-scale latent diffusion model priors. The model demonstrates strong capabilities on a variety of actions and object types beyond the dataset, and the ability to generate multiple possible futures depending on the actions performed. CoSHAND is also able to generalize zero-shot to tasks where the agent is a robot arm rather than a human hand. Our hand-conditioned model has several exciting applications in robotic planning and augmented or virtual reality.


# 4
Strong Double Blind
Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning

Jinglin Liang · Jin Zhong · Hanlin Gu · Zhongqi Lu · Xingxing Tang · Gang Dai · Shuangping Huang · Lixin Fan · Qiang Yang

Federated Class Continual Learning (FCCL) merges the challenges of distributed client learning with the need for seamless adaptation to new classes without forgetting old ones. The key challenge in FCCL is catastrophic forgetting, an issue that has been explored to some extent in Continual Learning (CL). However, due to privacy preservation requirements, some conventional methods, such as experience replay, are not directly applicable to FCCL. Existing FCCL methods mitigate forgetting by generating historical data through federated training of GANs or data-free knowledge distillation. However, these approaches often suffer from unstable training of generators or low-quality generated data, limiting their guidance for the model. To address this challenge, we propose a novel method of data replay based on diffusion models. Instead of training a diffusion model, we employ a pre-trained conditional diffusion model to reverse-engineer each category, searching for the corresponding input conditions within the model's input space, which significantly reduces computational resources and time consumption while ensuring effective generation. Furthermore, we enhance the classifier's domain generalization ability on generated and real data through contrastive learning, indirectly improving the representational capability of generated data for real data. Extensive experiments demonstrate that our method significantly outperforms existing baselines.


# 159
OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model

Runyi Li · Xuhan SHENG · Weiqi Li · Jian Zhang

Omnidirectional images (ODIs) are commonly used in real-world visual tasks, and high-resolution ODIs help improve the performance of related visual tasks. Most existing super-resolution methods for ODIs use end-to-end learning strategies, resulting in inferior realness of generated images and a lack of effective out-of-domain generalization capabilities in training methods. Image generation methods represented by diffusion models provide strong priors for visual tasks and have been proven to be effectively applied to image restoration tasks. Leveraging the image priors of the Stable Diffusion (SD) model, we achieve omnidirectional image super-resolution with both fidelity and realness, dubbed OmniSSR. Firstly, we transform the equirectangular projection (ERP) images into tangent projection (TP) images, whose distribution approximates the planar image domain. Then, we use SD to iteratively sample initial high-resolution results. At each denoising iteration, we further correct and update the initial results using the proposed Octadecaplex Tangent Information Interaction (OTII) and Gradient Decomposition (GD) technique to ensure better consistency. Finally, the TP images are transformed back to obtain the final high-resolution results. Our method is zero-shot, requiring no training or fine-tuning. Experiments on two benchmark datasets demonstrate the effectiveness of our proposed method.


# 148
MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment

Kanglei Zhou · Liyuan Wang · Xingxing Zhang · Hubert P. H. Shum · Frederick W. B. Li · Jianguo Li · Xiaohui Liang

Action Quality Assessment (AQA) evaluates diverse skills but models struggle with non-stationary data. We propose Continual AQA (CAQA) to refine models using sparse new data. Feature replay preserves memory without storing raw inputs. However, the misalignment between static old features and the dynamically changing feature manifold causes severe catastrophic forgetting. To address this novel problem, we propose Manifold-Aligned Graph Regularization (MAGR), which first aligns deviated old features to the current feature manifold, ensuring representation consistency. It then constructs a graph jointly arranging old and new features aligned with quality scores. Experiments show MAGR outperforms recent strong baselines, with correlation gains of up to 6.56%, 5.66%, 15.64%, and 9.05% on the MTL-AQA, FineDiving, UNLV-Dive, and JDM-MSA split datasets, respectively. This validates MAGR for continual assessment challenges arising from non-stationary skill variations.


# 151
Strong Double Blind
C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

Rongchang Li · Zhenhua Feng · Tianyang Xu · Linze Li · Xiao-Jun Wu · Muhammad Awais · Sara Atito · Josef Kittler

Compositional actions consist of dynamic (verbs) and static (objects) concepts. Humans can easily recognize unseen compositions using the learned concepts. For machines, solving such a problem requires a model to recognize unseen actions composed of previously observed verbs and objects, thus requiring so-called compositional generalization ability. To facilitate this research, we propose a novel Zero-Shot Compositional Action Recognition (ZS-CAR) task. For evaluating the task, we construct a new benchmark, Something-composition (Sth-com), based on the widely used Something-Something V2 dataset. We also propose a novel Component-to-Composition (C2C) learning method to solve the new ZS-CAR task. C2C includes an independent component learning module and a composition inference module. Lastly, we devise an enhanced training strategy to address the challenges of component variation between seen and unseen compositions and to handle the subtle balance between learning seen and unseen actions. The experimental results demonstrate that the proposed framework significantly surpasses the existing compositional generalization methods and sets a new state-of-the-art. The new Sth-com benchmark and code are available at https://anonymous.4open.science/r/C2C_anonymous-51F1.


# 150
Strong Double Blind
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

Mohaiminul Islam · Tushar Nagarajan · Huiyu Wang · FU-JEN CHU · Kris Kitani · Gedas Bertasius · Xitong Yang

Goal-oriented planning, or anticipating a series of actions that transition an agent from its current state to a predefined objective, is crucial for developing intelligent assistants aiding users in daily procedural tasks. The problem presents significant challenges due to the need for comprehensive knowledge of temporal and hierarchical task structures, as well as strong capabilities in reasoning and planning. To achieve this, prior work typically relies on extensive training on the target dataset, which often results in significant dataset bias and a lack of generalization to unseen tasks. In this work, we introduce VidAssist, an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos. VidAssist leverages large language models (LLMs) as both the knowledge base and the assessment tool for generating and evaluating action plans, thus overcoming the challenges of acquiring procedural knowledge from small-scale, low-diversity datasets. Moreover, VidAssist employs a breadth-first search algorithm for optimal plan generation, in which a composite of value functions designed for goal-oriented planning is utilized to assess the predicted actions at each step. Extensive experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups, e.g., visual planning for assistance (VPA) and procedural planning (PP), and achieves remarkable performance in zero-shot and few-shot setups. Specifically, on the COIN dataset, few-shot VidAssist outperforms the prior state-of-the-art fully-supervised method by +7.7% success rate in the VPA task and +4.81% in the PP task for planning horizon 4.
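
As a rough illustration of the search procedure described above, the sketch below performs a beam-limited breadth-first expansion of candidate plans, scoring each step with an LLM-based value. The helpers `propose_actions` and `score_action` are hypothetical stand-ins for VidAssist's prompting and composite value functions, not the paper's implementation:

```python
def propose_actions(llm, goal, history, k=5):
    """Hypothetical helper: ask an LLM for k candidate next actions given the goal and plan so far."""
    prompt = f"Goal: {goal}\nSteps so far: {history}\nList {k} plausible next actions."
    return llm(prompt)[:k]            # assumed to return a list of action strings

def score_action(llm, goal, history, action):
    """Hypothetical composite value function (e.g., feasibility plus progress toward the goal)."""
    prompt = f"Goal: {goal}\nPlan: {history + [action]}\nRate this plan from 0 to 1."
    return float(llm(prompt))         # assumed to return a numeric rating as text

def bfs_plan(llm, goal, horizon=4, beam=5):
    """Breadth-first expansion of candidate plans, keeping the best-scoring partial plans."""
    frontier = [([], 0.0)]            # (partial plan, cumulative score)
    for _ in range(horizon):
        expanded = []
        for plan, score in frontier:
            for action in propose_actions(llm, goal, plan):
                expanded.append((plan + [action],
                                 score + score_action(llm, goal, plan, action)))
        # keep only the top-scoring partial plans so the search stays tractable
        frontier = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam]
    return max(frontier, key=lambda x: x[1])[0]
```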


# 145
Towards Neuro-Symbolic Video Understanding

Minkyu Choi · Harsh Goel · Mohammad Omama · Yunhao Yang · Sahil Shah · Sandeep Chinchali

The unprecedented surge in video data production in recent years necessitates efficient tools for extracting meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models, like VideoLLaMA and ViCLIP, are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for their failure is that they intertwine per-frame perception and temporal reasoning into a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT4 for reasoning on state-of-the-art self-driving datasets such as Waymo and NuScenes.
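
To make the decoupling concrete, here is a minimal sketch of finite-trace temporal-logic operators evaluated over per-frame boolean predicates; in the system described above those predicates would come from a vision-language model, whereas here they are plain Python lists, and the example event is invented for illustration:

```python
# Finite-trace temporal-logic operators over per-frame boolean predicates.

def eventually(p):                # F p : p holds at some frame from here on
    return [any(p[i:]) for i in range(len(p))]

def always(p):                    # G p : p holds at every frame from here on
    return [all(p[i:]) for i in range(len(p))]

def until(p, q):                  # p U q : q eventually holds, and p holds at every frame before that
    n = len(p)
    return [any(q[j] and all(p[i:j]) for j in range(i, n)) for i in range(n)]

# Invented example: "a pedestrian is visible until the car has stopped"
pedestrian  = [False, True, True, True, False]
car_stopped = [False, False, False, True, True]
print(until(pedestrian, car_stopped))   # [False, True, True, True, True]
```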


# 153
DEVIAS: Learning Disentangled Video Representations of Action and Scene

Kyungho Bae · Youngrae Kim · Geo Ahn · Jinwoo Choi

Video recognition models often learn scene-biased action representations due to the spurious correlation between actions and scenes in the training data. Such models show poor performance when the test data consists of videos with unseen action-scene combinations. Although scene-debiased action recognition models might address the issue, they often overlook valuable scene information in the data. To address this challenge, we propose to learn Disentangled VIdeo representations of Action and Scene (DEVIAS) for more holistic video understanding. We propose an encoder-decoder architecture to learn disentangled action and scene representations with a single model. The architecture consists of a disentangling encoder (DE), an action mask decoder (AMD), and a prediction head. The key to achieving the disentanglement is employing both DE and AMD during training time. The DE uses the slot attention mechanism to learn disentangled action and scene representations. For further disentanglement, the AMD learns to predict action masks given an action slot. With the resulting disentangled representations, we can achieve robust performance across diverse scenarios, including both seen and unseen action-scene combinations. We rigorously validate the proposed method on the UCF-101, Kinetics-400, and HVU datasets for seen, and the SCUBA, HAT, and HVU datasets for unseen, action-scene combination scenarios. Furthermore, DEVIAS provides flexibility to adjust the emphasis on action or scene information depending on dataset characteristics for downstream tasks. DEVIAS shows favorable performance in various downstream tasks: Diving48, Something-Something-V2, UCF-101, and ActivityNet. We plan to release the code upon acceptance to facilitate future research.


# 147
Strong Double Blind
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets

Ishan Rajendrakumar Dave · Fabian Caba · Shah Mubarak · Simon Jenni

Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets.


# 149
Strong Double Blind
E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

Peijun Bao · Zihao Shao · Wenhan Yang · Boon Poh Ng · Alex Kot

Spatio-temporal video grounding aims to localize the spatio-temporal tube in a video according to the given language query. To eliminate the annotation costs, we make a first exploration to tackle spatio-temporal video grounding in a zero-shot manner. Our method dispenses with the need for any training videos or annotations; instead, it localizes the target object by leveraging large visual-language models and optimizing within the video and text query during the test time. To enable spatio-temporal comprehension, we introduce a multimodal modulation that integrates the spatio-temporal context into both visual and textual representation. On the visual side, we devise a context-based visual modulation that amplifies the visual representation by propagation and aggregation of the contextual semantics. Concurrently, on the textual front, we propose a prototype-based textual modulation to refine the textual representations using visual prototypes, effectively mitigating the cross-modal discrepancy. In addition, to overcome the interleaved spatio-temporal dilemma, we propose an expectation maximization (EM) framework to optimize the process of temporal relevance estimation and spatial region identification in an alternating way. Comprehensive experiments validate that our zero-shot approach achieves superior performance in comparison to several state-of-the-art methods with stronger supervision. We will make our code publicly accessible online.


# 157
Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

Remy Sabathier · David Novotny · Niloy Mitra

We present a method to build animatable dog avatars from monocular videos. This is challenging as animals display a range of (unpredictable) non-rigid movements and have a variety of appearance details (e.g., fur, spots, tails). We develop an approach that links the video frames via a 4D solution that jointly solves for the animal's pose variation and its appearance (in a canonical pose). To this end, we significantly improve the quality of template-based shape fitting by endowing the SMAL parametric model with Continuous Surface Embeddings (CSE), which brings image-to-mesh reprojection constraints that are denser, and thus stronger, than the previously used sparse semantic keypoint correspondences. To model appearance, we propose an implicit duplex-mesh texture that is defined in the canonical pose, but can be deformed using SMAL pose coefficients and later rendered to enforce photometric compatibility with the input video frames. On the challenging CoP3D and APTv2 datasets, we demonstrate superior results (both in terms of pose estimates and predicted appearance) to existing template-free (RAC) and template-based (BARC, BITE) approaches.


# 154
LongVLM: Efficient Long Video Understanding via Large Language Models

Yuetian Weng · Mingfei Han · Haoyu He · Xiaojun Chang · Bohan Zhuang

Empowered by Large Language Models (LLMs), recent advancements in VideoLLMs have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding of videos due to overlooking local information in long-term videos. To tackle this challenge, we introduce LongVLM, a straightforward yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach decomposes long videos into multiple short-term segments and encodes local features for each segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we propose to integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over previous state-of-the-art methods. Qualitative examples demonstrate that our model produces more precise responses for long video understanding.


# 117
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Charig Yang · Weidi Xie · Andrew ZISSERMAN

Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with `time' serving as a supervisory signal since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple video settings covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state of the art on standard benchmarks for ordering a set of images.
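
A minimal sketch of the ordering proxy task, assuming pre-extracted frame features and a vanilla transformer encoder; the architecture and feature dimensions are placeholders rather than the paper's model:

```python
import torch
import torch.nn as nn

class OrderingTransformer(nn.Module):
    """Sketch: predict the original index of each frame in a shuffled sequence."""
    def __init__(self, feat_dim=512, max_len=16, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(feat_dim, max_len)        # classify each token over possible positions

    def forward(self, frame_feats):                     # (B, N, feat_dim), shuffled order
        return self.head(self.encoder(frame_feats))     # (B, N, max_len) position logits

# 'Time' is the supervisory signal: the target for each shuffled frame is its original index.
model = OrderingTransformer()
feats = torch.randn(2, 16, 512)                         # frame features from any visual backbone
perm = torch.stack([torch.randperm(16) for _ in range(2)])
shuffled = torch.gather(feats, 1, perm.unsqueeze(-1).expand(-1, -1, 512))
loss = nn.functional.cross_entropy(model(shuffled).reshape(-1, 16), perm.reshape(-1))
loss.backward()
```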


# 199
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

Renjie Pi · Tianyang Han · Wei Xiong · Jipeng ZHANG · Runtao Liu · Rui Pan · Tong Zhang

Multimodal Large Language Models (MLLMs) excel in generating responses based on visual inputs. However, they often suffer from a bias towards generating responses similar to their pretraining corpus, overshadowing the importance of visual information. We treat this bias as a "preference" for pretraining statistics, which hinders the model's grounding in visual input. To mitigate this issue, we propose Bootstrapped Preference Optimization (BPO), which conducts preference learning with datasets containing negative responses bootstrapped from the model itself. Specifically, we propose the following two strategies: 1) using distorted image inputs to the MLLM to elicit responses that contain significant pretraining bias; 2) leveraging a text-based LLM to explicitly inject erroneous but common elements into the original response. These undesirable responses are paired with the original annotated responses from the datasets to construct the preference dataset, which is subsequently utilized to perform preference learning. Our approach effectively suppresses pretrained LLM bias, enabling enhanced grounding in visual inputs. Extensive experimentation demonstrates significant performance improvements across multiple benchmarks, advancing the state-of-the-art in multimodal conversational systems.
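
A minimal sketch of how the two bootstrapping strategies could produce a preference dataset; `mllm_generate`, `llm_corrupt`, and `distort` are hypothetical callables standing in for the MLLM, the text-based LLM, and the image distortion step:

```python
import random

def build_preference_dataset(samples, mllm_generate, llm_corrupt, distort):
    """samples: dicts with 'image', 'question', 'answer' (the annotated, preferred response).
    The three callables are assumptions: an MLLM responder, a text-only LLM that injects
    plausible-but-wrong elements, and an image distortion (e.g., heavy noise or blur)."""
    pairs = []
    for s in samples:
        if random.random() < 0.5:
            # Strategy 1: answer a distorted image, so the response leans on pretraining bias
            rejected = mllm_generate(distort(s["image"]), s["question"])
        else:
            # Strategy 2: corrupt the gold answer with common but erroneous elements
            rejected = llm_corrupt(s["answer"])
        pairs.append({"image": s["image"], "prompt": s["question"],
                      "chosen": s["answer"], "rejected": rejected})
    return pairs   # usable with a DPO-style preference-learning objective
```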


# 183
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars

Ronglai Zuo · Fangyun Wei · Zenggui Chen · Brian Mak · Jiaolong Yang · Xin Tong

The objective of this paper is to develop a functional system for translating spoken languages into sign languages, referred to as Spoken2Sign translation. The Spoken2Sign task is orthogonal and complementary to traditional sign language to spoken language (Sign2Spoken) translation. To enable Spoken2Sign translation, we present a simple baseline consisting of three steps: 1) creating a gloss-video dictionary using existing Sign2Spoken benchmarks; 2) estimating a 3D sign for each sign video in the dictionary; 3) training a Spoken2Sign model, which is composed of a Text2Gloss translator, a sign connector, and a rendering module, with the aid of the yielded gloss-3D sign dictionary. The translation results are then displayed through a sign avatar. As far as we know, we are the first to present the Spoken2Sign task in an output format of 3D signs. In addition to its capability of Spoken2Sign translation, we also demonstrate that two by-products of our approach—3D keypoint augmentation and multi-view understanding—can assist in keypoint-based sign language understanding. Code and models will be released to facilitate future research.


# 172
Strong Double Blind
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

Chen Ju · Haicheng Wang · Haozhe Cheng · Xu Chen · Zhonghua Zhai · Weilin Huang · Jinsong Lan · Shuai Xiao · Bo Zheng

Vision-Language Large Models (VLMs) have recently become a primary backbone of AI due to their impressive performance. However, their expensive computation costs, i.e., throughput and delay, impede their potential in real-world scenarios. To accelerate VLMs, most existing methods focus on the model perspective: pruning, distillation, and quantization, but completely overlook the data-perspective redundancy. To fill this gap, this paper highlights the severity of data redundancy and designs a plug-and-play Turbo module, guided by an information degree, to prune inefficient tokens from visual or textual data. In pursuit of efficiency-performance trade-offs, the information degree takes two crucial factors into consideration: mutual redundancy and semantic value. Concretely, the former evaluates data duplication between sequential tokens, while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with a high information degree carry less redundancy and stronger semantics. For VLM computation, Turbo works as a user-friendly plug-in that sorts data according to the information degree, using only the top-ranked tokens to save costs. Its advantages are multifaceted: it is generally compatible with various VLMs across understanding and generation, simple to use without retraining, and requires only trivial engineering effort. On multiple VLM benchmarks, extensive experiments reveal the good acceleration of Turbo under a negligible performance drop.
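
A rough sketch of token pruning driven by an information degree, with semantic value taken as similarity to a global summary token and mutual redundancy as similarity to the preceding token; the weighting and the choice of similarity are assumptions, not Turbo's exact formulation:

```python
import torch
import torch.nn.functional as F

def information_degree(tokens, cls_token, alpha=1.0, beta=1.0):
    """tokens: (N, D) token features; cls_token: (D,) global summary token."""
    t = F.normalize(tokens, dim=-1)
    semantic = t @ F.normalize(cls_token, dim=-1)       # (N,) contribution to overall semantics
    redundancy = torch.zeros_like(semantic)
    redundancy[1:] = (t[1:] * t[:-1]).sum(-1)           # cosine similarity to the previous token
    return beta * semantic - alpha * redundancy         # high = informative, low = redundant

def prune_tokens(tokens, cls_token, keep_ratio=0.5):
    """Keep only the top-ranked tokens before the VLM processes them."""
    score = information_degree(tokens, cls_token)
    k = max(1, int(keep_ratio * tokens.shape[0]))
    idx = score.topk(k).indices.sort().values           # preserve the original token order
    return tokens[idx]
```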


# 146
Strong Double Blind
Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

Zikai Huang · Xuemiao Xu · Cheng Xu · Huaidong Zhang · Chenxi Zheng · Jing Qin · Shengfeng He

Dance, as an art form, fundamentally hinges on the precise synchronization with musical beats. However, achieving aesthetically pleasing dance sequences from music is challenging, with existing methods often falling short in controllability and beat alignment. To address these shortcomings, this paper introduces Beat-It, a novel framework for beat-specific, key pose-guided dance generation. Unlike prior approaches, Beat-It uniquely integrates explicit beat awareness and key pose guidance, effectively resolving two main issues: the misalignment of generated dance motions with musical beats, and the inability to map key poses to specific beats, critical for practical choreography. Our approach disentangles beat conditions from music using a nearest beat distance representation and employs a hierarchical multi-condition fusion mechanism. This mechanism seamlessly integrates key poses, beats, and music features, mitigating condition conflicts and offering rich, multi-conditioned guidance for dance generation. Additionally, a specially designed beat alignment loss ensures the generated dance movements remain in sync with the designated beats. Extensive experiments confirm Beat-It's superiority over existing state-of-the-art methods in terms of beat alignment and motion controllability. Qualitative results of our method can be found on our anonymous website.


# 190
BRAVE: Broadening the visual encoding of vision-language models

Oguzhan Fatih Kar · Alessio Tonioni · Petra Poklukar · Achin Kulshrestha · Amir Zamir · Federico Tombari

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. ``blindness'' to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a broader and more contextualized visual understanding of VLMs.
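
A deliberately simplified sketch of the plumbing implied above: features from several frozen vision encoders are projected into the LM's embedding space and concatenated as a visual prefix. The per-encoder linear projection is an assumption for illustration; the paper learns a more capable consolidation module:

```python
import torch
import torch.nn as nn

class MultiEncoderPrefix(nn.Module):
    """Sketch: map outputs of K frozen vision encoders to a single prefix for a frozen LM."""
    def __init__(self, encoder_dims, lm_dim):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, lm_dim) for d in encoder_dims)

    def forward(self, encoder_feats):            # list of (B, N_k, D_k) frozen-encoder outputs
        prefix = [p(f) for p, f in zip(self.proj, encoder_feats)]
        return torch.cat(prefix, dim=1)          # (B, sum N_k, lm_dim), prepended to text embeddings
```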


# 182
MMBENCH: Is Your Multi-Modal Model an All-around Player?

Yuan Liu · Haodong Duan · Yuanhan Zhang · Bo Li · Songyang Zhang · Wangbo Zhao · Yike Yuan · Jiaqi Wang · Conghui He · Ziwei Liu · Kai Chen · Dahua Lin

Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development in this domain. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of the following key features: 1. MMBench is meticulously curated with well-designed quality control schemes, surpassing existing similar benchmarks in terms of the number and variety of evaluation questions and abilities; 2. MMBench introduces a rigorous CircularEval strategy and incorporates large language models to convert free-form predictions into pre-defined choices, which helps to yield accurate evaluation results for models with limited instruction-following capabilities. 3. MMBench incorporates multiple-choice questions in both English and Chinese versions, enabling an apples-to-apples comparison of VLMs' performance under a bilingual context. To summarize, MMBench is a systematically designed objective benchmark for a robust and holistic evaluation of vision-language models. We hope MMBench will assist the research community in better evaluating their models and facilitate future progress in this area.
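
The CircularEval idea can be sketched as follows: a question counts as solved only if the model selects the correct option under every circular shift of the choice order. The `model_answer_fn` helper, assumed to return the index of the chosen option (e.g., after LLM-based answer extraction), is a hypothetical stand-in:

```python
def circular_eval(model_answer_fn, question, choices, correct_idx):
    """Return True only if the model is correct under all circular shifts of the options."""
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        target = (correct_idx - shift) % n       # where the correct option lands after rotation
        if model_answer_fn(question, rotated) != target:
            return False
    return True
```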


# 202
Strong Double Blind
uCAP: An Unsupervised Prompting Method for Vision-Language Models

A. Tuan Nguyen · Kai Sheng Tai · Bor-Chun Chen · Satya Narayan Shukla · Hanchao Yu · Philip Torr · Taipeng Tian · Ser-Nam Lim

This paper addresses a significant limitation that prevents Contrastive Language-Image Pretrained Models (CLIP) from achieving optimal performance on downstream image classification tasks. The key problem with CLIP-style zero-shot classification is that it requires domain-specific context in the form of prompts to better align the class descriptions to the downstream data distribution. In particular, prompts for vision-language models are domain-level texts (e.g., ``a centered satellite image of ...'') which, together with the class names, are fed into the text encoder to provide more context for the downstream dataset. These prompts are typically manually tuned, which is time consuming and often sub-optimal. To overcome this bottleneck, this paper proposes uCAP, a method to automatically learn domain-specific prompts/contexts using only unlabeled in-domain images. We achieve this by modeling the generation of images given the class names and a domain-specific prompt with an unsupervised likelihood distribution, and then performing inference of the prompts. We validate the proposed method across various models and datasets, showing that uCAP consistently outperforms manually tuned prompts and related baselines on the evaluated datasets: ImageNet, CIFAR-10, CIFAR-100, OxfordPets (up to 2\%), SUN397 (up to 5\%), and Caltech101 (up to 3\%).


# 211
HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

Wonjae Kim · Sanghyuk Chun · Taekyung Kim · Dongyoon Han · Sangdoo Yun

In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this, we introduce HYPerbolic Entailment filtering (HYPE), a novel methodology designed to meticulously extract modality-wise meaningful and well-aligned data from extensive, noisy image-text pair datasets. Our approach leverages hyperbolic embeddings and the concept of entailment cones to evaluate and filter out samples with meaningless or underspecified semantics, focusing on enhancing the specificity of each data sample. HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark when combined with existing filtering techniques. This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models. Additionally, the image specificity $\epsilon_{i}$ can be independently applied to induce an image-only dataset from an image-text or image-only data pool for training image-only self-supervised models; the resulting dataset shows superior performance compared to the dataset induced by the CLIP score.


# 168
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen · Haozhe Zhao · Tianyu Liu · Shuai Bai · Junyang Lin · Chang Zhou · Baobao Chang

In this study, we identify the inefficient attention phenomenon in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45\% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical value for the deployment of LVLMs on edge devices and in commercial models. Code will be released upon acceptance.
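
A simplified sketch of the kind of attention-based visual-token pruning described above: rank visual tokens by the attention they receive at an early layer and keep only the top fraction for deeper layers. The layer index, ratio, and indexing layout are illustrative assumptions, not FastV's exact implementation:

```python
import torch

def prune_visual_tokens(hidden_states, attn_weights, visual_slice, keep_ratio=0.5):
    """hidden_states: (B, L, D) sequence at an early layer
    attn_weights:  (B, H, L, L) attention weights of that layer
    visual_slice:  slice of positions holding visual tokens."""
    B, L, D = hidden_states.shape
    received = attn_weights.mean(dim=1).mean(dim=1)      # (B, L): attention each token receives
    vis_scores = received[:, visual_slice]               # scores of the visual tokens only
    k = max(1, int(keep_ratio * vis_scores.shape[1]))
    keep_vis = vis_scores.topk(k, dim=1).indices + visual_slice.start
    all_idx = torch.arange(L, device=hidden_states.device)
    non_vis = all_idx[(all_idx < visual_slice.start) | (all_idx >= visual_slice.stop)]
    kept = torch.cat([non_vis.expand(B, -1), keep_vis], dim=1).sort(dim=1).values
    return torch.gather(hidden_states, 1, kept.unsqueeze(-1).expand(-1, -1, D))
```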


# 169
GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang · Hao Tang · Li Jiang · Shaoshuai Shi · Muhammad Ferjad Naeem · Hongsheng LI · Bernt Schiele · Liwei Wang

This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks with only a vanilla ViT. Motivated by the universality of the Multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require specific modules, such as bounding box heads for detection and pixel decoders for segmentation, greatly hindering the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that empowers successful auto-regressive decoding to adeptly unify various visual tasks, from image-level understanding (e.g., captioning), through sparse perception (e.g., detection), to dense prediction (e.g., segmentation). Based on the above designs, the entire model is composed solely of a ViT, without any specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, our GiT establishes a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. This reflects a similar impact observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results over various tasks. Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code will be available.


# 191
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

Shouwei Ruan · Yinpeng Dong · Liu Hanqing · Yao Huang · Hang Su · Xingxing Wei

Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable success in computer vision and particularly demonstrated superior robustness to distribution shifts of 2D images. However, their robustness under 3D viewpoint variations is still limited, which can hinder the development of real-world applications. This paper successfully addresses this concern while keeping VLPs' original performance by breaking through two primary obstacles: 1) the scarcity of training data and 2) the suboptimal fine-tuning paradigms. To combat data scarcity, we build the Multi-View Caption (MVCap) dataset --- a comprehensive collection of over four million multi-view image-text pairs across more than 100K objects, providing more potential for VLP models to develop generalizable viewpoint-invariant representations. To address the limitations of existing paradigms in performance trade-offs and training efficiency, we design a novel fine-tuning framework named Omniview-Tuning (OVT). Specifically, OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy, which effectively aligns representations of identical objects from diverse viewpoints without causing overfitting. Additionally, OVT fine-tunes VLP models in a parameter-efficient manner, leading to minimal computational cost. Extensive experiments on various VLP models with different architectures validate that OVT significantly improves the models' resilience to viewpoint shifts and keeps the original performance, establishing a pioneering standard for boosting viewpoint invariance of VLP models.


# 268
Strong Double Blind
Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°

Yuxiao He · Yiyu Zhuang · Yanwen Wang · Yao Yao · Siyu Zhu · Xiaoyu Li · Qi Zhang · Xun Cao · Hao Zhu

Creating a 360° parametric model of a human head is a very challenging task. While recent advancements have demonstrated the efficacy of leveraging synthetic data for building such parametric head models, their performance remains inadequate in crucial areas such as expression-driven animation, hairstyle editing, and text-based modifications. In this paper, we build a dataset of artist-designed high-fidelity human heads and propose to create a novel 360° renderable parametric head model from it. Our scheme decouples the facial motion/shape and facial appearance, which are represented by a classic parametric 3D mesh model and an attached neural texture, respectively. We further propose a training method for decomposing hairstyle and facial appearance, allowing free swapping of hairstyles. A novel inversion fitting method is presented based on single-image input with high generalization and fidelity. To the best of our knowledge, our model is the first parametric 3D full-head model that achieves $360^\circ$ free-view synthesis, image-based fitting, appearance editing, and animation within a single model. Experiments show that facial motions and appearances are well disentangled in the parametric space, leading to SOTA performance in rendering and animating quality.


# 271
Tri^{2}-plane: Thinking Head Avatar via Feature Pyramid

Luchuan Song · Pinxin Liu · Lele Chen · Guojun Yin · Chenliang Xu

Recent years have witnessed considerable achievements in facial avatar reconstruction with neural volume rendering. Despite notable advancements, the reconstruction of complex and dynamic head movements from monocular videos still suffers from capturing and restoring fine-grained details. In this work, we propose a novel approach, named Tri^2-plane, for monocular photo-realistic volumetric head avatar reconstruction. Distinct from existing works that rely on a single tri-plane deformation field for dynamic facial modeling, the proposed Tri^2-plane leverages the principle of feature pyramids and three tri-planes with top-down lateral connections to improve details. It samples and renders facial details at multiple scales, transitioning from the entire face to specific local regions and then to even more refined sub-regions. Moreover, we incorporate a camera-based geometry-aware sliding window method as an augmentation in training, which improves robustness beyond the canonical space, with a particular improvement in cross-identity generation capabilities. Experimental outcomes indicate that Tri^2-plane not only surpasses existing methodologies but also achieves superior performance across quantitative and qualitative assessments.


# 277
Strong Double Blind
AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

Feichi Lu · Zijian Dong · Jie Song · Otmar Hilliges

Despite progress in human motion capture, existing multi-view methods often face challenges in estimating the 3D pose and shape of multiple closely interacting people. This difficulty arises from reliance on accurate 2D joint estimations, which are hard to obtain due to occlusions and body contact when people are in close interaction. To address this, we propose a novel method leveraging the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for direct optimization of 3D poses based on color and silhouette rendering loss, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape region of avatars to add penetration constraints. Moreover, both 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.


# 270
AnimateMe: 4D Facial Expressions via Diffusion Models

Dimitrios Gerogiannis · Foivos Paraperas Papantoniou · Rolandos Alexandros Potamias · Alexandros Lattas · Stylianos Moschoglou · Stylianos Ploumpis · Stefanos Zafeiriou

The field of photorealistic 3D avatar reconstruction and generation has garnered significant attention in recent years; however, animating such avatars remains challenging. Recent advances in diffusion models have notably enhanced the capabilities of generative models in 2D animation. In this work, we directly utilize these models within the 3D domain to achieve controllable and high-fidelity 4D facial animation. By integrating the strengths of diffusion processes and geometric deep learning, we employ Graph Neural Networks (GNNs) as denoising diffusion models in a novel approach, formulating the diffusion process directly on the mesh space and enabling the generation of 3D facial expressions. This facilitates the generation of facial deformations through a mesh-diffusion-based model. Additionally, to ensure temporal coherence in our animations, we propose a consistent noise sampling method. Through a series of quantitative and qualitative experiments, we show that the proposed method outperforms prior work in 4D expression synthesis by generating high-fidelity extreme expressions. Furthermore, we apply our method to textured 4D facial expression generation, implementing a straightforward extension that involves training on a large-scale textured 4D facial expression database.


# 318
Strong Double Blind
Real-data-driven 2000 FPS Color Video from Mosaicked Chromatic Spikes

Siqi Yang · Zhaojun Huang · Yakun Chang · Bin Fan · Zhaofei Yu · Boxin Shi

The spike camera continuously records scene radiance with high-speed, high dynamic range, and low data redundancy properties, as a promising replacement for frame-based high-speed cameras. Previous methods for reconstructing color videos from monochromatic spikes are constrained in capturing full-temporal color information due to their reliance on compensating colors from low-speed RGB frames. Applying a Bayer-pattern color filter array to the spike sensor yields mosaicked chromatic spikes, which complicates noise distribution in high-speed conditions. By validating that the noise of short-term frames follows a zero-mean distribution, we leverage this hypothesis to develop a self-supervised denoising module trained exclusively on real-world data. Although noise is reduced in short-term frames, the long-term accumulation of incident photons is still necessary to construct HDR frames. Therefore, we introduce a progressive warping module to generate pseudo long-term exposure frames. This approach effectively mitigates motion blur artifacts in high-speed conditions. Integrating these modules forms a real-data-driven reconstruction method for mosaicked chromatic spikes. Extensive experiments conducted on both synthetic and real-world data demonstrate that our approach is effective in reconstructing 2000FPS color HDR videos with significantly reduced noise and motion blur compared to existing methods.


# 321
Strong Double Blind
Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography

Kailai Zhou · Lijing Cai · Yibo Wang · Mengya Zhang · Bihan Wen · Qiu Shen · Xun Cao

The integration of miniaturized spectrometers into mobile devices offers new avenues for image quality enhancement and facilitates novel downstream tasks. However, the broader application of spectral sensors in mobile photography is hindered by the inherent complexity of spectral images and the constraints of spectral imaging capabilities. To overcome these challenges, we propose a joint RGB-Spectral decomposition model guided enhancement framework, which consists of two steps: joint decomposition and priors-guided enhancement. Firstly, we leverage the complementarity between RGB and Low-resolution Multi-Spectral Images (Lr-MSI) to predict shading, reflectance, and material semantic priors. Subsequently, these priors are seamlessly integrated into the established HDRNet to promote dynamic range enhancement, color mapping, and grid expert learning, respectively. Additionally, we construct a high-quality Mobile-Spec dataset to support our research, and our experiments validate the effectiveness of Lr-MSI in the tone enhancement task. This work aims to establish a solid foundation for advancing spectral vision in mobile photography.


# 311
Strong Double Blind
Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats

Mingyang Xie · Haoming Cai · Sachin Shah · Yiran Xu · Brandon Y Feng · Jia-Bin Huang · Christopher Metzler

We introduce a simple yet effective approach for separating transmitted and reflected light. Our key insight is that the powerful novel view synthesis capabilities provided by modern inverse rendering methods (e.g., Gaussian splatting) allow one to perform flash/no-flash reflection separation using {\em unpaired measurements}---this relaxation dramatically simplifies image acquisition over conventional paired flash/no-flash reflection separation methods. Through extensive real-world experiments, we demonstrate that our method, Flash-Splat, accurately reconstructs both transmitted and reflected scenes in 3D. Our method outperforms existing 3D reflection separation methods, which do not leverage illumination control, by a large margin.


# 317
Strong Double Blind
Self-Supervised Underwater Caustics Removal and Descattering via Deep Monocular SLAM

Jonathan Sauder · Devis TUIA

Underwater scenes are challenging for computer vision methods due to color degradation caused by the water column and detrimental lighting effects such as caustics caused by sunlight refracting on a wavy surface. These challenges impede the widespread use of computer vision tools that could aid in ecological surveying of underwater environments or in industrial applications. Existing algorithms for alleviating caustics and descattering the image to recover colors are often impractical to implement due to the need for ground-truth training data, the necessity for successful alignment of an image within a 3D scene, or other assumptions that are infeasible in practice. In this paper, we propose a solution to tackle these problems in underwater computer vision: our method is based on two neural networks: CausticsNet, for single-image caustics removal, and BackscatterNet, for backscatter removal. Both neural networks are trained using an objective formulated with the aid of self-supervised monocular SLAM on a collection of underwater videos. Thus, our method does not require any ground-truth color images or caustics labels, and corrects images in real-time. We experimentally demonstrate the fidelity of our caustics removal method, performing similarly to state-of-the-art supervised methods, and show that the color restoration and caustics removal lead to better downstream performance in Structure-from-Motion image keypoint matching than a wide range of methods.


# 309
Strong Double Blind
Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis

Qian Chen · Shihao Shu · Xiangzhi Bai

Novel-view synthesis based on visible light has been extensively studied. In comparison to visible light imaging, thermal infrared imaging offers the advantage of all-weather imaging and strong penetration, providing increased possibilities for reconstruction in nighttime and adverse weather scenarios. However, thermal infrared imaging is influenced by physical characteristics such as atmospheric transmission effects and thermal conduction, hindering the precise reconstruction of intricate details in thermal infrared scenes, manifesting as issues of floaters and indistinct edge features in synthesized images. To address these limitations, this paper introduces a physics-induced 3D Gaussian splatting method named Thermal3D-GS. Thermal3D-GS begins by modeling atmospheric transmission effects and thermal conduction in three-dimensional media using neural networks. Additionally, a temperature consistency constraint is incorporated into the optimization objective to enhance the reconstruction accuracy of thermal infrared images. Furthermore, to validate the effectiveness of our method, the first large-scale benchmark dataset for this field named Thermal Infrared Novel-view Synthesis Dataset (TI-NSD) is created. This dataset comprises 20 authentic thermal infrared video scenes, covering indoor, outdoor, and UAV (Unmanned Aerial Vehicle) scenarios, totaling 6,664 frames of thermal infrared image data. Based on this dataset, this paper experimentally verifies the effectiveness of Thermal3D-GS. The results indicate that our method outperforms the baseline method with a 3.03 dB improvement in PSNR and significantly addresses the issues of floaters and indistinct edge features present in the baseline method. Our dataset and codebase will be released soon.


# 306
Strong Double Blind
Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending

Delong Wu · Hao Zhu · Qi Zhang · You Li · Xun Cao · Zhan Ma

Implicit Neural Representation (INR) has become a popular method for representing visual signals (e.g., 2D images and 3D scenes), showing promising results in various downstream applications. Given its potential as a medium for visual signals, exploring the development of a neural blending method that utilizes INRs is a natural progression. Neural blending involves merging two INRs to create a new INR that encapsulates information from both original representations. A direct approach involves applying traditional image editing methods to the INR rendering process. However, this method often results in blending distortions, artifacts, and color shifts, primarily due to the discrete nature of the underlying pixel grid and the introduction of boundary conditions for solving variational problems. To tackle this issue, we introduce the Neural Poisson Solver, a plug-and-play and universally applicable framework across different signal dimensions for blending visual signals represented by INRs. Our Neural Poisson Solver offers a variational problem-solving approach based on the continuous Poisson equation, which has demonstrated exceptional performance across various domains. Specifically, we propose a gradient-guided neural solver to represent the solution process of the variational problem, refining the target signal to achieve natural blending results. We also develop a Poisson equation-based loss and optimization scheme to train our solver, ensuring it effectively blends the input INR scenes while preserving their inherent structure and semantic content. Our method's independence from additional prior knowledge allows for easy adaptation across different task categories, underscoring its remarkable versatility. Extensive experiments demonstrate our method's robust capabilities across various dimensions and blending tasks.
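
To make the continuous formulation concrete, the sketch below shows a Poisson-style blending objective on a coordinate MLP: gradients of the blended field are matched to the source field inside the blend region, while values are pinned to the destination field on its boundary. This is a minimal PyTorch illustration, not the authors' solver; the grayscale signal, network sizes, and boundary weight are assumptions.

```python
# Minimal sketch of a Poisson-style blending loss on a continuous INR
# (hypothetical names; simplified to a scalar/grayscale 2D signal).
import torch
import torch.nn as nn

class INR(nn.Module):
    """Tiny coordinate MLP f: R^2 -> R representing a grayscale signal."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, x):
        return self.net(x)

def spatial_grad(f, x):
    """Gradient of the scalar field f at coordinates x of shape (N, 2)."""
    x = x.requires_grad_(True)
    y = f(x)
    g, = torch.autograd.grad(y.sum(), x, create_graph=True)
    return g  # (N, 2)

def poisson_blend_loss(f_blend, f_src, f_dst, x_in, x_bnd):
    """Match source gradients inside the blend region, destination values on its boundary."""
    grad_term = (spatial_grad(f_blend, x_in) - spatial_grad(f_src, x_in)).pow(2).mean()
    bnd_term = (f_blend(x_bnd) - f_dst(x_bnd)).pow(2).mean()
    return grad_term + 10.0 * bnd_term  # boundary weight is an arbitrary choice

# Usage: sample coordinates inside the blend region and on its boundary,
# then optimise f_blend with Adam against this loss.
f_src, f_dst, f_blend = INR(), INR(), INR()
x_in, x_bnd = torch.rand(1024, 2), torch.rand(256, 2)
loss = poisson_blend_loss(f_blend, f_src, f_dst, x_in, x_bnd)
loss.backward()
```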


# 304
Strong Double Blind
UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation

Shuang Wu · Songlin Tang · Guangming Lu · Jianzhuang Liu · Wenjie Pei

Typical inverse rendering methods focus on learning implicit neural scene representations by modeling the geometry, materials and illumination separately, which entails significant computations for optimization. In this work we design a Unified Voxelization framework for explicit learning of scene representations, dubbed UniVoxel, which allows for efficient modeling of the geometry, materials and illumination jointly, thereby accelerating the inverse rendering significantly. To be specific, we propose to encode a scene into a latent volumetric representation, based on which the geometry, materials and illumination can be readily learned via lightweight neural networks in a unified manner. In particular, an essential design of UniVoxel is to leverage local Spherical Gaussians to represent the incident light radiance, which enables the seamless integration of modeling illumination into the unified voxelization framework. This design enables our UniVoxel to model the joint effects of direct lighting, indirect lighting and light visibility efficiently without expensive multi-bounce ray tracing. Extensive experiments on multiple benchmarks covering diverse scenes demonstrate that UniVoxel boosts the optimization efficiency significantly compared to other methods, reducing the per-scene training time from hours to 18 minutes, while achieving favorable reconstruction quality. Code will be released.


# 292
City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web

Kaiwen Song · Xiaoyi Zeng · Chenqu Ren · Juyong Zhang

Existing neural radiance field-based methods can achieve real-time rendering of small scenes on the web platform. However, extending these methods to large-scale scenes still poses significant challenges due to limited resources in computation, memory, and bandwidth. In this paper, we propose City-on-Web, the first method for real-time rendering of large-scale scenes on the web. We propose a block-based volume rendering method to guarantee 3D consistency and correct occlusion between blocks, and introduce a Level-of-Detail strategy combined with dynamic loading/unloading of resources to significantly reduce memory demands. Our system achieves real-time rendering of large-scale scenes at approximately 32 FPS on an RTX 3060 GPU on the web and maintains rendering quality comparable to current state-of-the-art novel view synthesis methods.
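
As a rough illustration of how per-block results can be combined with correct occlusion, the sketch below alpha-composites block outputs front-to-back along each ray using accumulated transmittance. The interface (alpha-premultiplied per-block color plus per-block opacity) is an assumption for illustration, not the City-on-Web implementation.

```python
# Minimal sketch of compositing per-block ray segments front-to-back.
import torch

def composite_blocks(block_outputs):
    """
    block_outputs: list of (rgb, alpha) pairs, one per block, ordered
    front-to-back along the ray. rgb is the block's alpha-premultiplied
    segment color (N_rays, 3); alpha is the block's opacity (N_rays, 1).
    Standard "over" compositing guarantees correct occlusion across blocks.
    """
    rgb = torch.zeros_like(block_outputs[0][0])
    transmittance = torch.ones_like(block_outputs[0][1])
    for block_rgb, block_alpha in block_outputs:
        rgb = rgb + transmittance * block_rgb              # add what this block contributes
        transmittance = transmittance * (1.0 - block_alpha)  # light left after the block
    return rgb
```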


# 302
Strong Double Blind
Few-shot NeRF by Adaptive Rendering Loss Regularization

Qingshan Xu · Xuanyu Yi · Jianyao Xu · Wenbing Tao · Yew Soon Ong · Hanwang Zhang

Novel view synthesis with sparse inputs poses great challenges to Neural Radiance Field (NeRF). Recent works demonstrate that the frequency regularization of Positional Encoding (PE) can achieve promising results for this task. In this work, we reveal that there exists an inconsistency between the frequency regularization of PE and rendering loss. This prevents few-shot NeRF from synthesizing higher-quality novel views. To mitigate this inconsistency, we propose Adaptive Rendering loss regularization for few-shot NeRF, dubbed AR-NeRF. Specifically, we present a two-phase rendering supervision and an adaptive rendering loss weight learning strategy to align the frequency relationship between PE and 2D-pixel supervision. In this way, AR-NeRF can learn global structures better in the early training phase and adaptively learn local details throughout the training process. Extensive experiments show that our AR-NeRF achieves state-of-the-art performance on different datasets, including object-level and complex scenes. Our code will be released upon publication.


# 308
BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting

Lingzhe Zhao · Peng Wang · Peidong Liu

While neural rendering has demonstrated impressive capabilities in 3D scene reconstruction and novel view synthesis, it heavily relies on high-quality sharp images and accurate camera poses. Numerous approaches have been proposed to train Neural Radiance Fields (NeRF) with motion-blurred images, commonly encountered in real-world scenarios such as low-light or long-exposure conditions. However, the implicit representation of NeRF struggles to accurately recover intricate details from severely motion-blurred images and cannot achieve real-time rendering. In contrast, recent advancements in 3D Gaussian Splatting achieve high-quality 3D scene reconstruction and real-time rendering by explicitly optimizing point clouds as Gaussian spheres. In this paper, we introduce a novel approach, named BAD-Gaussians (Bundle Adjusted Deblur Gaussian Splatting), which leverages explicit Gaussian representation and handles severe motion-blurred images with inaccurate camera poses to achieve high-quality scene reconstruction. Our method models the physical image formation process of motion-blurred images and jointly learns the parameters of Gaussians while recovering camera motion trajectories during exposure time. In our experiments, we demonstrate that BAD-Gaussians not only achieves superior rendering quality compared to previous state-of-the-art deblur neural rendering methods on both synthetic and real datasets but also enables real-time rendering capabilities.
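
The image formation model commonly used for this kind of deblurring can be sketched as averaging sharp renders at virtual poses sampled along the exposure trajectory. The helper below is a hypothetical illustration (the renderer is passed in as a callable, and a naive linear pose interpolation stands in for proper SE(3) interpolation), not the paper's code.

```python
# Minimal sketch of a motion-blur image formation model for deblurring
# neural rendering: average sharp renders along the exposure trajectory.
import torch

def lerp_pose(pose_a, pose_b, t):
    """Naive linear interpolation of pose parameters (a real implementation
    would interpolate on SE(3) / the Lie algebra)."""
    return (1.0 - t) * pose_a + t * pose_b

def synthesize_blurred(render_gaussians, gaussians, pose_start, pose_end, n_virtual=10):
    """Approximate the blurred image as the mean of renders at virtual poses
    interpolated between the (learnable) start and end poses of the exposure."""
    renders = []
    for t in torch.linspace(0.0, 1.0, n_virtual):
        pose_t = lerp_pose(pose_start, pose_end, t)
        renders.append(render_gaussians(gaussians, pose_t))
    blurred = torch.stack(renders, dim=0).mean(dim=0)
    return blurred  # compared against the captured blurry image with a photometric loss
```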


# 301
Strong Double Blind
Generalizable Human Gaussians for Sparse View Synthesis

Youngjoong Kwon · Baole Fang · Yixing Lu · Haoye Dong · Cheng Zhang · Francisco Vicente Carrasco · Albert Mosella-Montoro · Jianjin Xu · Shingo J Takagi · Daeil Kim · Aayush Prakash · Fernando de la Torre

Recent progress in neural rendering has brought forth pioneering methods, such as NeRF and Gaussian Splatting, which revolutionize view rendering across various domains like AR/VR, gaming, and content creation. While these methods excel at interpolating within the training data, the challenge of generalizing to new scenes and objects from very sparse views persists. Specifically, modeling 3D humans from sparse views presents formidable hurdles due to the inherent complexity of human geometry, resulting in inaccurate reconstructions of geometry and textures. To tackle this challenge, this paper leverages recent advancements in Gaussian splatting and introduces a new method to learn generalizable human Gaussians that allows photorealistic and accurate view rendering of a new human subject from a limited set of sparse views in a feed-forward manner. A pivotal innovation of our approach involves reformulating the learning of 3D Gaussian parameters into a regression process defined on the 2D UV space of a human template, which allows leveraging the strong geometry prior and the advantages of 2D convolutions. Our method outperforms recent methods on both within-dataset generalization as well as cross-dataset generalization settings.


# 303
Strong Double Blind
Invertible Neural Warp for NeRF

Shin-Fang Chng · Ravi Garg · Hemanth Saratchandran · Simon Lucey

This paper tackles the simultaneous optimization of pose and Neural Radiance Fields (NeRF). Departing from the conventional practice of using explicit global representations for camera pose, we introduce the Invertible Neural Warp, a novel overparameterized representation for camera pose. Specifically, we establish that invertibility is a crucial property when using Multi-layer Perceptrons (MLPs) for ray transformation. This work proposes utilizing an Invertible Neural Network, coupled with a geometry-informed constraint to achieve enhanced optimization convergence. Experiments on the LLFF and DTU datasets demonstrate the effectiveness of our approach in terms of pose estimation and high-fidelity reconstruction compared to existing well-established approaches.


# 305
Strong Double Blind
PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects

Guangcheng Chen · Yicheng He · Li He · Hong Zhang

Neural implicit surface reconstruction has achieved remarkable progress recently. Despite resorting to complex radiance modeling, state-of-the-art methods still struggle with textureless and specular surfaces. Different from RGB images, polarization images can provide direct constraints on the azimuth angles of the surface normals. In this paper, we present PISR, a novel method that utilizes a geometrically accurate polarimetric loss to refine shape independently of appearance. In addition, PISR smooths surface normals in image space to eliminate severe shape distortions and leverages the hash-grid-based neural signed distance function to accelerate the reconstruction. Experimental results demonstrate that PISR achieves higher accuracy and robustness, with an L1 Chamfer distance of 0.5 mm and an F-score of 99.5% at 1 mm, while converging 4-30x faster than previous polarimetric surface reconstruction methods. The source code and dataset will be released upon acceptance of this paper.


# 298
Strong Double Blind
Improving Neural Surface Reconstruction with Feature Priors from Multi-View Images

Xinlin Ren · Chenjie Cao · Yanwei Fu · Xiangyang Xue

Recent advancements in Neural Surface Reconstruction (NSR) have significantly improved multi-view reconstruction when coupled with volume rendering. However, relying solely on photometric consistency in image space falls short in addressing complexities posed by real-world data, including occlusions and non-Lambertian surfaces. To tackle these challenges, we propose an investigation into feature-level consistent loss, aiming to harness valuable feature priors from diverse pretext visual tasks and overcome current limitations. It is crucial to note the existing gap in determining the most effective pretext visual task for enhancing NSR. In this study, we comprehensively explore multi-view feature priors from seven pretext visual tasks, comprising thirteen methods. Our main goal is to strengthen NSR training by considering a wide range of possibilities. Additionally, we examine the impact of varying feature resolutions and evaluate both pixel-wise and patch-wise consistent losses, providing insights into effective strategies for improving NSR performance. By incorporating pre-trained representations from MVSFormer and QuadTree, our approach can generate variations of MVS-NeuS and Match-NeuS, respectively. Our results, analyzed on DTU and EPFL datasets, reveal that feature priors from image matching and multi-view stereo outperform other pretext tasks. Moreover, we discover that extending patch-wise photometric consistency to the feature level surpasses the performance of pixel-wise approaches. These findings underscore the effectiveness of these techniques in enhancing NSR outcomes.


# 299
Strong Double Blind
SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization

Yiyang Chen · Siyan Dong · Xulong Wang · Lulu Cai · Youyi Zheng · Yanchao Yang

3D surface reconstruction from images is essential for numerous applications. Recently, Neural Radiance Fields (NeRFs) have emerged as a promising framework for 3D modeling. However, NeRFs require accurate camera poses as input, and existing methods struggle to handle significantly noisy pose estimates (i.e., outliers), which are commonly encountered in real-world scenarios. To tackle this challenge, we present a novel approach that optimizes radiance fields with scene graphs to mitigate the influence of outlier poses. Our method incorporates an adaptive inlier-outlier confidence estimation scheme based on scene graphs, emphasizing images of high compatibility with the neighborhood and consistency in the rendering quality. We also introduce an effective intersection-over-union (IoU) loss to optimize the camera pose and surface geometry, together with a coarse-to-fine strategy to facilitate the training. Furthermore, we propose a new dataset containing typical outlier poses for a detailed evaluation. Experimental results on various datasets consistently demonstrate the effectiveness and superiority of our method over existing approaches, showcasing its robustness in handling outliers and producing high-quality 3D reconstructions.


# 310
Gaussian in the wild: 3D Gaussian Splatting for Unconstrained Image Collections

Dongbin Zhang · Chuming Wang · Weitao Wang · Peihao Li · Minghan Qin · Haoqian Wang

Novel view synthesis from unconstrained in-the-wild images remains a meaningful but challenging task. The photometric variation and transient occluders in those unconstrained images make it difficult to reconstruct the original scene accurately. Previous approaches tackle the problem by introducing a global appearance feature in Neural Radiance Fields (NeRF). However, in the real world, the unique appearance of each tiny point in a scene is determined by its independent intrinsic material attributes and the varying environmental impacts it receives. Inspired by this fact, we propose Gaussian in the wild (GS-W), a method that uses 3D Gaussian points to reconstruct the scene and introduces separated intrinsic and dynamic appearance features for each point, capturing the unchanged scene appearance along with dynamic variations such as illumination and weather. Additionally, an adaptive sampling strategy is presented to allow each Gaussian point to focus on the local and detailed information more effectively. We also reduce the impact of transient occluders using a 2D visibility map. Extensive experiments demonstrate that GS-W achieves better reconstruction quality and finer details than previous methods, with a 1000x increase in rendering speed.
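
A minimal sketch of separating intrinsic and dynamic per-point appearance features might look like the module below, where a per-image environment embedding conditions the color prediction. All module names, feature dimensions, and the conditioning scheme are illustrative assumptions rather than the GS-W architecture.

```python
# Minimal sketch of per-point intrinsic vs. dynamic appearance features
# conditioned on a per-image environment embedding (illustrative only).
import torch
import torch.nn as nn

class PointAppearance(nn.Module):
    def __init__(self, n_points, feat_dim=32, img_emb_dim=16):
        super().__init__()
        self.intrinsic = nn.Parameter(torch.randn(n_points, feat_dim) * 0.01)  # material-like, fixed per point
        self.dynamic = nn.Parameter(torch.randn(n_points, feat_dim) * 0.01)    # reacts to environment
        self.color_mlp = nn.Sequential(
            nn.Linear(feat_dim * 2 + img_emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 3), nn.Sigmoid(),
        )

    def forward(self, image_embedding):
        # image_embedding: (img_emb_dim,) per-image code for illumination/weather
        n = self.intrinsic.shape[0]
        cond = image_embedding.expand(n, -1)
        feats = torch.cat([self.intrinsic, self.dynamic, cond], dim=-1)
        return self.color_mlp(feats)  # per-point RGB under this image's conditions
```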


# 307
Strong Double Blind
3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting

Zhe Jun Tang · Tat-Jen Cham

The use of 3D Gaussians as a representation of radiance fields has enabled high-quality novel view synthesis at real-time rendering speed. However, the choice of optimising the outgoing radiance of each Gaussian independently as spherical harmonics results in unsatisfactory view-dependent effects. In response to these limitations, our work, Factorised Tensorial Illumination for 3D Gaussian Splatting (3iGS), improves upon 3D Gaussian Splatting (3DGS) rendering quality. Instead of optimising a single outgoing radiance parameter, 3iGS enhances 3DGS view-dependent effects by expressing the outgoing radiance as a function of a local illumination field and Bidirectional Reflectance Distribution Function (BRDF) features. We optimise a continuous incident illumination field through a Tensorial Factorisation representation, while separately fine-tuning the BRDF features of each 3D Gaussian relative to this illumination field. Our methodology significantly enhances the rendering quality of specular view-dependent effects of 3DGS, while maintaining rapid training and rendering speeds. Code will be released.


# 297
HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes

Zhuopeng Li · Yilin Zhang · Chenming Wu · Jianke Zhu · Liangjun Zhang

The rapid growth of 3D Gaussian Splatting (3DGS) has revolutionized neural rendering, enabling real-time production of high-quality renderings. However, previous 3DGS-based methods have limitations in urban scenes due to reliance on initial Structure-from-Motion (SfM) points and difficulties in rendering distant, sky, and low-texture areas. To overcome these challenges, we propose a hybrid optimization method named HO-Gaussian, which combines a grid-based volume with the 3DGS pipeline. HO-Gaussian eliminates the dependency on SfM point initialization, allowing for rendering of urban scenes, and incorporates point densification to enhance rendering quality in problematic regions during training. Furthermore, we introduce Gaussian Direction Encoding as an alternative to spherical harmonics in the rendering pipeline, which enables view-dependent color representation. To adapt the 3DGS method for multi-camera systems, we introduce neural warping to enhance object consistency across different cameras. Experimental results on widely used autonomous driving datasets demonstrate that HO-Gaussian achieves photo-realistic rendering in real-time on multi-camera urban datasets.


# 293
GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering

Yanyan Li · Chenyu Lyu · Yan Di · Guangyao Zhai · Gim Hee Lee · Federico Tombari

During the Gaussian Splatting optimization process, the scene's geometry can gradually deteriorate if its structure is not deliberately preserved, especially in non-textured regions such as walls, ceilings, and furniture surfaces. This degradation significantly affects the rendering quality of novel views that deviate significantly from the viewpoints in the training data. To mitigate this issue, we propose a novel approach called GeoGaussian. Based on the smoothly connected areas observed from point clouds, this method introduces a novel pipeline to initialize thin Gaussians aligned with the surfaces, and these characteristics are transferred to new generations of Gaussians through a carefully designed densification strategy. Finally, the pipeline ensures that the scene's geometry and texture are maintained through constrained optimization processes with explicit geometry constraints. Benefiting from the proposed architecture, the generative ability of 3D Gaussians is enhanced, especially in structured regions. Our proposed pipeline achieves state-of-the-art performance in novel view synthesis and geometric reconstruction, as evaluated qualitatively and quantitatively on public datasets.


# 296
EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS

Sharath Girish · Kamal Gupta · Abhinav Shrivastava

Recently, 3D Gaussian splatting (3D-GS) has gained popularity in novel-view scene synthesis. It addresses the challenges of lengthy training times and slow rendering speeds associated with Neural Radiance Fields (NeRFs). Through rapid, differentiable rasterization of 3D Gaussians, 3D-GS achieves real-time rendering and accelerated training. It does, however, demand substantial memory resources for both training and storage, as it requires millions of Gaussians in its point cloud representation for each scene. We present a technique utilizing quantized embeddings to significantly reduce per-point memory storage requirements and a coarse-to-fine training strategy for faster and more stable optimization of the Gaussian point clouds. Our approach develops a pruning stage which results in scene representations with fewer Gaussians, leading to faster training times and rendering speeds for real-time rendering of high-resolution scenes. We reduce storage memory by more than an order of magnitude while preserving reconstruction quality. We validate the effectiveness of our approach on a variety of datasets and scenes, preserving visual quality while consuming 10-20x less memory and achieving faster training and inference.
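
As an illustration of how quantized embeddings reduce per-point storage, the sketch below assigns each Gaussian's attribute embedding to its nearest codebook entry with a straight-through estimator and a standard VQ loss; only the small integer code needs to be stored per Gaussian. Codebook size, dimensions, and loss weights are assumptions, not the EAGLES implementation.

```python
# Minimal sketch of quantizing per-Gaussian attribute embeddings with a
# learned codebook and a straight-through estimator (illustrative only).
import torch
import torch.nn as nn

class QuantizedEmbedding(nn.Module):
    def __init__(self, n_codes=256, dim=16):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_codes, dim) * 0.1)

    def forward(self, z):
        # z: (N, dim) continuous per-Gaussian embeddings
        dists = torch.cdist(z, self.codebook)     # (N, n_codes) pairwise distances
        idx = dists.argmin(dim=-1)                # integer codes stored per Gaussian
        q = self.codebook[idx]                    # (N, dim) quantized embeddings
        st = z + (q - z).detach()                 # straight-through pass for the renderer
        # codebook loss pulls codes toward embeddings; commitment loss does the reverse
        vq_loss = (q - z.detach()).pow(2).mean() + 0.25 * (z - q.detach()).pow(2).mean()
        return st, idx, vq_loss
```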


# 295
End-to-End Rate-Distortion Optimized 3D Gaussian Representation

Henan Wang · Hanxin Zhu · Tianyu He · Runsen Feng · Jiajun Deng · Jiang Bian · Zhibo Chen

3D Gaussian Splatting (3DGS) has become an emerging technique with remarkable potential in 3D representation and image rendering. However, the substantial storage overhead of 3DGS significantly impedes its practical applications. In this work, we formulate compact 3D Gaussian learning as an end-to-end Rate-Distortion Optimization (RDO) problem and propose RDO-Gaussian, which can achieve flexible and continuous rate control. RDO-Gaussian addresses two main issues that exist in current schemes: 1) Different from prior endeavors that minimize the rate under a fixed distortion, we introduce dynamic pruning and entropy-constrained vector quantization (ECVQ) that optimize the rate and distortion at the same time. 2) Previous works treat the colors of each Gaussian equally, while we model the colors of different regions and materials with learnable numbers of parameters. We verify our method on both real and synthetic scenes, showcasing that RDO-Gaussian reduces the size of the 3D Gaussian representation by over 40x and surpasses existing methods in rate-distortion performance.
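
The end-to-end objective can be pictured as a standard rate-distortion trade-off, sketched below with a squared-error distortion term and a bit-cost rate term derived from the probabilities an entropy model assigns to the discrete codes in use. This is a generic illustration of joint rate-distortion optimization, not the paper's exact formulation.

```python
# Minimal sketch of a rate-distortion training objective for compact Gaussians.
import torch

def rate_distortion_loss(rendered, target, code_probs, lam=1e-3):
    """
    rendered, target: (3, H, W) images.
    code_probs: probabilities assigned by an entropy model to the discrete
                codes actually used, shape (N,); their total bit cost is the rate.
    lam trades bits against distortion; sweeping it traces the R-D curve.
    """
    distortion = (rendered - target).pow(2).mean()
    rate_bits = -(torch.log2(code_probs.clamp_min(1e-9))).sum()
    return distortion + lam * rate_bits
```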


# 291
DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting

Angelos Kratimenos · Jiahui Lei · Kostas Daniilidis

Accurately and efficiently modeling dynamic scenes and motions is a challenging task due to temporal dynamics and motion complexity. To address these challenges, we propose DynMF, a compact and efficient representation that decomposes a dynamic scene into a few neural trajectories. We argue that the per-point motions of a dynamic scene can be decomposed into a small set of explicit or learned trajectories. Our carefully designed neural framework, consisting of a tiny set of learned basis trajectories queried only in time, allows for rendering speed similar to 3D Gaussian Splatting, surpassing 120 FPS, while requiring only double the storage compared to static scenes. Our neural representation adequately constrains the inherently underconstrained motion field of a dynamic scene, leading to effective and fast optimization. This is done by binding each point to motion coefficients that enforce the per-point sharing of basis trajectories. By carefully applying a sparsity loss to the motion coefficients, we are able to disentangle the motions that comprise the scene, independently control them, and generate novel motion combinations that have never been seen before. We can reach state-of-the-art rendering quality within just 5 minutes of training, and in less than half an hour we can synthesize novel views of dynamic scenes with superior photorealistic quality. Our representation is interpretable, efficient, and expressive enough to offer real-time view synthesis of complex dynamic scene motions, in monocular and multi-view scenarios.
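
A minimal sketch of this factorization is shown below: a tiny MLP queried only on time produces a handful of basis displacements, each point mixes them with learnable coefficients, and an L1 penalty on the coefficients encourages sparsity. The sizes and names are illustrative assumptions, not the DynMF code.

```python
# Minimal sketch of factorising per-point motion into shared neural basis
# trajectories with per-point mixing coefficients (illustrative only).
import torch
import torch.nn as nn

class MotionBasis(nn.Module):
    def __init__(self, n_points, n_basis=16, hidden=64):
        super().__init__()
        # tiny MLP queried only on time t, producing all basis displacements at once
        self.basis_net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_basis * 3),
        )
        # per-point mixing coefficients; a sparsity loss encourages each point
        # to follow only a few of the shared trajectories
        self.coeffs = nn.Parameter(torch.randn(n_points, n_basis) * 0.01)

    def forward(self, t):
        # t: scalar tensor in [0, 1]
        basis = self.basis_net(t.view(1, 1)).view(-1, 3)  # (n_basis, 3)
        displacement = self.coeffs @ basis                # (n_points, 3)
        sparsity = self.coeffs.abs().mean()               # L1 penalty on coefficients
        return displacement, sparsity

# deformed positions: canonical_positions + displacement at query time t
```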


# 278
Strong Double Blind
Human Hair Reconstruction with Strand-Aligned 3D Gaussians

Egor Zakharov · Vanessa Sklyarova · Michael J. Black · Giljoo Nam · Justus Thies · Otmar Hilliges

We introduce a new hair modeling method that uses a dual representation of classical hair strands and 3D Gaussians to produce accurate and realistic strand-based reconstructions from multi-view data. In contrast to recent approaches that leverage unstructured Gaussians to model human avatars, our method enforces a reconstruction of the hair in the form of 3D polylines, or strands. This fundamental difference allows us to use the resulting hairstyles out-of-the-box in modern computer graphics engines for editing, rendering, and simulation. To reconstruct strand-based hair from images, we introduce a new 3D line lifting method that utilizes unstructured Gaussians to represent the hairstyle's 3D surface. We use this intermediate reconstruction to generate multi-view geometric ground truth data to supervise the fitting of the hair strands. The hairstyle itself is represented in the form of the so-called strand-aligned 3D Gaussians. This representation allows us to combine strand-based hair priors, which are essential for realistic modeling of the inner structure of hairstyles, with the differentiable rendering capabilities of 3D Gaussian Splatting. We evaluate our method on synthetic and real hairstyles and demonstrate state-of-the-art performance in the task of strand-based hair reconstruction.


# 289
Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting

Jeongmin Bae · Seoha Kim · Youngsik Yun · Hahyun Lee · Gun Bang · Youngjung Uh

As 3D Gaussian Splatting (3DGS) provides fast and high-quality novel view synthesis, it is a natural extension to deform a canonical 3DGS to multiple frames. However, we find that previous works fail to accurately reconstruct dynamic scenes, especially when 1) static parts move along with nearby dynamic parts and 2) some motions appear blurry. We attribute the failure to a flawed design of the deformation field: it is built as a coordinate-based function even though 3DGS is a mixture of multiple fields centered at the Gaussians. Furthermore, previous methods consider only single-resolution temporal embeddings. To this end, we define the deformation as a function of per-Gaussian embeddings and temporal embeddings. Moreover, we decompose deformation into coarse and fine deformations to model slow and fast movements, respectively. Last but not least, we introduce a training strategy for faster convergence and higher quality. Code will be available online.
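
The sketch below illustrates the idea of defining deformation on per-Gaussian embeddings and temporal embeddings, with separate coarse and fine branches for slow and fast motion. Dimensions, branch design, and names are assumptions for illustration only.

```python
# Minimal sketch of per-Gaussian embedding-based deformation with coarse
# and fine branches (illustrative only).
import torch
import torch.nn as nn

class PerGaussianDeform(nn.Module):
    def __init__(self, n_gaussians, g_dim=32, t_dim=16):
        super().__init__()
        self.g_embed = nn.Parameter(torch.randn(n_gaussians, g_dim) * 0.01)
        self.coarse = nn.Sequential(nn.Linear(g_dim + t_dim, 64), nn.ReLU(), nn.Linear(64, 3))
        self.fine = nn.Sequential(nn.Linear(g_dim + t_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, t_embed_coarse, t_embed_fine):
        # t_embed_*: (t_dim,) temporal embeddings at different resolutions
        n = self.g_embed.shape[0]
        x_c = torch.cat([self.g_embed, t_embed_coarse.expand(n, -1)], dim=-1)
        x_f = torch.cat([self.g_embed, t_embed_fine.expand(n, -1)], dim=-1)
        # slow, large-scale motion plus fast, detailed motion
        return self.coarse(x_c) + self.fine(x_f)  # (n_gaussians, 3) position offsets
```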


# 294
Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views

Yabo Chen · Jiemin Fang · Yuyang Huang · Taoran Yi · Xiaopeng Zhang · Lingxi Xie · Xinggang Wang · Wenrui Dai · Hongkai Xiong · Qi Tian

Synthesizing multi-view 3D from one single image is a significant but challenging task. Zero-1-to-3 methods have achieved great success by lifting a 2D latent diffusion model to the 3D scope. The target-view image is generated with a single-view source image and the camera pose as condition information. However, due to the high sparsity of the single input image, Zero-1-to-3 tends to produce geometry and appearance inconsistency across views, especially for complex objects. To tackle this issue, we propose to supply more condition information to the generation model, but in a self-prompted way. A cascade framework is constructed with two Zero-1-to-3 models, named Cascade-Zero123, which progressively extracts 3D information from the source image. Specifically, several nearby views are first generated by the first model and then fed into the second-stage model along with the source image as generation conditions. With amplified self-prompted condition images, our Cascade-Zero123 generates more consistent novel-view images than Zero-1-to-3. The improvement is significant for various complex and challenging scenes, including insects, humans, transparent objects, and stacks of multiple objects.


# 285
SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

Zijie Wu · Chaohui Yu · Yanqin Jiang · Chenjie Cao · Fan Wang · Xiang Bai

Recent advances in 2D/3D generative models enable the generation of dynamic 3D objects from a single-view video. Existing approaches utilize score distillation sampling to form the dynamic scene as a dynamic NeRF or dense 3D Gaussians. However, these methods struggle to strike a balance among reference view alignment, spatio-temporal consistency, and motion fidelity under single-view conditions due to the implicit nature of NeRF or the intricate dense Gaussian motion prediction. To address these issues, this paper proposes an efficient, sparse-controlled video-to-4D framework named SC4D, which decouples motion and appearance to achieve superior video-to-4D generation. Moreover, we introduce Adaptive Gaussian (AG) initialization and a Gaussian Alignment (GA) loss to mitigate the shape degeneration issue, ensuring the fidelity of the learned motion and shape. Comprehensive experimental results demonstrate that our method surpasses existing methods in both quality and efficiency. In addition, facilitated by the disentangled modeling of motion and appearance of SC4D, we devise a novel application that seamlessly transfers the learned motion onto a diverse array of 4D entities according to textual descriptions.


# 283
MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Shitao Tang · Jiacheng Chen · Dilin Wang · Chengzhou Tang · Fuyang Zhang · Yuchen Fan · Vikas Chandra · Yasutaka Furukawa · Rakesh Ranjan

This paper presents a neural architecture, MVDiffHD, for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses. MVDiffHD achieves superior flexibility and scalability with two surprisingly simple ideas: 1) a "pose-free architecture" where standard self-attention among 2D latent features learns 3D consistency across an arbitrary number of conditional and generation views without explicitly using camera pose information; and 2) a "view dropout strategy" that discards a substantial number of output views during training, which reduces the training-time memory footprint and enables dense and high-resolution view synthesis at test time. We use Objaverse for training and Google Scanned Objects for evaluation with standard novel view synthesis and 3D reconstruction metrics, where MVDiffHD significantly outperforms the current state of the art. We also demonstrate a text-to-3D application example by combining MVDiffHD with a text-to-image generative model.
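
The view dropout idea can be sketched in a few lines: at each training step only a random subset of the output views is kept, and the loss is computed on that subset. The helper below is a generic illustration, not the paper's training code.

```python
# Minimal sketch of view dropout: keep a random subset of target views per step.
import torch

def drop_views(target_views, keep: int):
    """target_views: (V, C, H, W). Randomly retain `keep` of the V output views."""
    v = target_views.shape[0]
    idx = torch.randperm(v)[:keep]
    return target_views[idx], idx  # the loss is computed only on the kept views
```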


# 290
DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

Shijie Zhou · Zhiwen Fan · Dejia Xu · Haoran Chang · Pradyumna Chari · Tejas K Bharadwaj · Suya You · Zhangyang Wang · Achuta Kadambi

The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360 degree scene generation pipeline that facilitates the creation of comprehensive 360 degree scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. To address the regions left unobserved by single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360 degree perspective, providing an enhanced immersive experience over existing techniques. We will release the code.


# 286
CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

Zhengyi Wang · Yikai Wang · Yifei Chen · Chendong Xiang · Shuo Chen · Dajiang Yu · Chongxuan Li · Hang Su · Jun Zhu

Feed-forward 3D generative models like the Large Reconstruction Model (LRM) have demonstrated exceptional generation speed. However, the transformer-based methods do not leverage the geometric priors of the triplane component in their architecture, often leading to sub-optimal quality given the limited size of 3D data and slow training. In this work, we present the Convolutional Reconstruction Model (CRM), a high-fidelity feed-forward single image-to-3D generative model. Recognizing the limitations posed by sparse 3D data, we highlight the necessity of integrating geometric priors into network design. CRM builds on the key observation that the visualization of a triplane exhibits spatial correspondence with six orthographic images. First, it generates six orthographic view images from a single input image, then feeds these images into a convolutional U-Net, leveraging its strong pixel-level alignment capabilities and significant bandwidth to create a high-resolution triplane. CRM further employs Flexicubes as its geometric representation, facilitating direct end-to-end optimization on textured meshes. Overall, our model delivers a high-fidelity textured mesh from an image in just 10 seconds, without any test-time optimization.


# 98
Strong Double Blind
Sketch2Vox: Learning 3D Reconstruction from a Single Monocular Sketch Image

Fei Wang

3D reconstruction from a sketch offers an efficient means of boosting the productivity of 3D modeling. However, such a task remains largely under-explored due to the difficulties caused by the inherently abstract representation and diversity of sketches. In this paper, we introduce a novel deep neural network model, Sketch2Vox, for 3D reconstruction from a single monocular sketch. Taking a sketch as input, the proposed model first converts it into two different representations, i.e., a binary image and a 2D point cloud. Second, we extract semantic features from them using two newly developed processing modules: the SktConv module, designed for hierarchical abstract feature learning from the binary image, and the SktMPFM module, designed for local and global context feature extraction from the 2D point cloud. Prior to feeding features into the 3D-decoder-refiner module for fine-grained reconstruction, the resultant image-based and point-based feature maps are fused together according to their internal correlation using the proposed cross-modal fusion attention module. Finally, we use an optimization module to refine the details of the generated 3D model. To evaluate the efficiency of our method, we collect a large dataset consisting of more than 12,000 Sketch-Voxel pairs and compare the proposed Sketch2Vox against several state-of-the-art methods. The experimental results demonstrate that the proposed method outperforms its peers in reconstruction quality. The dataset is publicly available on https://drive.google.com/file/d/1aXug8PcLnWaDZiWZrcmhvVNFC4n_eAih/view?usp=sharing.


# 30
Strong Double Blind
Lagrangian Hashing for Compressed Neural Field Representations

Shrisudhan Govindarajan · Zeno Sambugaro · Akhmedkhan Shabanov · Towaki Takikawa · Weiwei Sun · Daniel Rebain · Nicola Conci · Kwang Moo Yi · Andrea Tagliasacchi

We present Lagrangian Hashing, a representation for neural fields combining the characteristics of fast-training NeRF methods that rely on Eulerian grids (i.e., InstantNGP) with those that employ points equipped with features as a way to represent information (e.g., 3D Gaussian Splatting or PointNeRF). We achieve this by incorporating a point-based representation into the high-resolution layers of the hierarchical hash tables of an InstantNGP representation. As our points are equipped with a field of influence, our representation can be interpreted as a mixture of Gaussians stored within the hash table. We propose a loss that encourages the movement of our Gaussians towards regions that require more representation budget to be sufficiently well represented. Our main finding is that our representation allows the reconstruction of signals using a more compact representation without compromising quality.


# 261
GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing

Jing Wu · Jiawang Bian · Xinghui Li · Guangrun Wang · Ian Reid · Philip Torr · Victor Adrian Prisacariu

We propose GaussCtrl, a text-driven method to edit a 3D scene reconstructed by 3D Gaussian Splatting (3DGS). Our method first renders a collection of images using the 3DGS and edits them with a pre-trained 2D diffusion model (ControlNet) based on the input prompt, which is then used to optimise the 3D model. Our key contribution is multi-view consistent editing, which enables editing all images together instead of iteratively editing one image while updating the 3D model, as in previous works. This leads to faster editing as well as higher visual quality. It is achieved through two components: (a) depth-conditioned editing, which enforces geometric consistency across multi-view images by leveraging naturally consistent depth maps; and (b) attention-based latent code alignment, which unifies the appearance of edited images by conditioning their editing on several reference views through self- and cross-view attention between the images' latent representations. Experiments demonstrate that our method achieves faster editing and better visual results than previous state-of-the-art methods.


# 265
Strong Double Blind
Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

Shuangkang Fang · Yufeng Wang · Yi-Hsuan Tsai · Yi Yang · Wenrui Ding · Shuchang Zhou · Ming-Hsuan Yang

Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing. However, existing schemes for 3D scene editing still exhibit certain shortcomings, hindering their further interactive design. Such schemes typically adhere to fixed input patterns, limiting users' flexibility in text input. Moreover, their editing capabilities are constrained by a single or a few 2D visual models and require intricate pipeline design to integrate these models into 3D reconstruction processes. To address the aforementioned issues, we propose a dialogue-based 3D scene editing approach, termed CE3D, which is centered around a large language model that allows for arbitrary textual input from users and interprets their intentions, subsequently facilitating the autonomous invocation of the corresponding visual expert models. Furthermore, we design a scheme utilizing Hash-Atlas to represent 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images. This design achieves complete decoupling between the 2D editing and 3D reconstruction processes, enabling CE3D to flexibly integrate a wide range of existing 2D or 3D visual models without necessitating intricate fusion designs. Experimental results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse editing visual effects, possessing strong scene comprehension and multi-round dialog capabilities.


# 281
TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation

Nikolai Kalischek · Torben Peters · Jan Dirk Wegner · Konrad Schindler

Probabilistic denoising diffusion models (DDMs) have set a new standard for 2D image generation. Extending DDMs for 3D content creation is an active field of research. Here, we propose TetraDiffusion, a diffusion model that operates on a tetrahedral partitioning of 3D space to enable efficient, high-resolution 3D shape generation. Our model introduces operators for convolution and transpose convolution that act directly on the tetrahedral partition, and seamlessly includes additional attributes like color. Our design generates mesh geometry much more efficiently: Compared to existing mesh diffusion techniques, TetraDiffusion is up to 200x faster. At the same time, it reduces memory consumption and can operate at substantially higher resolution than existing mesh generators. Using only standard consumer hardware, it sets a new standard in terms of spatial detail and outperforms other mesh generators across a range of quality metrics.


# 256
Strong Double Blind
TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling

Dong Huo · Zixin Guo · Xinxin Zuo · Zhihao Shi · Juwei Lu · Peng Dai · Xu Songcen · Li Cheng · Yee-Hong Yang

Given a 3D mesh, we aim to synthesize 3D textures that correspond to arbitrary textual descriptions. Current methods for generating and assembling textures from sampled views often result in prominent seams or excessive smoothing. To tackle these issues, we present TexGen, a novel multi-view sampling and resampling framework for texture generation leveraging a pre-trained text-to-image diffusion model. For view-consistent sampling, we first maintain a texture map in RGB space that is parameterized by the denoising step and updated after each sampling step of the diffusion model to progressively reduce the view discrepancy. An attention-guided multi-view sampling strategy is exploited to broadcast the appearance information across views. To preserve texture details, we develop a noise resampling technique that aids in the estimation of noise, generating inputs for subsequent denoising steps, as directed by the text prompt and current texture map. Through extensive qualitative and quantitative evaluations, we demonstrate that our proposed method produces significantly better texture quality for diverse 3D objects with a high degree of view consistency and rich appearance details, outperforming current state-of-the-art methods. Furthermore, our proposed texture generation technique can also be applied to texture editing while preserving the original identity. More experimental results are available at https://anonymousachilles.github.io/


# 239
Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation

Xiaofeng Yang · Yiwen Chen · Cheng Chen · Chi Zhang · Yi Xu · Xulei Yang · Fayao Liu · Guosheng Lin

In this paper, we propose a unified framework aimed at enhancing the diffusion priors for 3D generation tasks. Despite the critical importance of these tasks, existing methodologies often struggle to generate high-quality results. We begin by examining the inherent limitations of previous diffusion priors. We identify a divergence between the diffusion priors and the training procedures of diffusion models that substantially impairs the quality of 3D generation. To address this issue, we propose a novel, unified framework that iteratively optimizes both the 3D model and the diffusion prior. Leveraging the different learnable parameters of the diffusion prior, our approach offers multiple configurations, affording various trade-offs between performance and implementation complexity. Notably, our experimental results demonstrate that our method markedly surpasses existing techniques, establishing a new state of the art in text-to-3D generation. Additionally, our framework yields insightful contributions to the understanding of recent score distillation methods, such as the VSD loss and CSD loss.


# 280
LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis

Kevin Xie · Tianshi Cao · Jonathan P Lorraine · Jun Gao · James R Lucas · Antonio Torralba · Sanja Fidler · Xiaohui Zeng

Recent text-to-3D generation approaches produce impressive 3D results but require time-consuming optimization that can take up to an hour per prompt. Amortized methods like ATT3D optimize multiple prompts simultaneously to improve efficiency, enabling fast text-to-3D synthesis. However, ATT3D cannot capture high-frequency geometry and texture details and struggles to scale to large prompt sets, so it generalizes poorly. We introduce Latte3D, addressing these limitations to achieve fast, high-quality generation on a significantly larger prompt set. Key to our method is 1) building a scalable architecture for amortized learning and 2) leveraging 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization to achieve robustness to diverse and complex training prompts. Latte3D amortizes both neural field generation and textured surface generation to produce highly detailed textured meshes in a single forward pass. Latte3D generates 3D objects in 400ms, and can be further enhanced with fast test-time optimization.


# 258
Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation

Fangfu Liu · Hanyang Wang · Weiliang Chen · Haowen Sun · Yueqi Duan

Recent years have witnessed the strong power of 3D generation models, which offer a new level of creative flexibility by allowing users to guide the 3D content generation process through a single image or natural language. However, it remains challenging for existing 3D generation methods to create subject-driven 3D content across diverse prompts. In this paper, we introduce a novel 3D customization method, dubbed Make-Your-3D, which can personalize high-fidelity and consistent 3D content from only a single image of a subject with a text description within 5 minutes. Our key insight is to harmonize the distributions of a multi-view diffusion model and an identity-specific 2D generative model, aligning them with the distribution of the desired 3D subject. Specifically, we design a co-evolution framework to reduce the variance of the distributions, where each model undergoes a process of learning from the other through identity-aware optimization and subject-prior optimization, respectively. Extensive experiments demonstrate that our method can produce high-quality, consistent, and subject-specific 3D content with text-driven modifications that are unseen in the subject image.


# 266
Synthesizing Environment-Specific People in Photographs

Mirela Ostrek · Carol O'Sullivan · Michael J. Black · Justus Thies

Despite significant progress in generative image synthesis, and full-body generation in particular, state-of-the-art methods are either context-independent, overly reliant on text prompts, or bound to specific, curated training datasets, such as fashion images with monotonous backgrounds. Here, our goal is to generate people wearing clothing that is semantically appropriate for a given scene. To this end, we present ESP, a novel method for context-aware full-body generation that enables photo-realistic inpainting of people into existing "in-the-wild" photographs. ESP is conditioned on a 2D pose and contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are then used as tight guiding masks for inpainting, and no changes are made to the original background. Our models are trained on a dataset containing a set of in-the-wild photographs of people covering a wide range of different environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state-of-the-art on the task of contextual full-body generation. Code will be public for research.


# 259
Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

Phuong Dam · Jihoon Jeong · Anh Tran · Daeyoung Kim

This study discusses the critical issues of Virtual Try-On in contemporary e-commerce and the prospective metaverse, emphasizing the challenges of preserving intricate texture details and distinctive features of the target person and the clothes in various scenarios, such as clothing texture and identity characteristics like tattoos or accessories. In addition to the fidelity of the synthesized images, the efficiency of the synthesis process presents a significant hurdle. Various existing approaches are explored, highlighting the limitations and unresolved aspects, e.g., identity information omission, uncontrollable artifacts, and low synthesis speed. We then propose a novel diffusion-based solution that addresses garment texture preservation and user identity retention during virtual try-on. The proposed network comprises two primary modules: a warping module that aligns clothing with individual features, and a try-on module that refines the attire and generates missing parts, integrated with a mask-aware post-processing technique that preserves the integrity of the individual's identity. It demonstrates impressive results, surpassing the state-of-the-art in speed by nearly 20 times during inference, with superior fidelity in qualitative assessments. Quantitative evaluations confirm comparable performance with the recent SOTA method on the VITON-HD and Dresscode datasets.


# 272
Shapefusion: 3D localized human diffusion models

Rolandos Alexandros Potamias · Michael Tarasiou · Stylianos Ploumpis · Stefanos Zafeiriou

In the realm of 3D computer vision, parametric models have emerged as a ground-breaking methodology for the creation of realistic and expressive 3D avatars. Traditionally, they rely on Principal Component Analysis (PCA), given its ability to decompose data into an orthonormal space that maximally captures shape variations. However, due to the orthogonality constraints and the global nature of PCA's decomposition, these models struggle to perform localized and disentangled editing of 3D shapes, which severely affects their use in applications requiring fine control such as face sculpting. In this paper, we leverage diffusion models to enable diverse and fully localized edits on 3D meshes, while completely preserving the un-edited regions. We propose an effective diffusion masking training strategy that, by design, facilitates localized manipulation of any shape region, without being limited to predefined regions or to sparse sets of predefined control vertices. Following our framework, a user can explicitly set their manipulation region of choice and define an arbitrary set of vertices as handles to edit a 3D mesh. Compared to the current state of the art, our method leads to more interpretable shape manipulations than methods relying on latent code state, offers greater localization and generation diversity, and provides faster inference than optimization-based approaches. Code has been included in the supplementary material.


# 282
Strong Double Blind
Fast Sprite Decomposition from Animated Graphics

Tomoyuki Suzuki · Kotaro Kikuchi · Kota Yamaguchi

This paper presents an approach to decomposing animated graphics into sprites, a set of basic elements or layers. Our approach builds on the optimization of sprite parameters to fit the raster video. For efficiency, we assume static textures for sprites to reduce the search space while preventing artifacts using a texture prior model. To further speed up the optimization, we introduce the initialization of the sprite parameters utilizing a pre-trained video object segmentation model and user input of single frame annotations. For our study, we construct the Crello Animation dataset from an online design service and define quantitative metrics to measure the quality of the extracted sprites. Experiments show that our method significantly outperforms baselines for similar decomposition tasks in terms of the quality/efficiency tradeoff.
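
The optimization target can be sketched as differentiable compositing: each static RGBA sprite is warped by its per-frame affine parameters and composited over a background, and the result is compared with the raster frame. The sketch below is a simplified illustration (single frame, affine-only warps), not the paper's implementation.

```python
# Minimal sketch of reconstructing a frame by compositing static RGBA sprites
# warped with per-frame affine parameters (illustrative only).
import torch
import torch.nn.functional as F

def render_frame(sprites, thetas, background, out_hw):
    """
    sprites:    list of (4, h, w) learnable RGBA textures (static over time).
    thetas:     list of (2, 3) affine matrices for this frame, one per sprite.
    background: (3, H, W) learnable background layer.
    """
    H, W = out_hw
    frame = background.clone()
    for sprite, theta in zip(sprites, thetas):
        grid = F.affine_grid(theta.unsqueeze(0), size=(1, 4, H, W), align_corners=False)
        warped = F.grid_sample(sprite.unsqueeze(0), grid, align_corners=False)[0]
        rgb, alpha = warped[:3], warped[3:4].clamp(0, 1)
        frame = alpha * rgb + (1 - alpha) * frame  # back-to-front "over" compositing
    return frame  # compared with the raster video frame via a reconstruction loss
```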


# 269
Strong Double Blind
Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution

Mridul Khurana · Arka Daw · M. Maruf · Josef C. Uyeda · Wasila Dahdul · Caleb Charpentier · Yasin Bakış · Henry L. Bart · Paula M. Mabee · Hilmar Lapp · James P. Balhoff · Wei-Lun Chao · Charles Stewart · Tanya Berger-Wolf · Anuj Karpatne

A central problem in evolutionary biology is to explore the genetic basis of evolutionary changes in the traits of organisms, such as fin structures in fish or beak colors in birds. With the growing availability of large-scale image repositories in biology and recent advances in generative modeling, there is an opportunity to study changes in evolutionary traits of species automatically from images. We introduce a novel Hierarchical Embedding (HIER-Embed) strategy to encode the evolutionary information of a species as a composition of encodings learned at every internal node in the phylogenetic tree. We use HIER-Embeddings to condition latent diffusion models to generate synthetic images of species. Further, we introduce two novel types of perturbation operations: trait masking and trait swapping, similar in spirit to gene knockout experiments, that enable us to analyze novel changes in evolutionary traits acquired at different levels of phylogeny.


# 263
Strong Double Blind
WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation

Zirui Shao · Feiyu Gao · Hangdi Xing · Zepeng Zhu · Zhi Yu · Jiajun Bu · Qi Zheng · Cong Yao

In an era of content-creation revolution propelled by advancements in generative models, the field of web design remains largely unexplored despite its critical role in modern digital communication. The web design process is complex and often time-consuming, especially for those with limited expertise. In this paper, we introduce Web Rendering Parameters Generation (WebRPG), a new task that aims at automating the generation of the visual presentation of web pages based on their HTML code. WebRPG would contribute to a faster web development workflow. Since there is no existing benchmark available, we develop a new dataset for WebRPG through an automated pipeline. Moreover, we present baseline models, utilizing a VAE to manage numerous elements and rendering parameters, along with a custom HTML embedding for capturing essential semantic and hierarchical information from HTML. Extensive experiments, including customized quantitative evaluations for this specific task, are conducted to evaluate the quality of the generated results.


# 251
Dolfin: Diffusion Layout Transformers without Autoencoder

Yilin Wang · Zeyuan Chen · Liangjun Zhong · Zheng Ding · Zhuowen Tu

In this paper, we introduce a novel generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), which significantly improves the modeling capability with reduced complexity compared to existing methods. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal joint) sequence representation, we further propose an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing rich semantic correlations, such as alignment, size, overlap, and neighborhood, between layout items/elements. When evaluated against standard generative layout benchmarks, Dolfin notably improves performance across various metrics, enhancing transparency and interoperability in the process. Moreover, Dolfin's applications extend beyond layout generation, making it suitable for modeling generative geometric structures, such as line segments. Our experiments present both qualitative and quantitative results to demonstrate the advantages of Dolfin.


# 284
Strong Double Blind
MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes

Casper van Engelenburg · Fatemeh Mostafavi · Emanuel Kuhn · Yuntae Jeon · Michael Franzen · Matthias Standfest · Jan Gemert · Seyran Khademi

Diverse and realistic floor plan data are essential for the development of useful computer-aided methods in architectural design. Today's large-scale floor plan datasets predominantly feature simple floor plan layouts, typically representing single-apartment dwellings only. To compensate for the mismatch between current datasets and the real world, we develop Modified Swiss Dwellings (MSD), the first large-scale floor plan dataset that contains a significant share of layouts of multi-apartment dwellings. MSD features over 5.3K floor plans of medium- to large-scale building complexes, covering over 18.9K distinct apartments. We validate that existing approaches for floor plan generation, while effective in simpler scenarios, cannot yet seamlessly address the challenges posed by MSD. Our benchmark calls for new research in floor plan machine understanding. Code and data are open.


# 288
RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion

Kyle Lo · Jorg Peters · Eric Spellman

Accurate completion and denoising of roof height maps are crucial to reconstructing high-quality 3D buildings. Repairing sparse points can enhance low-cost sensor use and reduce UAV flight overlap. RoofDiffusion is a new end-to-end self-supervised diffusion technique for robustly completing roof height maps, even particularly difficult ones. RoofDiffusion leverages widely available curated footprints and can thus handle up to 99% point sparsity and 80% roof area occlusion (regional incompleteness). A variant, RoofDiffusionNF, simultaneously predicts building footprints and heights. Both quantitatively outperform state-of-the-art unguided depth completion and representative inpainting methods for Digital Elevation Models (DEM), on both a roof-specific benchmark and the BuildingNet dataset. Qualitative assessments show the effectiveness of RoofDiffusion on datasets with real-world scans including AHN3, Dales3D, and USGS 3DEP LiDAR. Tested with the leading City3D algorithm, preprocessing height maps with RoofDiffusion noticeably improves 3D building reconstruction. RoofDiffusion is complemented by a new dataset of 13k complex roof geometries, focusing on long-tail issues in remote sensing; a novel simulation of tree occlusion; and a wide variety of large-area roof cut-outs for data augmentation and benchmarking. Code and dataset will be available on GitHub.


# 59
Strong Double Blind
Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds

Shengtao Li · Ge Gao · Yudong Liu · Ming Gu · Yushen Liu

Neural signed distance functions (SDFs) have shown a powerful ability to fit shape geometry. However, inferring continuous signed distance fields from discrete unoriented point clouds remains a challenge. The neural network typically fits the shape with a rough surface and omits fine-grained geometric details such as shape edges and corners. In this paper, we propose a novel non-linear implicit filter that smooths the implicit field while preserving high-frequency geometric details. Our novelty lies in filtering the surface (zero level set) using neighboring input points together with gradients of the signed distance field. By moving the input raw point clouds along the gradient, the proposed implicit filtering can be extended to non-zero level sets to maintain consistency between different level sets, which in turn results in better regularization of the zero level set. We conduct comprehensive experiments on surface reconstruction from object and complex scene point clouds; the numerical and visual comparisons demonstrate our improvements over state-of-the-art methods on widely used benchmarks.
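
The projection-and-average step described above admits a compact illustration. The sketch below is a minimal, hypothetical version of such a filter, assuming access to an SDF and its gradient: a toy analytic sphere stands in for a trained network, and the Gaussian weighting, neighborhood size, and point counts are illustrative choices rather than the paper's exact formulation. Each neighbor is moved along the SDF gradient onto the level set passing through the query point, and the projected neighbors are then averaged.

```python
import numpy as np

def sphere_sdf(x, r=1.0):
    # analytic SDF of a sphere; a stand-in for a trained neural SDF
    return np.linalg.norm(x, axis=-1) - r

def sdf_grad(x, eps=1e-4):
    # finite-difference gradient (a neural SDF would use autograd instead)
    g = np.zeros_like(x)
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        g[:, i] = (sphere_sdf(x + d) - sphere_sdf(x - d)) / (2 * eps)
    return g

def implicit_filter(query, neighbors, sigma=0.1):
    """Filter one query point: project each neighbor along the SDF gradient
    onto the level set passing through the query, then take a Gaussian-weighted
    average of the projections."""
    n = sdf_grad(neighbors)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    shift = (sphere_sdf(neighbors) - sphere_sdf(query[None]))[:, None] * n
    projected = neighbors - shift
    w = np.exp(-np.sum((projected - query) ** 2, axis=-1) / (2 * sigma ** 2))
    return (w[:, None] * projected).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
pts = rng.normal(size=(512, 3))
pts /= np.linalg.norm(pts, axis=-1, keepdims=True)   # samples on the unit sphere
noisy = pts + 0.02 * rng.normal(size=pts.shape)      # noisy observation

q = noisy[0]
d2 = np.sum((noisy[1:] - q) ** 2, axis=1)
nbrs = noisy[1:][np.argsort(d2)[:16]]                # 16 nearest neighbours of q
print(q, implicit_filter(q, nbrs))                   # filtered point lies closer to the surface
```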


# 63
Strong Double Blind
FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation

tianyu zhang · Guocheng Qian · Jin Xie · Jian Yang

Point cloud frame interpolation is a challenging task that involves accurately estimating scene flow across frames while maintaining the geometric structure. Prevailing techniques often rely on pre-trained motion estimators or intensive test-time optimization, resulting in compromised interpolation accuracy or prolonged inference. This work presents FastPCI, which introduces a Pyramid Convolution-Transformer architecture for point cloud frame interpolation. Our hybrid Convolution-Transformer improves local and long-range feature learning, while the pyramid network offers multilevel features and reduces computation. In addition, FastPCI proposes a unique Dual-Direction Motion-Structure block for more accurate scene flow estimation. Our design is motivated by two facts: (1) accurate scene flow preserves 3D structure, and (2) the point cloud at the previous timestep should be reconstructable using reverse motion from the future timestep. Extensive experiments show that FastPCI significantly outperforms the state-of-the-art PointINet and NeuralPCI with notable gains (26.6% and 18.3% reduction in Chamfer Distance on KITTI), while being more than 10x and 600x faster, respectively.


# 58
Strong Double Blind
T-CorresNet: Template Guided 3D Point Cloud Completion with Correspondence Pooling Query Generation Strategy

Fan Duan · Jiahao Yu · Li Chen

Point clouds are commonly used in various practical applications such as autonomous driving and the manufacturing industry. However, these point clouds often suffer from incompleteness due to limited perspectives, scanner resolution, and occlusion. Therefore, predicting the missing parts is a crucial task. In this paper, we propose a novel method for point cloud completion. We utilize a spherical template to guide the generation of a coarse complete template and generate dynamic query tokens through a correspondence pooling (Corres-Pooling) query generator. Specifically, we first generate the coarse complete template by embedding a Gaussian spherical template into the partial input and transforming the template to best match the input. Then we use the Corres-Pooling query generator to refine the coarse template and generate dynamic query tokens, which are used to predict the complete point proxies. Finally, we generate the complete point cloud with a FoldingNet following the coarse-to-fine paradigm, according to the fine template and the predicted point proxies. Experimental results demonstrate that our T-CorresNet outperforms state-of-the-art methods on several benchmarks. We will release our code upon acceptance.


# 60
Strong Double Blind
SEED: A Simple and Effective 3D DETR in Point Clouds

Zhe Liu · Jinghua Hou · Xiaoqing Ye · Tong Wang · Jingdong Wang · Xiang Bai

Recently, detection transformers (DETRs) have gradually taken a dominant position in 2D detection thanks to their elegant framework. However, DETR-based detectors for 3D point clouds still struggle to achieve satisfactory performance. We argue that the main challenges are twofold: 1) obtaining appropriate object queries is difficult due to the high sparsity and uneven distribution of point clouds; 2) how to implement effective query interaction by exploiting the rich geometric structure of point clouds is not fully explored. To this end, we propose a Simple and EffEctive 3D DETR method (SEED) for detecting 3D objects from point clouds, which involves a dual query selection (DQS) module and a deformable grid attention (DGA) module. More concretely, to obtain appropriate queries, DQS first ensures high recall by retaining a large number of queries according to the predicted confidence scores, and then picks out high-quality queries according to the estimated quality scores. DGA uniformly divides each reference box into grids as the reference points and then utilizes the predicted offsets to achieve a flexible receptive field, allowing the network to focus on relevant regions and capture more informative features. Extensive ablation studies on DQS and DGA demonstrate their effectiveness. Furthermore, our SEED achieves state-of-the-art detection performance on both the large-scale Waymo and nuScenes datasets, illustrating the superiority of the proposed method. Code will be available.
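
As a rough illustration of the two-stage selection behind DQS, the sketch below first keeps a large pool of candidates by confidence (to preserve recall) and then retains the highest estimated-quality queries from that pool. Pool sizes, score names, and tensor shapes are illustrative assumptions, not SEED's actual configuration.

```python
import torch

def dual_query_selection(feats, conf, quality, keep_recall=512, keep_final=128):
    """Two-stage query selection sketch.
    feats:   (N, C) candidate query features
    conf:    (N,)   predicted confidence scores
    quality: (N,)   predicted quality scores (e.g., IoU estimates)
    """
    # stage 1: keep a large pool by confidence to preserve recall
    idx1 = conf.topk(min(keep_recall, conf.numel())).indices
    # stage 2: within that pool, keep the highest estimated-quality queries
    idx2 = quality[idx1].topk(min(keep_final, idx1.numel())).indices
    keep = idx1[idx2]
    return feats[keep], keep

N, C = 4096, 256
feats = torch.randn(N, C)
conf, quality = torch.rand(N), torch.rand(N)
queries, keep = dual_query_selection(feats, conf, quality)
print(queries.shape, keep.shape)   # torch.Size([128, 256]) torch.Size([128])
```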


# 61
Strong Double Blind
ProtoComp: Diverse Point Cloud Completion with Controllable Prototype

Xumin Yu · Yanbo Wang · Jie Zhou · Jiwen Lu

Point cloud completion aims to reconstruct the geometry of partial point clouds captured by various sensors. Traditionally, training a point cloud completion model is carried out on synthetic datasets, which have limited categories and deviate significantly from real-world scenarios. This disparity often leads existing methods to struggle with unfamiliar categories and severe incompleteness in real-world situations. In this paper, we propose PrototypeCompletion, a novel prototype-based approach for point cloud completion. It begins by generating rough prototypes and subsequently augments them with additional geometric details for the final prediction. With just a few hundred pairs of partial-complete point cloud data, our approach effectively handles point clouds from diverse real-world scenarios, including indoor ScanNet and outdoor KITTI. Additionally, we propose a new metric and test benchmark based on ScanNet200 and KITTI to evaluate the model's performance in real-world scenarios, aiming to promote future research. Experimental results demonstrate that our method outperforms state-of-the-art methods on the existing PCN benchmark and excels in various real-world situations with different object categories and sensors. The code will be made available.


# 62
Strong Double Blind
CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation

Hajin Shim · Changhun Kim · Eunho Yang

3D point clouds captured from real-world sensors frequently contain noisy points due to various obstacles, such as occlusion, limited resolution, and variations in scale. This poses challenges when deploying pre-trained point cloud recognition models trained on clean point clouds, leading to significant performance degradation. While test-time adaptation (TTA) strategies have shown promising results in addressing this issue in the 2D domain, their application to 3D point clouds remains under-explored. Among TTA methods, an input adaptation approach, which directly converts test instances to the source domain using a pre-trained diffusion model, has been proposed in the 2D domain. Despite its robust TTA performance in practical situations, naively adopting this approach in the 3D domain may be suboptimal, as it neglects the inherent properties of point clouds and incurs a prohibitive computational cost. Motivated by these limitations, we propose CloudFixer, a test-time input adaptation method tailored for 3D point clouds that employs a pre-trained diffusion model. Specifically, CloudFixer optimizes geometric transformation parameters with carefully designed objectives that leverage the geometric properties of point clouds. We also substantially improve computational efficiency by avoiding backpropagation through the diffusion model and the extensive generation process. Furthermore, we propose an online model adaptation strategy that aligns the original model prediction with that of the adapted input. Extensive experiments showcase the superiority of CloudFixer over various TTA baselines, excelling in handling common corruptions and natural distribution shifts across diverse real-world scenarios.


# 57
Strong Double Blind
Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes

Chao Chen · Yushen Liu · Zhizhong Han

It is challenging to reconstruct 3D point clouds in unseen classes from single 2D images. Rather than using an object-centered coordinate system, current methods generalize global priors learned from seen classes to reconstruct 3D shapes from unseen classes in a viewer-centered coordinate system. However, the reconstruction accuracy and interpretability still leave much room for improvement. To resolve this issue, we propose learning local pattern modularization for reconstructing 3D shapes in unseen classes, which achieves both good generalization ability and high reconstruction accuracy. Our insight is to learn a local prior that is class-agnostic and easy to generalize in an object-centered coordinate system. Specifically, the local prior is learned via a process of learning and customizing local pattern modularization in seen classes. During this process, we first learn a set of patterns in local regions, which serve as a basis in the object-centered coordinate system for representing an arbitrary region on shapes across different classes. Then, we modularize each region on an initially reconstructed shape using the learned local patterns. Based on that, we customize the local pattern modularization using the input image by refining the reconstruction with more details. Our method can reconstruct high-fidelity point clouds from unseen classes in an object-centered coordinate system without requiring a large number of patterns or any additional information, such as segmentation supervision or camera poses. Our experimental results on widely used benchmarks show that our method achieves state-of-the-art reconstruction accuracy for shapes from unseen classes.


# 65
Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains

Jaeyeul Kim · Jungwan Woo · Jeonghoon Kim · Sunghoon Im

In the realm of LiDAR-based perception, significant strides have been made, yet domain generalization remains a substantial challenge. The performance often deteriorates when models are applied to unfamiliar datasets with different LiDAR sensors or deployed in new environments, primarily due to variations in point cloud density distributions. To tackle this challenge, we propose a Density Discriminative Feature Embedding (DDFE) module, capitalizing on the observation that a single source LiDAR point cloud encompasses a spectrum of densities. The DDFE module is meticulously designed to extract density-specific features within a single source domain, facilitating the recognition of objects sharing similar density characteristics across different LiDAR sensors. In addition, we introduce a simple yet effective density augmentation technique aimed at expanding the spectrum of density in source data, thereby enhancing the capabilities of the DDFE. Our DDFE stands out as a versatile and lightweight domain generalization module. It can be seamlessly integrated into various 3D backbone networks, where it has demonstrated superior performance over current state-of-the-art domain generalization methods. We will make the source code publicly available to promote collaborative progress in the field.


# 64
Strong Double Blind
Multi-modal Relation Distillation for Unified 3D Representation Learning

Huiqun Wang · Yiping Bao · Panwang Pan · Zeming Li · Xiao Liu · Ruijie Yang · Di Huang

Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning multi-modal features across 3D shapes, corresponding 2D images, and language descriptions. However, this straightforward alignment often overlooks the intricate structural relationships among the samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pretraining framework designed to effectively distill state-of-the-art large multi-modal models into 3D backbones. MRD focuses on distilling both the intra-relations within each modality and the cross-relations between different modalities, aiming to produce more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification tasks and cross-modality retrieval tasks, delivering state-of-the-art performance.


# 94
NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Muhammad Zubair Irshad · Sergey Zakharov · Vitor Guizilini · Adrien Gaidon · Zsolt Kira · Rares Ambrus

Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world, such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images? Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as point clouds, where the information density can be uneven and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on the Front3D and ScanNet datasets, with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection. Project Page: https://nerf-mae.github.io


# 80
Strong Double Blind
Single-Photon 3D Imaging with Equi-Depth Photon Histograms

Kaustubh Sadekar · David Maier · Atul Ingle

Single-photon cameras (SPCs) present a promising avenue for high-resolution 3D imaging. They have ultra-high sensitivity---down to individual photons---and can record photon arrival times with extremely high (sub-nanosecond) resolution. Single-photon 3D cameras estimate the round-trip time of a laser pulse by forming equi-width (EW) histograms of detected photon timestamps. Acquiring and transferring such EW histograms requires high bandwidth and in-pixel memory, making SPCs less attractive for 3D-perception applications in resource-constrained settings such as mobile devices and AR/VR headsets. In this work we propose a new 3D sensing technique based on equi-depth (ED) histograms. ED histograms compress timestamp data more efficiently than EW histograms, reducing the bandwidth requirement. Moreover, to reduce the in-pixel memory requirement, we propose a lightweight algorithm to estimate ED histograms in an online fashion without explicitly storing the photon timestamps. This algorithm is amenable to future in-pixel implementations. We propose algorithms that process ED histograms to perform the 3D computer-vision tasks of estimating scene distance maps and performing visual odometry under challenging conditions such as high ambient light. Our work paves the way towards lower bandwidth and reduced in-pixel memory requirements for SPCs, making them attractive for resource-constrained 3D vision applications.
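
To make the equi-depth idea concrete, the sketch below builds an ED histogram offline by placing bin edges at timestamp quantiles, so each bin holds roughly the same photon count; narrow bins then indicate where photons concentrate, i.e., the likely laser return. The bin count and toy timestamps are assumptions for illustration, and the use of stored timestamps is exactly what the paper's online estimator avoids (that estimator is not reproduced here).

```python
import numpy as np

def equi_depth_histogram(timestamps, n_bins=8):
    """Equi-depth (ED) histogram: bin edges are timestamp quantiles, so each
    bin holds roughly the same number of photons."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(timestamps, qs)
    counts, _ = np.histogram(timestamps, bins=edges)
    return edges, counts

# toy photon timestamps: laser return around t = 37 ns plus ambient background
rng = np.random.default_rng(0)
signal = rng.normal(37.0, 0.3, size=2000)
ambient = rng.uniform(0.0, 100.0, size=8000)
edges, counts = equi_depth_histogram(np.concatenate([signal, ambient]))
# small edge spacing marks where photons concentrate, i.e., the likely return
print(np.round(np.diff(edges), 2), counts)
```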


# 97
Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment

Simon Weber · Je Hyeong Hong · Daniel Cremers

Initialization-free bundle adjustment (BA) remains largely uncharted. While the Levenberg-Marquardt algorithm is the gold-standard method for solving the BA problem, it generally relies on a good initialization. In contrast, the under-explored Variable Projection algorithm (VarPro) exhibits a wide convergence basin even without initialization. Coupled with an object-space error formulation, recent works have shown its ability to solve (small-scale) initialization-free bundle adjustment problems. We introduce Power Variable Projection (PoVar), extending a recent inverse expansion method based on power series. Importantly, we link the power series expansion to Riemannian manifold optimization. This projective framework is crucial for solving large-scale bundle adjustment problems without initialization. Using the real-world BAL dataset, we experimentally demonstrate that our solver achieves state-of-the-art results in terms of speed and accuracy. In particular, our work is the first, to our knowledge, that addresses the scalability of BA without initialization and opens new avenues for initialization-free Structure-from-Motion.


# 101
Strong Double Blind
SelfGeo: Self-supervised and Geodesic-consistent Estimation of Keypoints on Deformable Shapes

Mohammad Zohaib · Luca Cosmo · Alessio Del Bue

Unsupervised 3D keypoint estimation from Point Cloud Data (PCD) is a complex task, and it is even more challenging when the object shape is deforming: keypoints should be semantically and geometrically consistent across all 3D frames, with each keypoint anchored to a specific part of the deforming shape irrespective of intrinsic and extrinsic motion. This paper presents SelfGeo, a self-supervised method that computes persistent 3D keypoints of non-rigid objects from arbitrary PCDs without the need for human annotations. The gist of SelfGeo is to estimate keypoints across frames that respect the invariant properties of deforming bodies. Our main contribution is to enforce that keypoints deform along with the shape while keeping constant geodesic distances among them. This principle is then propagated to the design of a set of losses whose minimization makes repeatable keypoints emerge at specific semantic locations of the non-rigid shape. We show experimentally that the use of geodesic distances has a clear advantage in challenging dynamic scenes and with different classes of deforming shapes (humans and animals). Code and data will be made available upon paper acceptance.
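
The constant-geodesic-distance constraint can be written as a simple consistency penalty. The sketch below shows one plausible form of such a term; the function name, the mean-reference formulation, and the tensor shapes are assumptions rather than SelfGeo's exact losses, and it presumes that pairwise geodesic distances between the predicted keypoints have already been computed on the shape at every frame.

```python
import torch

def geodesic_consistency_loss(geo_dists):
    """Penalize variation over time of the pairwise geodesic distances
    among the K keypoints of a deforming shape.
    geo_dists: (T, K, K) per-frame geodesic distances between keypoints
               (assumed to be precomputed on the shape surface)."""
    mean = geo_dists.mean(dim=0, keepdim=True)   # time-averaged distance matrix
    return ((geo_dists - mean) ** 2).mean()      # deviation from the average

T, K = 10, 12
geo = torch.rand(T, K, K)
geo = 0.5 * (geo + geo.transpose(1, 2))          # symmetrize the toy input
print(geodesic_consistency_loss(geo))
```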


# 86
Strong Double Blind
Leveraging scale- and orientation-covariant features for planar motion estimation

Marcus Valtonen Örnhag · Alberto Jaenal

In this paper, we derive a linear constraint for planar motion leveraging scale- and orientation-covariant features, e.g., SIFT, which is used to create a novel minimal solver for planar motion requiring only a single covariant feature. We compare the proposed method to traditional point-based solvers and solvers relying on affine correspondences in controlled synthetic environments and well-established datasets for autonomous driving. The proposed solver is integrated into a modern robust estimation framework, where it is shown to accelerate the complete estimation pipeline by more than 25x compared to state-of-the-art affine-based minimal solvers, with negligible loss in precision.


# 90
Strong Double Blind
Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

Baicheng Li · Zike Yan · Dong Wu · Hanqing Jiang · Hongbin Zha

Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention due to the expressive representation power and the innovative paradigm of continual learning. However, deploying such a system in a dynamic environment has not been well studied. The challenges are intractable even for conventional algorithms, since observations from different views involving dynamic objects break the geometric and photometric consistency, whereas this consistency lays the foundation for jointly optimizing the camera pose and the map parameters. In this paper, we exploit the characteristics of continual learning and propose a novel SLAM framework for dynamic environments. While past efforts have been made to avoid catastrophic forgetting by exploiting an experience replay strategy, we view forgetting as a desirable characteristic: by adaptively controlling the replay buffer, the ambiguity caused by moving objects can be easily alleviated through forgetting. We restrain the replay of dynamic objects by introducing a continually-learned classifier for dynamic object identification. The iterative optimization of the neural map and the classifier notably improves the robustness of the SLAM system in dynamic environments. Experiments on challenging datasets verify the effectiveness of the proposed framework.


# 108
Strong Double Blind
Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision

Jinhee Kim · Taesung Kim · Choo Jaegul

Recent advances in interactive keypoint estimation methods have enhanced keypoint estimation accuracy while aiming to minimize user intervention. However, these methods still depend on user input for error correction, a significant challenge in vertebrae keypoint estimation, where densely clustered or overlapping keypoints are common. We introduce a novel two-stage approach that integrates KeyBot, a component designed to identify and correct significant errors made by existing models, into existing keypoint estimation frameworks. KeyBot analyzes current model predictions and delivers corrective feedback akin to user revision. Trained on simulated error scenarios, KeyBot effectively corrects typical errors in vertebrae keypoint estimation, thereby significantly reducing user workload. Comprehensive quantitative and qualitative evaluations on three public datasets confirm that KeyBot significantly outperforms existing methods, achieving state-of-the-art performance in interactive vertebrae keypoint estimation.


# 88
Strong Double Blind
TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly

Mengqi GUO · Chen Li · Yuyang Zhao · Gim Hee Lee

Assembling 3D objects from primitive bricks is challenging due to complex constraints and numerous possible combinations. Recent studies have demonstrated promising results on sequential LEGO brick assembly by graph modeling. However, existing approaches are class-specific and require significant computational and 3D annotation resources. In this work, we first propose a computationally efficient breadth-first search (BFS) LEGO-Tree structure to model sequential assembly actions. Based on the LEGO-Tree structure, we then design a class-agnostic tree-transformer framework to predict assembly actions from multi-view images. A major challenge is the costly acquisition of step-wise action labels. We address this by leveraging synthetic-to-real transfer learning. Specifically, our model pre-trains on synthetic data with full action label supervision. We further circumvent the requirement for real data action labels by introducing an action-to-silhouette projection for self-supervision. With no real data annotation, our model outperforms existing methods with 3D supervision by 7.8% and 11.3% in mIoU on the MNIST and ModelNet Construction datasets, respectively.


# 99
SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction

Yuliang Guo · Abhinav Kumar · Cheng Zhao · Ruoyu Wang · Xinyu Huang · Liu Ren

Monocular 3D reconstruction for categorical objects heavily relies on accurately perceiving each object's pose. While gradient-based optimization within a NeRF framework updates initially given poses, this paper highlights that such a scheme fails when the initial pose even moderately deviates from the true pose. Consequently, existing methods often depend on a third-party 3D object detector to provide an initial object pose, leading to increased complexity and generalization issues. To address these challenges, we present UPNeRF, a Unified network integrating Pose estimation and NeRF-based reconstruction, bringing us closer to real-time monocular 3D object reconstruction. UPNeRF decouples the object's dimension estimation and pose refinement to resolve the scale-depth ambiguity, and introduces an effective projected-box representation that generalizes well across different domains. By using a dedicated pose estimator that smoothly integrates into an object-centric NeRF, UPNeRF is free from external 3D detectors. UPNeRF achieves state-of-the-art results in both reconstruction and pose estimation tasks on the nuScenes dataset. Furthermore, UPNeRF exhibits exceptional cross-dataset generalization on the KITTI and Waymo datasets, surpassing prior methods with up to a 50% reduction in rotation and translation error.


# 102
VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

Guénolé Fiche · Simon Leglaive · Xavier Alameda-Pineda · Antonio Agudo · Francesc Moreno

Previous works on Human Pose and Shape Estimation (HPSE) from RGB images can be broadly categorized into two main groups: parametric and non-parametric approaches. Parametric techniques leverage a low-dimensional statistical body model for realistic results, whereas recent non-parametric methods achieve higher precision by directly regressing the 3D coordinates of the human body mesh. This work introduces a novel paradigm to address the HPSE problem, involving a low-dimensional discrete latent representation of the human mesh and framing HPSE as a classification task. Instead of predicting body model parameters or 3D vertex coordinates, we focus on predicting the proposed discrete latent representation, which can be decoded into a registered human mesh. This paradigm offers two key advantages. First, predicting a low-dimensional discrete representation confines our predictions to the space of anthropomorphic poses and shapes even when little training data is available. Second, by framing the problem as a classification task, we can harness the discriminative power inherent in neural networks. The proposed model, VQ-HPS, predicts the discrete latent representation of the mesh. The experimental results demonstrate that VQ-HPS outperforms the current state-of-the-art non-parametric approaches while yielding results as realistic as those produced by parametric methods when trained with little data. VQ-HPS also shows promising results when trained on large-scale datasets, highlighting the significant potential of the classification approach for HPSE.
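
The classification framing can be sketched in a few lines: predict one codebook index per latent token, train with cross-entropy, and decode the quantized sequence into mesh vertices. All sizes, module names, and the linear decoder below are illustrative assumptions, not VQ-HPS's actual architecture or codebook.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; not the paper's actual configuration.
num_tokens, codebook_size, code_dim, num_vertices, feat_dim = 16, 256, 64, 778, 1024

classifier = nn.Linear(feat_dim, num_tokens * codebook_size)   # image feature -> per-token logits
codebook = nn.Embedding(codebook_size, code_dim)               # (frozen) VQ codebook
decoder = nn.Linear(num_tokens * code_dim, num_vertices * 3)   # token sequence -> mesh vertices

def hpse_as_classification(image_feat, target_indices):
    """Predict one codebook index per latent token with a cross-entropy loss,
    then decode the quantized token sequence into a registered mesh."""
    logits = classifier(image_feat).view(-1, num_tokens, codebook_size)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), target_indices.flatten())
    codes = codebook(logits.argmax(-1))                        # (B, num_tokens, code_dim)
    mesh = decoder(codes.flatten(1)).view(-1, num_vertices, 3)
    return loss, mesh

loss, mesh = hpse_as_classification(
    torch.randn(2, feat_dim),
    torch.randint(0, codebook_size, (2, num_tokens)))
print(loss.item(), mesh.shape)                                 # scalar loss, torch.Size([2, 778, 3])
```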


# 106
Strong Double Blind
Human Pose Recognition via Occlusion-Preserving Abstract Images

Saad Manzur · Wayne B Hayes

Existing 2D-to-3D pose lifting networks suffer from poor performance in cross-dataset benchmarks. Although 2D keypoints joined by "stick-figure" limbs are the dominant representation, stick figures do not preserve the occlusion information inherent in an image, resulting in significant ambiguities that are ruled out when occlusion information is present. In addition, datasets with ground-truth 3D poses are much harder to obtain than similar human-annotated 2D datasets. To address these issues, we propose replacing stick figures with abstract images: figures with opaque limbs that preserve occlusion information while implicitly encoding joint locations. We then break the pose estimation task into two stages: (1) generating an abstract image from a real image, and (2) garnering the pose from the abstract image. Crucially, given the ground-truth 3D keypoints for a particular pose, we can synthesize an arbitrary number of abstract images of the same pose as seen from arbitrary cameras, even without a part map. Given a set of 3D ground-truth keypoints, this allows training Stage 2 on an unlimited dataset without over-training, which in turn allows us to correctly interpret poses from arbitrary viewpoints not included in the original dataset. Additionally, the unlimited training of Stage 2 allows good generalization across datasets, demonstrated through a significant improvement in cross-dataset benchmarks, while still showing competitive performance in same-dataset benchmarks.


# 109
Strong Double Blind
RT-Pose: A 4D Radar-Tensor based 3D Human Pose Estimation and Localization Benchmark

Yuan-Hao Ho · Jen-Hao Cheng · Sheng Yao Kuan · Zhongyu Jiang · Wenhao Chai · Hsiang-Wei Huang · Chih-Lung Lin · Jenq-Neng Hwang

Traditional methods for human localization and pose estimation (HPE), which mainly rely on RGB images as an input modality, face substantial limitations in real-world applications due to privacy concerns. In contrast, radar-based HPE methods emerge as a promising alternative, characterized by distinctive attributes such as through-wall recognition and privacy preservation, rendering them more conducive to practical deployments. This paper presents a Radar Tensor-based human pose (RT-Pose) dataset and an open-source benchmarking framework. The RT-Pose dataset comprises 4D radar tensors, LiDAR point clouds, and RGB images, collected for a total of 72k frames across 240 sequences with actions at six different complexity levels. The 4D radar tensor provides raw spatio-temporal information, differentiating it from other radar point cloud-based datasets. We develop a semi-automatic annotation process, which uses RGB images and LiDAR point clouds to accurately label 3D human skeletons. In addition, we propose HRRadarPose, the first single-stage architecture that extracts a high-resolution representation of 4D radar tensors in 3D space to aid human keypoint estimation. HRRadarPose outperforms previous radar-based HPE work on the RT-Pose benchmark. The overall HRRadarPose performance on the RT-Pose dataset, as reflected in a mean per joint position error (MPJPE) of 9.91 cm, indicates the persistent challenges in achieving accurate HPE in complex real-world scenarios.


# 107
Strong Double Blind
6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry

Sungho Chun · Ju Yong Chang

This study addresses the nuanced challenge of estimating head translations within the context of six-degrees-of-freedom (6DoF) head pose estimation, placing emphasis on this aspect over the more commonly studied head rotations. Identifying a gap in existing methodologies, we recognized the underutilized potential synergy between facial geometry and head translation. To bridge this gap, we propose a novel approach called the head Translation, Rotation, and face Geometry network (TRG), which stands out for its explicit bidirectional interaction structure. This structure has been carefully designed to leverage the complementary relationship between face geometry and head translation, marking a significant advancement in the field of head pose estimation. Our contributions also include the development of a strategy for estimating bounding box correction parameters and a technique for aligning landmarks to the image. Both of these innovations demonstrate superior performance in 6DoF head pose estimation tasks. Extensive experiments conducted on the ARKitFace and BIWI datasets confirm that the proposed method outperforms current state-of-the-art techniques. The code will be released.


# 105
Strong Double Blind
HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

Eugene Valassakis · Guillermo Garcia-Hernando

Predicting camera-space hand meshes from single RGB images is crucial for enabling realistic hand interactions in 3D virtual and augmented worlds. Previous works typically divided the task into two stages: given a cropped image of the hand, predict meshes in relative coordinates, followed by lifting these predictions into camera space in a separate and independent stage, often resulting in the loss of valuable contextual and scale information. To prevent the loss of these cues, we propose unifying these two stages into an end-to-end solution that addresses the 2D-3D correspondence problem. This solution enables back-propagation from camera space outputs to the rest of the network through a new differentiable global positioning module. Additionally, we introduce an image rectification step that harmonizes both the training dataset and the input image as if they were acquired with the same camera, helping to alleviate the inherent scale-depth ambiguity of the problem. We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches across three public benchmarks. The code and models will be made available.


# 122
On the Utility of 3D Hand Poses for Action Recognition

Md Salman Shamil · Dibyadip Chatterjee · Fadime Sener · Shugao Ma · Angela Yao

3D hand poses are an under-explored modality for action recognition. Poses are compact yet informative and can greatly benefit applications with limited compute budgets. However, poses alone offer an incomplete understanding of actions, as they cannot fully capture objects and environments with which humans interact. To efficiently model hand-object interactions, we propose HandFormer, a novel multimodal transformer. HandFormer combines 3D hand poses at a high temporal resolution for fine-grained motion modeling with sparsely sampled RGB frames for encoding scene semantics. Observing the unique characteristics of hand poses, we temporally factorize hand modeling and represent each joint by its short-term trajectories. This factorized pose representation combined with sparse RGB samples is remarkably efficient and achieves high accuracy. Unimodal HandFormer with only hand poses outperforms existing skeleton-based methods at 5x fewer FLOPs. With RGB, we achieve new state-of-the-art performance on Assembly101 and H2O with significant improvements in egocentric action recognition.


# 120
Strong Double Blind
Multi-Person Pose Forecasting with Individual Interaction Perceptron and Prior Learning

Peng Xiao · Yi Xie · Xuemiao Xu · Weihong Chen · Huaidong Zhang

Human pose forecasting is a major problem in human intention comprehension that can be addressed by learning from historical poses via deep methods. However, existing methods often fail to model each person's role in the event in multi-person scenes. This leads to limited performance in complicated scenes where varied interactions happen at the same time. In this paper, we introduce the Interaction-Aware Pose Forecasting Transformer (IAFormer) framework to better learn interaction features. With the key insight that an event often involves only part of the people in the scene, we design the Interaction Perceptron Module (IPM) to evaluate the human-to-event interaction level. With this interaction evaluation, the human-independent features are extracted with the attention mechanism for interaction-aware forecasting. In addition, an Interaction Prior Learning Module (IPLM) is presented to learn and accumulate prior knowledge of high-frequency interactions, encouraging semantic pose forecasting rather than simple trajectory pose forecasting. We conduct experiments on the CMU-Mocap, UMPM, CHI3D, and Human3.6M datasets. The results demonstrate that our method significantly outperforms state-of-the-art approaches in scenarios with varying numbers of people.


# 118
ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation

Guanxing Lu · Shiyi Zhang · Ziwei Wang · Changliu Liu · Jiwen Lu · Yansong Tang

Performing language-conditioned robotic manipulation tasks in unstructured environments is in high demand for general-purpose intelligent robots. Conventional robotic manipulation methods usually learn a semantic representation of the observation for action prediction, which ignores the scene-level spatiotemporal dynamics needed for human goal completion. In this paper, we propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction. Specifically, we first formulate the dynamic Gaussian Splatting framework that infers semantic propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate ManiGaussian on 10 RLBench tasks with 166 variations, and the results demonstrate that our framework outperforms state-of-the-art methods by 13.1% in average success rate.


# 93
Strong Double Blind
Revisit Self-supervision with Local Structure-from-Motion

Shengjie Zhu · Xiaoming Liu

Both self-supervised depth estimation and Structure-from-Motion (SfM) recover scene depth from RGB videos. Despite sharing a similar objective, the two approaches are disconnected. Prior self-supervision works backpropagate losses defined within immediately neighboring frames. Instead of learning-through-loss, this work proposes an alternative scheme that performs local SfM. First, with calibrated RGB or RGB-D images, we employ a depth and correspondence estimator to infer depthmaps and pair-wise correspondence maps. Then, a novel bundle-RANSAC-adjustment algorithm jointly optimizes camera poses and one depth adjustment for each depthmap. Finally, we fix camera poses and employ a NeRF, however without a neural network, for dense triangulation and geometric verification. Poses, depth adjustments, and triangulated sparse depths are our outputs. For the first time, we show that self-supervision within 5 frames already benefits SoTA supervised depth and correspondence models. Despite being self-supervised, we outperform COLMAP in pose accuracy and robustness. Finally, our method enables NeRF over arbitrarily short videos. Codes and models will be released.


# 96
AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation

Yangchao Wu · Tian Yu Liu · Hyoungseob Park · Stefano Soatto · DONG LAO · Alex Wong

Unsupervised depth completion methods are trained by minimizing sparse depth and image reconstruction error. Block artifacts from resampling, intensity saturation, and occlusions are among the many undesirable by-products of common data augmentation schemes that affect image reconstruction quality, and thus the training signal. Hence, typical augmentations on images, viewed as essential to training pipelines in other vision tasks, have seen limited use beyond small image intensity changes and flipping. The sparse depth modality has seen even less use, as intensity transformations alter the scale of the 3D scene, and geometric transformations may decimate the sparse points during resampling. We propose a method that unlocks a wide range of previously-infeasible geometric augmentations for unsupervised depth completion. This is achieved by reversing, or "undoing", the geometric transformations applied to the coordinates of the output depth, warping the depth map back to the original reference frame. This enables computing the reconstruction losses using the original images and sparse depth maps, eliminating the pitfalls of naive loss computation on the augmented inputs. This simple yet effective strategy allows us to scale up augmentations to boost performance. We demonstrate our method on indoor (VOID) and outdoor (KITTI) datasets, where we improve upon recent methods by an average of 11.75% across both datasets. Code available: https://anonymous.4open.science/r/ECCV2024_AugUndo
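
The "undo" idea can be illustrated with a toy example: apply a geometric augmentation before the network, then invert it on the predicted depth so the loss is computed against the original, unaugmented image and sparse depth. The flip-and-downsample transform, the stand-in network, and the interpolation modes below are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def augment(image):
    """Example geometric augmentation: horizontal flip + 0.5x downsample."""
    return F.interpolate(torch.flip(image, dims=[-1]), scale_factor=0.5,
                         mode="bilinear", align_corners=False)

def undo(depth, out_size):
    """Undo the augmentation on the predicted depth, mapping it back to the
    original reference frame so losses can use the unaugmented image and
    sparse depth."""
    depth = F.interpolate(depth, size=out_size, mode="nearest")
    return torch.flip(depth, dims=[-1])

image = torch.rand(1, 3, 192, 640)
model = lambda x: x.mean(dim=1, keepdim=True)   # stand-in depth network
pred_aug = model(augment(image))                # predict on the augmented input
pred = undo(pred_aug, image.shape[-2:])         # back to the original frame
print(pred.shape)                               # torch.Size([1, 1, 192, 640])
```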


# 91
Strong Double Blind
High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior

Shen Jianbing · Wencheng Han

In the area of self-supervised monocular depth estimation, models that utilize rich-resource inputs, such as high-resolution and multi-frame inputs, typically achieve better performance than models that use an ordinary single-image input. However, these rich-resource inputs may not always be available, limiting the applicability of such methods in general scenarios. In this paper, we propose the Rich-resource Prior Depth estimator (RPrDepth), which only requires a single input image during the inference phase but can still produce highly accurate depth estimations comparable to rich-resource based methods. Specifically, we treat rich-resource inputs as prior information and extract rich-resource features from them during the training phase. When estimating the depth for a single image, we search for similar pixels in the rich-resource features and use them as prior information to estimate the depth. Experimental results demonstrate that our model outperforms other single-image models and can achieve comparable or even better performance than models with rich-resource inputs, while using only a low-resolution single-image input.


# 87
Strong Double Blind
Weakly-supervised Camera Localization by Ground-to-satellite Image Registration

Yujiao Shi · HONGDONG LI · Akhil Perincherry · Ankit Vora

The ground-to-satellite image matching/retrieval was initially proposed for city-scale ground camera localization. This work addresses the problem of improving camera pose accuracy by ground-to-satellite image matching after a coarse location and orientation have been obtained, either from the city-scale retrieval or from consumer-level GPS and compass sensors. Existing learning-based methods for solving this task require accurate GPS labels of ground images for network training. However, obtaining such accurate GPS labels is difficult, often requiring an expensive Real Time Kinematics (RTK) setup and suffering from signal occlusion, multi-path signal disruptions, etc. To alleviate this issue, this paper proposes a weakly supervised learning strategy for ground-to-satellite image registration when only noisy pose labels for ground images are available for network training. It derives positive and negative satellite images for each ground image and leverages contrastive learning to learn feature representations for ground and satellite images useful for translation estimation. We also propose a self-supervision strategy for cross-view image relative rotation estimation, which trains the network by creating pseudo query and reference image pairs. Experimental results show that our weakly supervised learning strategy achieves the best performance on cross-area evaluation compared to recent state-of-the-art methods that are reliant on accurate pose labels for supervision.


# 72
Strong Double Blind
Benchmarking the Robustness of Cross-view Geo-localization Models

Qingwang Zhang · Yingying Zhu

Cross-view geo-localization serves as a viable alternative for providing geographical location information when GPS signals are unstable or unavailable, by matching ground images with geo-tagged aerial image databases. While significant progress has been made on common benchmarks like CVUSA and CVACT, robustness against real-world environmental challenges such as adverse weather or sensor noise has received little comprehensive consideration. This deficiency poses a significant challenge for deploying this technology in safety-critical domains like autonomous driving and robot navigation. To the best of our knowledge, there is currently no specialized benchmark for evaluating the robustness of cross-view geo-localization. To comprehensively and fairly evaluate the robustness of cross-view geo-localization models in real-world scenarios, we introduce 16 common types of data corruption. By synthesizing these corruptions on public datasets, we establish two fine-grained corruption robustness benchmarks (CVUSA-C and CVACTval-C) and three comprehensive corruption robustness benchmarks (CVUSA-C-ALL, CVACTval-C-ALL, and CVACT_test-C-ALL), covering approximately 1.5 million corrupted images. Subsequently, we conduct large-scale experiments on various cross-view geo-localization models to evaluate their robustness in corrupted environments and derive novel insights. Finally, we explore two data augmentation strategies as potential solutions to enhance model robustness. Combined with the proposed training strategies, these approaches effectively enhance the robustness of multiple models. We provide the benchmarks and code in the supplementary materials to support future studies.


# 68
Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance

I-HSIANG CHEN · Wei-Ting Chen · Yu-Wei Liu · Ming-Hsuan Yang · Sy-Yen Kuo

Crowd counting and localization have become increasingly important in computer vision due to their wide-ranging applications. While point-based strategies have been widely used in crowd counting methods, they face a significant challenge, i.e., the lack of an effective learning strategy to guide the matching process. This deficiency leads to instability in matching point proposals to target points, adversely affecting overall performance. To address this issue, we introduce an effective approach to stabilize the proposal-target matching in point-based methods. We propose Auxiliary Point Guidance (APG) to provide clear and effective guidance for proposal selection and optimization, addressing the core issue of matching uncertainty. Additionally, we develop Implicit Feature Interpolation (IFI) to enable adaptive feature extraction in diverse crowd scenarios, further enhancing the model's robustness and accuracy. Extensive experiments demonstrate the effectiveness of our approach, showing significant improvements in crowd counting and localization performance, particularly under challenging conditions. The source codes and trained models will be made publicly available.


# 73
Strong Double Blind
Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

Zhili Chen · Shuangjie Xu · Maosheng Ye · Zian Qian · Xiaoyi Zou · Dit-Yan Yeung · Qifeng Chen

The Bird's-Eye-View (BEV) representation is a critical factor that directly impacts the 3D object detection performance, but the traditional BEV grid representation induces quadratic computational cost as the spatial resolution grows. To address this limitation, we present a new camera-based 3D object detector with high-resolution vector representation: VectorFormer. The presented high-resolution vector representation is combined with the lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at a high resolution through our two novel modules: vector scattering and gathering. To this end, the learned vector representation with richer scene contexts can serve as the decoding query for final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we investigate query-BEV-based methods incorporated with our proposed vector representation and observe a consistent performance improvement. Our source code will be made publicly available.


# 74
GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Ziying Song · Lei Yang · Shaoqing Xu · Lin Liu · Dongyang Xu · Caiyan Jia · Feiyang Jia · Li Wang

Integrating LiDAR and camera information into Bird's-Eye-View (BEV) representations has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to an inaccurate calibration relationship between LiDAR and the camera sensor. Such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a robust fusion framework called GraphBEV. To address errors caused by inaccurate point cloud projection, we introduce a LocalAlign module that employs neighbor-aware depth features via graph matching. Additionally, we propose a GlobalAlign module to rectify the misalignment between LiDAR and camera BEV features. Our GraphBEV framework achieves state-of-the-art performance, with an mAP of 70.1%, surpassing BEVFusion by 1.6% on the nuScenes validation set. Importantly, our GraphBEV outperforms BEVFusion by 8.3% under conditions with misalignment noise.


# 75
Strong Double Blind
Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training

qiangqiang wu · Yan Xia · Jia Wan · Antoni Chan

3D single object tracking (SOT) is an essential task in autonomous driving and robotics. However, learning robust 3D SOT trackers remains challenging due to the limited category-specific point cloud data and the inherent sparsity and incompleteness of LiDAR scans. To tackle these issues, we propose a unified 3D SOT framework that leverages 3D generative pre-training and learns robust 3D matching abilities from 2D pre-trained foundation trackers. Our framework features a target-matching architecture consistent with widely used 2D trackers, facilitating the transfer of 2D matching knowledge. Specifically, we first propose a lightweight Target-Aware Projection (TAP) module, allowing the pre-trained 2D tracker to work well on the projected point clouds without further fine-tuning. We then propose a novel IoU-guided matching-distillation framework that utilizes powerful 2D pre-trained trackers to guide 3D matching learning in the 3D tracker, i.e., the 3D template-to-search matching should be consistent with its corresponding 2D template-to-search matching obtained from the 2D pre-trained trackers. Our designs are applied to two mainstream 3D SOT frameworks: memory-less Siamese and contextual memory-based approaches, which are respectively named SiamDisst and MemDisst. Extensive experiments show that SiamDisst and MemDisst achieve state-of-the-art performance on the KITTI, Waymo Open Dataset, and nuScenes benchmarks, while running at above-real-time speeds of 25 and 90 FPS on a single RTX 3090 GPU. The code will be made publicly available.


# 67
Strong Double Blind
LEROjD: Lidar Extended Radar-Only Object Detection

Patrick Palmer · Martin Krüger · Stefan Schütte · Richard Altendorfer · Ganesh Adam · Torsten Bertram

Accurate 3D object detection is vital for automated driving perception. While lidar sensors are well suited for this task, they are expensive and have limitations in adverse weather conditions. 3+1D imaging radar sensors offer a cost-effective, robust alternative but face challenges due to their low resolution and high measurement noise. Existing 3+1D imaging radar datasets include both radar and lidar data, enabling cross-modal model improvements. Although lidar must not be used during inference, it can aid the training of a radar-only object detector. We explore two strategies to transfer knowledge from the lidar to the radar domain and to radar-only object detectors: 1. multi-stage training with sequential lidar point cloud thin-out, and 2. cross-modal knowledge distillation. In the multi-stage process, three thin-out methods are examined. Our results show significant performance gains of up to 4.2 percentage points in mean Average Precision with multi-stage training, and up to 3.9 percentage points with knowledge distillation by initializing the student with the teacher's weights. The main benefit of these approaches is their applicability to other 3D object detection networks without altering their architecture, as we show by analyzing them on two different object detectors.
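
As a toy illustration of the first strategy, the sketch below thins out the lidar point cloud more aggressively at each training stage until only radar remains. The random thin-out and the particular keep-ratio schedule are assumptions for illustration; the paper examines three thin-out methods, which are not reproduced here.

```python
import numpy as np

def thin_out(points, keep_ratio, rng):
    """Randomly keep a fraction of the lidar points (one possible thin-out variant)."""
    n = points.shape[0]
    idx = rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False)
    return points[idx]

rng = np.random.default_rng(0)
lidar = rng.normal(size=(20000, 4))          # x, y, z, intensity
# multi-stage training: start lidar-heavy, end radar-only (ratio 0)
for stage, ratio in enumerate([1.0, 0.5, 0.1, 0.0]):
    pts = thin_out(lidar, ratio, rng) if ratio > 0 else np.empty((0, 4))
    print(f"stage {stage}: {pts.shape[0]} lidar points fed alongside radar")
```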


# 77
Strong Double Blind
Towards Stable 3D Object Detection

Jiabao Wang · Qiang Meng · Guochao Liu · Liujiang Yan · Ke Wang · Ming-Ming Cheng · Qibin Hou

In autonomous driving, the temporal stability of 3D object detection greatly impacts driving safety. However, detection stability cannot be assessed by existing metrics such as mAP and MOTA, and consequently is less explored by the community. To bridge this gap, this work proposes the Stability Index (SI), a new metric that can comprehensively evaluate the stability of 3D detectors in terms of confidence, box localization, extent, and heading. By benchmarking state-of-the-art object detectors on the Waymo Open Dataset, SI reveals interesting properties of object stability that have not been previously discovered by other metrics. To help models improve their stability, we further introduce a general and effective training strategy, called Prediction Consistency Learning (PCL). PCL essentially encourages the prediction consistency of the same objects under different timestamps and augmentations, leading to enhanced detection stability. Furthermore, we examine the effectiveness of PCL with the widely used CenterPoint, and achieve a remarkable SI of 86.00 for the vehicle class, surpassing the baseline by 5.48. We hope our work can serve as a reliable baseline and draw the community's attention to this crucial issue in 3D object detection. Codes will be made publicly available.
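
A consistency objective in the spirit of PCL can be sketched as follows, assuming predictions for the same objects under two timestamps or augmentations have already been matched; the specific terms, weights, and box parameterization are assumptions rather than the paper's exact formulation.

```python
import torch

def prediction_consistency_loss(boxes_a, scores_a, boxes_b, scores_b):
    """Encourage matched predictions of the same objects to agree in box
    parameters and confidence across two views (timestamps/augmentations).
    boxes_*:  (M, 7) matched boxes (x, y, z, l, w, h, yaw)
    scores_*: (M,)   matched confidences
    """
    box_term = torch.nn.functional.smooth_l1_loss(boxes_a, boxes_b)
    conf_term = (scores_a - scores_b).abs().mean()
    return box_term + conf_term

M = 16
print(prediction_consistency_loss(torch.rand(M, 7), torch.rand(M),
                                  torch.rand(M, 7), torch.rand(M)))
```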


# 76
ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

Jinke Li · Xiao He · Chonghua Zhou · Xiaoqiang Cheng · Yang Wen · Dan Zhang

3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantifying the physical space into a grid map. The widely adopted projection-first deformable attention, efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose our learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, such as map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The codes and benchmark will be released soon.


# 92
EgoPet: Egomotion and Interaction Data from an Animal's Perspective

Amir Bar · Arya Bakhtiar · Danny L Tran · Antonio Loquercio · Jathushan Rajasegaran · yann lecun · Amir Globerson · Trevor Darrell

Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce EgoPet, a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource for robotic quadruped locomotion, showing that models trained on EgoPet outperform those trained on prior datasets. This work provides evidence that today's pets could be a valuable resource for training future AI systems and robotic assistants.


# 300
WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

Jiachen Lu · Ze Huang · Zeyu Yang · Zhang Jiahui · Li Zhang

Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Due to the limitations in diversity and challenges in handling lighting conditions, traditional rendering-based methods are increasingly being supplanted by diffusion-based methods. However, a significant challenge in diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address these challenges, we introduce an additional explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). This system is specifically designed to leverage a 4D world volume as a foundational element for video generation. Our model operates in two distinct phases: (i) envisioning the future 4D temporal world volume based on vehicle control sequences, and (ii) generating multi-camera videos, informed by this envisioned 4D temporal world volume and sensor interconnectivity. The incorporation of the 4D world volume empowers WoVoGen not only to generate high-quality street-view videos in response to vehicle control inputs but also to facilitate scene editing tasks.


# 81
Strong Double Blind
Beyond the Data Imbalance: Employing the Heterogeneous Datasets for Vehicle Maneuver Prediction

Hyeongseok Jeon · Sanmin Kim · Abi Rahman Syamil · Junsoo Kim · Dongsuk Kum

Predicting the maneuvers of surrounding vehicles is imperative for the safe navigation of autonomous vehicles. However, naturalistic driving datasets tend to be highly imbalanced, with a bias towards the "going straight" maneuver. Consequently, learning and accurately predicting turning maneuvers pose significant challenges. In this study, we propose a novel two-stage maneuver learning method that can overcome such strong biases by leveraging two heterogeneous datasets in a complementary manner. In the first training phase, we utilize an intersection-centric dataset characterized by a balanced distribution of maneuver classes to learn the representations of each maneuver. Subsequently, in the second training phase, we incorporate an ego-centric driving dataset to account for various geometrical road shapes, transferring the knowledge of geometric diversity to the maneuver prediction model. To facilitate this, we constructed an in-house intersection-centric trajectory dataset with a well-balanced maneuver distribution. By harnessing the power of heterogeneous datasets, our framework significantly improves maneuver prediction performance, particularly for minority maneuver classes such as turning maneuvers. The dataset will be made publicly available soon.


# 71
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

Yuanhui Huang · Wenzhao Zheng · Yunpeng Zhang · Jie Zhou · Jiwen Lu

3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales and thus lead to unbalanced allocation of resources. To address this, we propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian represents a flexible region of interest and its semantic features. We aggregate information from images through the attention mechanism and iteratively refine the properties of 3D Gaussians including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which only aggregates the neighboring Gaussians for a certain position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8% - 24.8% of their memory consumption.
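
A toy sketch of the Gaussian-to-voxel splatting step described above: each voxel aggregates semantic features only from nearby Gaussians, weighted by their densities. The tensor names, neighbour radius, and normalization below are assumptions for illustration, not the paper's implementation.

```python
import torch

def gaussian_to_voxel_splat(means, covs_inv, semantics, voxel_centers, radius=2.0):
    """Toy Gaussian-to-voxel splatting: each voxel only aggregates Gaussians
    within `radius` (the neighbour restriction mentioned in the abstract).
    Shapes: means (G, 3), covs_inv (G, 3, 3), semantics (G, C),
    voxel_centers (V, 3). Returns per-voxel semantic logits (V, C).
    """
    diff = voxel_centers[:, None, :] - means[None, :, :]        # (V, G, 3)
    close = diff.norm(dim=-1) < radius                          # neighbour mask
    # Mahalanobis-style Gaussian weight exp(-0.5 * d^T Sigma^{-1} d)
    maha = torch.einsum('vgi,gij,vgj->vg', diff, covs_inv, diff)
    w = torch.exp(-0.5 * maha) * close                          # (V, G)
    return (w[..., None] * semantics[None]).sum(1) / (w.sum(1, keepdim=True) + 1e-6)

G, V, C = 64, 100, 18
out = gaussian_to_voxel_splat(torch.randn(G, 3), torch.eye(3).repeat(G, 1, 1),
                              torch.randn(G, C), torch.randn(V, 3))
print(out.shape)  # torch.Size([100, 18])
```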


# 78
ADMap: Anti-disturbance Framework for Vectorized HD Map Construction

Haotian Hu · Fanyi Wang · Yaonong Wang · Laifeng Hu · Jingwei Xu · Zhiwang Zhang

In the field of autonomous driving, online High-definition (HD) map construction is crucial for planning tasks. Recent studies have developed several high-performance HD map construction models to meet the demand. However, the point sequences generated by recent HD map construction models are jittery or jagged due to prediction bias and impact subsequent tasks. To mitigate this jitter issue, we propose the Anti-Disturbance Map construction framework (ADMap), which contains Multi-scale Perception Neck (MPN), Instance Interactive Attention (IIA), and Vector Direction Difference Loss (VDDL). By exploring the point sequence relations between and within instances in a cascading manner, our proposed ADMap effectively monitors the point sequence prediction process, and achieves state-of-the-art performance on the nuScenes and Argoverse2 datasets. Extensive results demonstrate its ability to produce stable and reliable map elements in complex and changing driving scenarios.


# 79
Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction

Bencheng Liao · Shaoyu Chen · Bo Jiang · Tianheng Cheng · Qian Zhang · Wenyu Liu · Chang Huang · Xinggang Wang

Online lane graph construction is a promising but challenging task in autonomous driving. Previous methods usually model the lane graph at the pixel or piece level and recover the lane graph by pixel-wise or piece-wise connection, which breaks down the continuity of the lane and results in suboptimal performance. Human drivers focus on and drive along continuous and complete paths instead of considering lane pieces. Autonomous vehicles also require path-specific guidance from the lane graph for trajectory planning. We argue that the path, which indicates the traffic flow, is the primitive of the lane graph. Motivated by this, we propose to model the lane graph in a novel path-wise manner, which well preserves the continuity of the lane and encodes traffic information for planning. We present a path-based online lane graph construction method, termed LaneGAP, which learns paths end-to-end and recovers the lane graph via a Path2Graph algorithm. We qualitatively and quantitatively demonstrate the superior accuracy and efficiency of LaneGAP over conventional pixel-based and piece-based methods on the challenging nuScenes and Argoverse2 datasets under controllable and fair conditions. Compared to the recent state-of-the-art piece-wise method TopoNet on the OpenLane-V2 dataset, LaneGAP still outperforms it by 1.6 mIoU, further validating the effectiveness of path-wise modeling. Abundant visualizations in the supplementary material show LaneGAP can cope with diverse traffic conditions. Code and models will be released to facilitate future research.
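
A hypothetical sketch of what a Path2Graph-style conversion could look like: predicted lane paths (polylines) are merged into a graph by snapping nearby waypoints to shared nodes and collecting directed edges along each path. The merge rule and threshold are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def path2graph(paths, merge_dist=1.0):
    """Toy path-to-graph conversion: waypoints within `merge_dist` of an
    existing node are snapped to it; directed edges follow each path.
    paths: list of (N_i, 2) arrays of lane-path waypoints.
    Returns (nodes, edges): nodes (M, 2) array, edges as a set of (i, j).
    """
    nodes, edges = [], set()

    def node_id(p):
        for i, q in enumerate(nodes):
            if np.linalg.norm(p - q) < merge_dist:
                return i
        nodes.append(p)
        return len(nodes) - 1

    for path in paths:
        ids = [node_id(p) for p in path]
        edges.update(zip(ids[:-1], ids[1:]))
    return np.stack(nodes), edges

# two overlapping paths share a node where they intersect
paths = [np.array([[0., 0.], [5., 0.], [10., 0.]]),
         np.array([[5., 0.1], [5., 5.], [5., 10.]])]
nodes, edges = path2graph(paths)
print(len(nodes), sorted(edges))
```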


# 83
Strong Double Blind
CarFormer: Self-Driving with Learned Object-Centric Representations

Shadi Hamdan · Fatma Guney

The choice of representation plays a key role in self-driving. Bird’s eye view (BEV) representations have shown remarkable performance in recent years. In this paper, we propose to learn object-centric representations in BEV to distill a complex scene into more actionable information for self-driving. We first learn to place objects into slots with a slot attention model on BEV sequences. Based on these object-centric representations, we then train a transformer to learn to drive as well as reason about the future of other vehicles. We found that object-centric slot representations outperform both scene-level and object-level approaches that use the exact attributes of objects. Slot representations naturally incorporate information about objects from their spatial and temporal context such as position, heading, and speed without explicitly providing it. Our model with slots achieves increased coverage of the provided routes and, consequently, a higher driving score, with a lower variance across multiple runs, affirming slots as a reliable alternative in object-centric approaches. Additionally, we validate our model’s performance as a world model through forecasting experiments, demonstrating its capability to accurately predict future slot representations.


# 82
Strong Double Blind
DySeT: a Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction

MOZHGAN POURKESHAVARZ · Arielle Zhang · Amir Rasouli

The lack of generalization capability of behavior prediction models for autonomous vehicles is a crucial concern for safe motion planning. One way to address this is via self-supervised pre-training through masked trajectory prediction. However, the existing models rely on uniform random sampling of tokens, which is sub-optimal because it implies that all components of driving scenes are equally informative. In this paper, to enable more robust representation learning, we introduce a dynamic masked self-distillation approach to identify and utilize informative aspects of the scenes, particularly those corresponding to complex driving behaviors, such as overtaking. Specifically, for targeted sampling, we propose a dynamic method that prioritizes tokens, such as trajectory or lane segments, based on their informativeness. The latter is determined via an auxiliary network that estimates token distributions. Through sampler optimization, more informative tokens are rewarded and selected as visible based on the policy gradient algorithm adopted from reinforcement learning. In addition, we propose a masked self-distillation approach to transfer knowledge from fully visible to masked scene representations. The distillation process not only enriches the semantic information within the visible token set but also progressively refines the sampling process. Further, we use an integrated training regime to enhance the model's ability to learn meaningful representations from informative tokens. Our extensive evaluation on two large-scale trajectory prediction datasets demonstrates the superior performance of the proposed method and its improved prediction robustness across different scenarios.


# 85
NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving

William Ljungbergh · Adam Tonderski · Joakim Johnander · Holger Caesar · Kalle Åström · Michael Felsberg · Christoffer Petersson

We present a versatile NeRF-based simulator for testing autonomous driving (AD) software systems, designed with a focus on sensor-realistic closed-loop evaluation and the creation of safety-critical scenarios. The simulator learns from sequences of real-world driving sensor data and enables reconfigurations and renderings of new, unseen scenarios. In this work, we use our simulator to test the responses of AD models to safety-critical scenarios inspired by the European New Car Assessment Programme (Euro NCAP). Our evaluation reveals that, while state-of-the-art end-to-end planners excel in nominal driving scenarios in an open-loop setting, they exhibit critical flaws when navigating our safety-critical scenarios in a closed-loop setting. This highlights the need for advancements in the safety and real-world usability of end-to-end planners. By publicly releasing our simulator and scenarios as an easy-to-run evaluation suite, we invite the research community to explore, refine, and validate their AD models in controlled, yet highly configurable and challenging sensor-realistic environments.


# 70
Strong Double Blind
Visual Relationship Transformation

Xiaoyu Xu · Jiayan Qiu · Baosheng Yu · Zhou Wang

What will be the relationships between objects in a novel view? We strive to answer this question by investigating a new visual cognition task, termed visual relationship transformation or VRT. Unlike the prior visual relationship detection task, which works on images of visible views, VRT aims to predict the relationships in unseen novel views from a single observed source view. To solve VRT, we propose an end-to-end deep approach that, given an observed view image and inter-view transformations, learns to predict the relationships in novel views. Specifically, we introduce an equivariant graph neural network to predict the relationships between objects in novel views, which is achieved by enforcing the transformation equivariance of the learned relationship representations. Simultaneously, a relationship presentness mask is learned to prune invisible relationships, thus enabling visible-relationship prediction in novel views. In this way, VRT provides supplementary cues for accomplishing novel-view-related tasks, such as visual grounding (VG), novel view synthesis (NVS), and pedestrian intention estimation (PIE). In the experiments, adopting VRT as a plug-in module results in considerable performance improvements in VG, NVS, and PIE across all datasets.


# 95
Strong Double Blind
Local All-Pair Correspondence for Point Tracking

Seokju Cho · Jiahui Huang · Jisu Nam · Honggyu An · Seungryong Kim · Joon-Young Lee

We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches in this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image, which often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 5x faster than the current state-of-the-art.
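
The local 4D correlation can be pictured as correlating every feature in a window around the query point with every feature in a window around the candidate target point. A minimal sketch with an illustrative window size and no learned components:

```python
import torch

def local_4d_correlation(feat_q, feat_t, q_xy, t_xy, r=3):
    """Toy local all-pair (4D) correlation: correlate each feature inside a
    (2r+1)^2 window around the query point with each feature inside a window
    around the target point.
    feat_q, feat_t: (C, H, W) feature maps; q_xy, t_xy: integer (x, y).
    Returns a (2r+1, 2r+1, 2r+1, 2r+1) correlation volume.
    """
    def window(feat, xy):
        x, y = xy
        return feat[:, y - r:y + r + 1, x - r:x + r + 1]  # (C, 2r+1, 2r+1)

    wq, wt = window(feat_q, q_xy), window(feat_t, t_xy)
    return torch.einsum('cij,ckl->ijkl', wq, wt)

C, H, W = 64, 32, 32
corr = local_4d_correlation(torch.randn(C, H, W), torch.randn(C, H, W), (16, 16), (10, 12))
print(corr.shape)  # torch.Size([7, 7, 7, 7])
```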


# 111
Un-EVIMO: Unsupervised Event-based Independent Motion Segmentation

Ziyun Wang · Jinyuan Guo · Kostas Daniilidis

Event cameras are a novel type of biologically inspired vision sensor known for their high temporal resolution, high dynamic range, and low power consumption. Because of these properties, they are well suited for processing fast motions that require rapid reactions. Although event cameras have recently shown competitive performance in unsupervised optical flow estimation, performance in detecting independently moving objects (IMOs) lags behind, even though event-based methods are well suited for this task given their low latency and HDR properties. Previous approaches to event-based IMO segmentation have been heavily dependent on labeled data. However, biological vision systems have developed the ability to avoid moving objects through daily tasks without being given explicit labels. In this work, we propose the first event-based framework that generates IMO pseudo-labels using geometric constraints. Due to its unsupervised nature, our method can handle an arbitrary number of objects that are not predetermined and is easily scalable to datasets where expensive IMO labels are not readily available. We evaluate our approach on the EVIMO dataset and show that it performs competitively with supervised methods, both quantitatively and qualitatively.


# 313
Strong Double Blind
Edge-Guided Fusion and Motion Augmentation for Event-Image Stereo

Fengan Zhao · Qianang Zhou · Junlin Xiong

Traditional frame-based cameras have achieved impressive performance in stereo matching, yet challenges remain due to sensor constraints, such as low dynamic range and motion blur. In contrast, event cameras capture per-pixel intensity changes asynchronously with high temporal resolution, making them less prone to motion blur and offering a high dynamic range. However, the event stream provides less spatial information compared to intensity images. Although existing state-of-the-art event-based stereo methods fuse features from both modalities, they still struggle to effectively capture and represent edge details in the scene. In this paper, we propose a novel edge-guided event-image stereo network, which utilizes extra edge cues to supplement edge information during disparity estimation. Firstly, we introduce an edge-guided event-image feature fusion approach to effectively supplement edge information in the fused features. Secondly, we incorporate edge cues into the disparity update process by introducing an edge-guided motion augmentation module, further augmenting the edge information during disparity estimation. Finally, we demonstrate the superiority of our method in stereo matching by conducting experiments on a real-world dataset using joint image and event data.


# 312
Strong Double Blind
Physical-Based Event Camera Simulator

Haiqian Han · Jiacheng Lyu · Jianing Li · Henglu Wei · Cheng Li · Yajing Wei · SHU CHEN · Xiangyang Ji

Existing event camera simulators primarily focus on the process of generating video events and often overlook the entire optical path in real-world camera systems. To address this limitation, we propose a novel Physical-based Event Camera Simulator (PECS), which is able to generate a high-fidelity, realistic event stream by directly interfacing with the 3D scene. Our PECS features a lens simulation block for accurate light-to-sensor-chip replication and a multispectral rendering module for precise photocurrent generation. We present two spatiotemporal event metrics to assess the similarity between simulated and actual camera events. Experimental results demonstrate that our PECS outperforms four state-of-the-art simulators by a large margin in terms of event-based signal fidelity. We integrate our PECS into the UE platform to generate extensive multi-task synthetic datasets and evaluate its effectiveness in downstream vision tasks (e.g., video reconstruction). Our open-source code is available in the supplementary material.


# 314
Strong Double Blind
REDIR: Refocus-free Event-based De-occlusion Image Reconstruction

Qi Guo · Hailong Shi · Huan Li · Jinsheng Xiao · Xingyu Gao

Event-based synthetic aperture imaging (E-SAI), with its ability to capture high-frequency light intensity variations, has been widely applied to scene de-occlusion reconstruction tasks. However, existing methods usually require prior information and impose strict restrictions on camera motion during SAI acquisition. This paper proposes REDIR, a novel end-to-end refocus-free variable E-SAI de-occlusion image reconstruction approach, which can align the global and local features of variable event data and effectively achieve high-resolution imaging from pure event streams. To further improve the reconstruction of the occluded target, we propose a perceptual mask-gated connection module to interlink information between modules, and incorporate a spatial-temporal attention mechanism into the SNN block to enhance the model's target extraction ability. Through extensive experiments, our model achieves state-of-the-art reconstruction quality on the traditional E-SAI dataset without prior information. We further verify the effectiveness of the variable-event-data feature registration method on our newly introduced V-ESAI dataset, which obviates the reliance on prior knowledge and extends the applicability of SAI acquisition by incorporating focus changes, lens rotations, and non-uniform motion.


# 322
Strong Double Blind
Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising

Guanting Dong · Yueyi Zhang · Xiaoyan Sun · Zhiwei Xiong

Recent advancements have achieved impressive results in removing Multi-Path Interference (MPI) and shot noise. However, these methods only utilize a single frame of ToF data, neglecting the correlation between frames; multi-frame ToF denoising is still underexplored. In this paper, we propose the first learning-based framework for multi-frame ToF denoising. Different from previous frameworks, ours leverages inter-frame correlation to guide ToF noise removal with a confidence map. Specifically, we introduce a Dual-Correlation Estimation Module, which exploits both intra- and inter-correlation. The intra-correlation explicitly establishes the relevance between the spatial positions of geometric objects within the scene, aiding in depth residual initialization. The inter-correlation discerns variations in ToF noise distribution across different frames, thereby locating the areas with strong noise. To further leverage the dual correlation, we introduce a Confidence-guided Residual Regression Module to predict a confidence map, which guides the residual regression to prioritize regions with strong ToF noise. Experimental evaluations consistently show that our approach outperforms other ToF denoising methods, highlighting its superior performance in effectively reducing strong ToF noise.


# 114
Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation

Homanga Bharadhwaj · Roozbeh Mottaghi · Abhinav Gupta · Shubham Tulsiani

We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation — interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables zero-shot robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes.
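
One way to picture the track-to-transform step: given corresponding object points at two time-steps (the abstract's 2D tracks, here assumed to have already been lifted to 3D, e.g., with depth), a least-squares rigid transform can be recovered with the classical Kabsch/Procrustes solution. This is a generic sketch, not the paper's exact procedure.

```python
import numpy as np

def rigid_transform_from_tracks(pts_t0, pts_t1):
    """Least-squares rigid transform between two sets of corresponding 3D
    object points (Kabsch algorithm). pts_t0, pts_t1: (N, 3).
    Returns R (3, 3), t (3,) such that pts_t1 ~= pts_t0 @ R.T + t.
    """
    c0, c1 = pts_t0.mean(0), pts_t1.mean(0)
    H = (pts_t0 - c0).T @ (pts_t1 - c1)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R, t

# synthetic check: rotate by 30 degrees about z and translate
pts0 = np.random.rand(20, 3)
theta = np.deg2rad(30)
R_gt = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0, 0.0, 1.0]])
pts1 = pts0 @ R_gt.T + np.array([0.1, -0.2, 0.05])
R, t = rigid_transform_from_tracks(pts0, pts1)
print(np.allclose(R, R_gt, atol=1e-6), np.round(t, 3))
```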


# 123
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects

Ruining Li · Chuanxia Zheng · Christian Rupprecht · Andrea Vedaldi

We introduce DragAPart, a method that, given an image and a set of drags as input, generates a new image of the same object that responds to the action of the drags. Differently from prior works that focused on repositioning objects, DragAPart predicts part-level interactions, such as opening and closing a drawer. We study this problem as a proxy for learning a generalist motion model, not restricted to a specific kinematic structure or object category. We start from a pre-trained image generator and fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce. Combined with a new encoding for the drags and dataset randomization, the model generalizes well to real images and different categories. Compared to prior motion-controlled generators, we demonstrate much better part-level motion understanding.


# 127
Strong Double Blind
Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction

Guowei Xu · Jiale Tao · Wen Li · Lixin Duan

In the realm of stochastic human motion prediction (SHMP), researchers have often turned to generative models like GANs, VAEs, and diffusion models. However, most previous approaches have struggled to accurately predict motions that are both realistic and coherent with past motion due to a lack of guidance on the latent distribution. In this paper, we introduce Semantic Latent Directions (SLD) as a solution to this challenge, aiming to constrain the latent space to learn meaningful motion semantics and enhance the accuracy of SHMP. SLD defines a series of orthogonal latent directions and represents the hypothesis of future motion as a linear combination of these directions. By creating such an information bottleneck, SLD excels in capturing meaningful motion semantics, thereby improving the precision of motion predictions. Moreover, SLD offers controllable prediction capabilities by adjusting the coefficients of the latent directions during the inference phase. Expanding on SLD, we introduce a set of motion queries to enhance the diversity of predictions. By aligning these motion queries with the SLD space, SLD is further promoted toward more accurate and coherent motion predictions. Through extensive experiments conducted on widely used benchmarks, we showcase the superiority of our method in accurately predicting motions while maintaining a balance of realism and diversity. We intend to make our source code publicly available in the near future.


# 124
Strong Double Blind
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

Xintao Lv · Liang Xu · Yichao Yan · Xin Jin · Congsheng Xu · Wu Shuwen · Yifan Liu · Lincheng Li · Mengxiao Bi · Wenjun Zeng · Xiaokang Yang

Generating human-object interactions (HOIs) is critical given the tremendous advances in digital avatars. Existing datasets are typically limited to humans interacting with a single object, neglecting the ubiquitous manipulation of multiple objects. Thus, we propose HIMO, a large-scale MoCap dataset of full-body humans interacting with multiple objects, containing 3.3K 4D HOI sequences and 4.08M 3D HOI frames. We also annotate HIMO with detailed textual descriptions and temporal segments, benchmarking two novel tasks of HOI synthesis conditioned on either the whole text prompt or segmented text prompts for fine-grained timeline control. To address these novel tasks, we propose a dual-branch conditional diffusion model with a mutual interaction module for HOI synthesis. Besides, an auto-regressive generation pipeline is designed to obtain smooth transitions between HOI segments. Experimental results demonstrate generalization ability to unseen object geometries and temporal compositions. Our data, code, and models will be publicly available for research purposes.


# 125
ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions

Anindita Ghosh · Rishabh Dabral · Vladislav Golyanik · Christian Theobalt · Philipp Slusallek

Current approaches for 3D human motion synthesis generate high-quality animations of digital humans performing a wide variety of actions and gestures. However, a notable technological gap exists in addressing the complex dynamics of multi-human interactions within this paradigm. In this work, we present ReMoS, a denoising diffusion-based model that synthesizes full-body reactive motion of a person in a two-person interaction scenario. Assuming the motion of one person is given, we employ a combined spatio-temporal cross-attention mechanism to synthesize the reactive body and hand motion of the second person, thereby completing the interactions between the two. We demonstrate ReMoS across challenging two-person scenarios such as pair-dancing, Ninjutsu, kickboxing, and acrobatics, where one person’s movements have complex and diverse influences on the other. We also contribute the ReMoCap dataset for two-person interactions, containing full-body and finger motions. We evaluate ReMoS through multiple quantitative metrics, qualitative visualizations, and a user study, and also indicate usability in interactive motion editing applications. We will publicly release the code and the dataset with this paper to enable future research.


# 134
Strong Double Blind
Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

Kent Fujiwara · Mikihiro Tanaka · Qing Yu

With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and texts into features to achieve accurate correspondence between them. However, despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the temporal understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for motion-language models to retrieve the more likely text between the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance under conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose to use these texts with shuffled sequences of events as negative samples to reinforce the motion-language models. We conduct experiments on text-motion retrieval and text-to-motion generation using the reinforced motion-language models, which demonstrate improved performance over conventional approaches, indicating the necessity to consider the temporal elements of motion-language alignment.
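
A toy illustration of the negative-sample construction described above, shuffling the order of events in a compound action description; the simple event-splitting rule used here is an assumption of this sketch, not the paper's event decomposition.

```python
import random

def chronologically_shuffled_negative(description, seed=0):
    """Split a compound action description into events and shuffle their
    order, producing a chronologically inaccurate negative sample."""
    events = description.split(', then ')
    rng = random.Random(seed)
    shuffled = events[:]
    while len(events) > 1 and shuffled == events:   # force a different order
        rng.shuffle(shuffled)
    return ', then '.join(shuffled)

gt = "a person sits down, then stands up, then walks forward"
print(chronologically_shuffled_negative(gt))
```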


# 132
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model

Wenxun Dai · Ling-Hao Chen · Jingbo Wang · Jinpeng Liu · Bo Dai · Yansong Tang

This work introduces MotionLCM, extending controllable motion generation to a real-time level. Existing methods for spatial-temporal control in text-conditioned motion generation suffer from significant runtime inefficiencies. To address this issue, we first propose the motion latent consistency model (MotionLCM) for motion generation, building upon the motion latent diffusion model (MLD). By employing one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation. To ensure effective controllability, we incorporate a motion ControlNet within the latent space of MotionLCM. This design enables explicit control signals to directly influence the generation process, similar to controlling other latent-free diffusion models for motion generation. By employing these techniques, our approach achieves real-time generation of human motion with text conditions and control signals. Experimental results demonstrate the remarkable generation and control capabilities of MotionLCM while maintaining real-time runtime efficiency.


# 248
Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

Mi Luo · Zihui Xue · Alex Dimakis · Kristen Grauman

We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective. To this end, we propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation, which explicitly encourages cross-view correspondence between exocentric and egocentric views, and a diffusion-based pixel-level hallucination, which incorporates a hand layout prior to enhance the fidelity of the generated egocentric view. To pave the way for future advancements in this field, we curate a comprehensive exo-to-ego cross-view translation benchmark. It consists of a diverse collection of synchronized ego-exo tabletop activity video pairs from three public datasets: H2O, Aria Pilot, and Assembly101. The experimental results validate that Exo2Ego delivers photorealistic video results with clear hand manipulation details and outperforms several baselines in terms of both synthesis quality and generalization ability to new actions.


# 242
Strong Double Blind
Self-Supervised Audio-Visual Soundscape Stylization

Tingle Li · Renhao Wang · Po-Yao Huang · Andrew Owens · Gopala Krishna Anumanchipalli

Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.


# 245
TC4D: Trajectory-Conditioned Text-to-4D Generation

Sherwin Bahmani · Xian Liu · Wang Yifan · Ivan Skorokhodov · Victor Rong · Ziwei Liu · Xihui Liu · Jeong Joon Park · Sergey Tulyakov · Gordon Wetzstein · Andrea Tagliasacchi · David Lindell

Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate—they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene bounding box using rigid transformation along a spline, and we learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study.
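
A minimal sketch of the global/local motion factorization described above: a rigid transform sampled from a trajectory spline moves the scene bounding box, while a learned deformation adds local motion inside it. Both `spline_fn` and `deform_fn` are hypothetical stand-ins rather than the paper's components.

```python
import math
import torch

def compose_global_local(points, spline_fn, deform_fn, t):
    """Apply a local deformation inside the bounding box, then a global rigid
    transform sampled from a trajectory spline at time t.
    spline_fn(t) -> (R, trans); deform_fn(points, t) -> per-point offsets.
    """
    R, trans = spline_fn(t)                  # global rigid motion along the trajectory
    local = deform_fn(points, t)             # local deformation within the bounding box
    return (points + local) @ R.T + trans

# stand-ins: translate along +x over time, with a tiny periodic deformation
spline_fn = lambda t: (torch.eye(3), torch.tensor([t, 0.0, 0.0]))
deform_fn = lambda p, t: 0.01 * math.sin(t) * torch.ones_like(p)
pts = torch.randn(100, 3)
print(compose_global_local(pts, spline_fn, deform_fn, 0.5).shape)  # torch.Size([100, 3])
```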


# 240
LivePhoto: Real Image Animation with Text-guided Motion Control

Xi Chen · Zhiheng Liu · Mengting Chen · Yutong Feng · Yu Liu · Yujun Shen · Hengshuang ZHAO

Despite the recent progress in text-to-video generation, existing studies usually overlook the issue that only spatial contents but not temporal motions in synthesized videos are under the control of text. Towards such a challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering the facts that (1) text can only describe motions roughly (e.g., regardless of the moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module as well as a text re-weighting module to reduce the ambiguity of text-to-motion mapping. Empirical evidence suggests that our approach is capable of well decoding motion-related textual instructions into videos, such as actions, camera movements, or even conjuring new contents from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., the motion intensity) besides text for video customization. The code will be made publicly available.


# 238
Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Yixuan Ren · Yang Zhou · Jimei Yang · Jing Shi · Difan Liu · Feng Liu · Mingi Kwon · Abhinav Shrivastava

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot motion customization, we propose Customize-A-Video, which models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling from the reference videos. To disentangle the spatial and temporal information during the training pipeline, we introduce a novel concept of appearance absorbers that detach the original appearance from the single reference video prior to motion learning. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extension of our method to various downstream tasks such as custom video generation and editing, video appearance customization, and multiple motion combination.


# 236
Photorealistic Video Generation with Diffusion Models

Agrim Gupta · Lijun Yu · Kihyuk Sohn · Xiuye Gu · Meera Hahn · Li Fei-Fei · Irfan Essa · Lu Jiang · Jose Lezama

We present W.A.L.T, a diffusion transformer for photorealistic video generation from text prompts. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models to generate videos at 512 x 896 resolution and 8 frames per second.


# 315
High-Fidelity and Transferable NeRF Editing by Frequency Decomposition

YISHENG HE · Weihao Yuan · Siyu Zhu · Zilong Dong · Liefeng Bo · Qixing Huang

This paper enables high-fidelity, transferable NeRF editing by frequency decomposition. Recent NeRF editing pipelines lift 2D stylization results to 3D scenes but suffer from blurry results and fail to capture detailed structures, owing to inconsistencies between the 2D edits. Our critical insight is that the low-frequency components of images are more multiview-consistent after editing than their high-frequency parts. Moreover, the appearance style is mainly exhibited in the low-frequency components, while content details especially reside in the high-frequency parts. This motivates us to perform editing on low-frequency components, which results in high-fidelity edited scenes. In addition, the editing is performed in the low-frequency feature space, enabling stable intensity control and novel scene transfer. Comprehensive experiments conducted on photorealistic datasets demonstrate the superior performance of high-fidelity and transferable NeRF editing. The code of our method will be made public.


# 247
Strong Double Blind
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation

Junsung Lee · Minsoo Kang · Bohyung Han

We propose a simple but effective training-free method tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a diffusion model by incorporating a noise correction term, which is based on the progressive interpolation of the textual prompts corresponding to a given source image and a desired target image. We formulate the noise correction term as the difference between two noise predictions; one is computed from the denoising network given a target latent and an adaptive interpolation between the source and target prompt embeddings, while the other is the noise prediction given the target latent and the source prompt embedding. The final noise prediction network is given by a combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Extensive experiments verify that the proposed method achieves outstanding performance with fast inference time and consistently improves existing frameworks when combined with them.
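
A hedged sketch of the noise-correction idea as described in the abstract: the correction term is the difference between the noise predicted under an interpolated prompt embedding and under the source prompt, added to the standard denoising term. The interpolation and combination weights below are illustrative; the paper chooses them adaptively.

```python
import torch

def corrected_noise_prediction(eps_net, z_t, t, e_src, e_tgt, gamma=0.6, w=1.0):
    """Combine a standard (source-prompt) denoising term with a noise
    correction term built from prompt interpolation. `gamma` and `w` are
    illustrative constants standing in for the adaptive schedule."""
    e_mix = (1.0 - gamma) * e_src + gamma * e_tgt   # progressive prompt interpolation
    eps_src = eps_net(z_t, t, e_src)                # reconstructs must-be-preserved regions
    eps_mix = eps_net(z_t, t, e_mix)
    correction = eps_mix - eps_src                  # pushes edits toward the target prompt
    return eps_src + w * correction

# stand-in denoiser so the sketch runs end to end
eps_net = lambda z, t, e: 0.1 * z + e.mean() * torch.ones_like(z)
z = torch.randn(1, 4, 64, 64)
e_src, e_tgt = torch.randn(77, 768), torch.randn(77, 768)
print(corrected_noise_prediction(eps_net, z, 500, e_src, e_tgt).shape)
```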


# 262
Editable Image Elements for Controllable Synthesis

Jiteng Mu · Michael Gharbi · Richard Zhang · Eli Shechtman · Nuno Vasconcelos · Xiaolong Wang · Taesung Park

Diffusion models have made significant advances in text-guided synthesis tasks. However, editing user-provided images remains challenging, as the high dimensional noise input space of diffusion models is not naturally suited for image inversion or spatial editing. In this work, we propose an image representation that promotes spatial editing of input images using a diffusion model. Concretely, we learn to encode an input into image elements that can faithfully reconstruct an input image. These elements can be intuitively edited by a user, and are decoded by a diffusion model into realistic images. We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition.


# 244
Implicit Style-Content Separation using B-LoRA

Yarden Frenkel · Yael Vinker · Ariel Shamir · Danny Cohen-Or

Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or employing heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content allows for significantly improving style manipulation and overcoming overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to allow various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing.


# 241
Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

Animesh Sinha · Bo Sun · Anmol Kalia · Arantxa Casanova · Elliot Blanchard · David Yan · Winnie Zhang · Tony Nelli · Jiahui Chen · Hardik Shah · Licheng Yu · Mitesh Kumar Singh · Ankit Ramchandani · Maziar Sanjabi · Sonal Gupta · Amy L Bearman · Dhruv Mahajan

We introduce Style Tailoring, a recipe to finetune Latent Diffusion Models (LDMs) in a distinct domain with high visual quality, prompt alignment and scene diversity. We choose sticker image generation as the target domain, as the images significantly differ from photorealistic samples typically generated by large-scale LDMs. We start with a competent text-to-image model, like Emu, and show that relying on prompt engineering with a photorealistic model to generate stickers leads to poor prompt alignment and scene diversity. To overcome these drawbacks, we first finetune Emu on millions of sticker-like images collected using weak supervision to elicit diversity. Next, we curate human-in-the-loop (HITL) Alignment and Style datasets from model generations, and finetune to improve prompt alignment and style alignment respectively. Sequential finetuning on these datasets poses a tradeoff between better style alignment and prompt alignment gains. To address this tradeoff, we propose a novel fine-tuning method called Style Tailoring, which jointly fits the content and style distribution and achieves the best tradeoff. Evaluation results show our method improves visual quality by 14%, prompt alignment by 16.2% and scene diversity by 15.3%, compared to prompt engineering the base Emu model for sticker generation.


# 257
Strong Double Blind
EraseDraw : Learning to Insert Objects by Erasing Them from Images

Alper Canberk · Maksym Bondarenko · Ege Ozguroglu · Ruoshi Liu · Carl Vondrick

Creative processes such as painting often involve creating different components of an image one by one. Can we build a computational model to perform this task? Prior works often fail by making global changes to the image, inserting objects in unrealistic spatial locations, and generating inaccurate lighting details. We observe that while state-of-the-art models perform poorly on object insertion, they can remove objects and erase the background in natural images very well. Inverting the direction of object removal, we obtain high-quality data for learning to insert objects that are spatially, physically, and optically consistent with the surroundings. With this scalable automatic data generation pipeline, we can create a dataset for learning object insertion, which is used to train our proposed text-conditioned diffusion model. Qualitative and quantitative experiments have shown that our model achieves state-of-the-art results in object insertion, particularly for in-the-wild images. We show compelling results on diverse insertion prompts and images across various domains. In addition, we automate iterative insertion by combining our insertion model with beam search guided by CLIP.


# 260
Strong Double Blind
Text2Place: Affordance-aware Text Guided Human Placement

Rishubh Parihar · Harsh Gupta · Sachidanand VS · Venkatesh Babu Radhakrishnan

For a given scene, humans can easily reason about the locations and poses in which to place objects. Designing a computational model to reason about these affordances poses a significant challenge, mirroring the intuitive reasoning abilities of humans. This work tackles the problem of realistic human insertion in a given background scene, termed Semantic Human Placement. The task is extremely challenging given the diverse backgrounds, the scale and pose of the generated person, and, finally, the identity preservation of the person. We divide the problem into the following two stages: i) learning semantic masks using text guidance for localizing regions in the image to place humans, and ii) subject-conditioned inpainting to place a given subject adhering to the scene affordance within the semantic masks. To learn the semantic mask, we leverage rich object-scene priors learned from text-to-image generative models and optimize a novel parameterization of the semantic mask, eliminating the need for large-scale training. To the best of our knowledge, we are the first to provide an effective solution for realistic human placement in diverse real-world scenes. The proposed method can generate highly realistic scene compositions while preserving the background and subject identity. Further, we present results for several downstream tasks: scene hallucination from a single or multiple generated persons and text-based attribute editing. With extensive comparisons against strong baselines, we show the superiority of our method in realistic human placement.


# 243
Strong Double Blind
ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation

Jack Lu · Ryan Teehan · Mengye Ren

In this paper we propose ProCreate, a simple and easy-to-implement method to improve the sample diversity and creativity of diffusion-based image generative models and to prevent training data reproduction. ProCreate operates on a set of reference images and actively propels the generated image embedding away from the reference embeddings during the generation process. We collected a few-shot creative generation benchmark spanning eight different categories, encompassing different concepts, styles, and settings, on which ProCreate achieves the highest sample diversity and fidelity. Furthermore, we show that ProCreate is effective at preventing the replication of training data in a large-scale evaluation using training text prompts.
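
A toy version of the propulsive step: define an energy that increases as the generated sample's embedding approaches the reference embeddings, then move the sample down the energy gradient, i.e., away from the references. The encoder and step size are stand-ins, and the real method applies this during diffusion sampling.

```python
import torch

def propulsive_update(x, ref_embeds, embed_fn, step=0.1):
    """Move sample `x` away from the reference embeddings by descending an
    energy defined as the mean cosine similarity to the references.
    `embed_fn` is a hypothetical image encoder."""
    x = x.clone().requires_grad_(True)
    z = embed_fn(x)                                       # (D,) embedding of the sample
    sims = torch.cosine_similarity(z[None], ref_embeds)   # similarity to each reference
    energy = sims.mean()
    energy.backward()
    return (x - step * x.grad).detach()                   # propel away from references

embed_fn = lambda img: img.flatten()[:128]                # stand-in encoder
x = torch.randn(3, 64, 64)
refs = torch.randn(8, 128)
x_new = propulsive_update(x, refs, embed_fn)
print(torch.allclose(x, x_new))  # False: the sample moved away from the references
```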


# 250
Strong Double Blind
Label-free Neural Semantic Image Synthesis

Jiayi Wang · Kevin Alexander Laube · Yumeng Li · Jan Hendrik Metzen · Shin-I Cheng · Julio Borges · Anna Khoreva

Recent work has shown great progress in integrating spatial conditioning to control large, pre-trained text-to-image diffusion models. Despite these advances, existing methods describe the spatial image content using hand-crafted conditioning inputs, which are either semantically ambiguous (e.g., edges) or require expensive manual annotations (e.g., semantic segmentation). To address these limitations, we propose a new label-free way of conditioning diffusion models to enable fine-grained spatial control. We introduce the concept of neural semantic image synthesis, which uses neural layouts extracted from pre-trained foundation models as conditioning. Neural layouts are advantageous as they provide rich descriptions of the desired image, containing both semantics and detailed geometry of the scene. We experimentally show that images synthesized via neural semantic image synthesis achieve similar or superior pixel-level alignment of semantic classes compared to those created using expensive semantic label maps. At the same time, they capture better semantics, instance separation, and object orientation than other label-free conditioning options, such as edges or depth. Moreover, we show that images generated by neural layout conditioning can effectively augment real data for training various perception tasks.


# 232
Strong Double Blind
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Yifan Pu · Xia Zhuofan · Jiayi Guo · Dongchen Han · Qixiu Li · Duo Li · Yuhui Yuan · Ji Li · Yizeng Han · Shiji Song · Gao Huang · Xiu Li

This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps. In response to this observation, we present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens during the denoising generation phases, our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Concurrently, integrating mediator tokens simplifies the attention module's complexity to a linear scale, enhancing the efficiency of global attention processes. Additionally, we propose a time-step dynamic mediator token adjustment mechanism that further decreases the required computational FLOPs for generation, simultaneously facilitating the generation of high-quality images within the constraints of varied inference budgets. Extensive experiments demonstrate that the proposed method can improve the generated image quality while also reducing the inference cost of diffusion transformers. When integrated with the recent work SiT, our method achieves a state-of-the-art FID score of 2.01. Source code will be released.
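
The mediator-token idea can be sketched in a few lines: M mediator tokens first attend to the N keys/values, then the N queries attend only to the mediators, giving O(N*M) cost instead of O(N^2). The single-head formulation and shapes below are simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def mediator_attention(q, k, v, m):
    """Two-stage attention through a small set of mediator tokens.
    Shapes: q, k, v (N, D); m (M, D) with M << N."""
    d = q.shape[-1] ** 0.5
    # stage 1: mediators attend to the keys and summarize the values
    m_ctx = F.softmax(m @ k.T / d, dim=-1) @ v          # (M, D)
    # stage 2: queries attend only to the (few) mediators
    return F.softmax(q @ m.T / d, dim=-1) @ m_ctx       # (N, D)

N, M, D = 1024, 16, 64
out = mediator_attention(torch.randn(N, D), torch.randn(N, D),
                         torch.randn(N, D), torch.randn(M, D))
print(out.shape)  # torch.Size([1024, 64])
```

Varying M across denoising steps, as the abstract describes, would then trade detail for efficiency at different stages of generation.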


# 231
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Wendi Zheng · Jiayan Teng · Zhuoyi Yang · Weihan Wang · Jidong Chen · Xiaotao Gu · Yuxiao Dong · Ming Ding · Jie Tang

Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges in terms of computational efficiency and the refinement of image details. To tackle these issues, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while using only 1/10 of the inference time of SDXL.


# 249
Context Diffusion: In-Context Aware Image Generation

Ivona Najdenkoska · Animesh Sinha · Abhimanyu Dubey · Dhruv Mahajan · Vignesh Ramanathan · Filip Radenovic

We propose Context Diffusion, a diffusion-based framework that enables image generation models to learn from visual examples presented in context. Recent work tackles such in-context learning for image generation, where a query image is provided alongside context examples and text prompts. However, the quality and context fidelity of the generated images deteriorate when the prompt is not present, demonstrating that these models are unable to truly learn from the visual context. To address this, we propose a novel framework that separates the encoding of the visual context and the preservation of the desired image layout. This results in the ability to learn from the visual context and prompts, but also from either one of them. Furthermore, we enable our model to handle few-shot settings, to effectively address diverse in-context learning scenarios. Our experiments and human evaluation demonstrate that Context Diffusion excels in both in-domain and out-of-domain tasks, resulting in an overall enhancement in image quality and context fidelity compared to counterpart models.


# 227
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Zhiyu Tan · Mengping Yang · Luozheng Qin · Hao Yang · Ye Qian · Qiang Zhou · Cheng Zhang · Hao Li

Recent years have witnessed thrilling progress in Text-to-Image (T2I) generation with the advance of diffusion models. One critical prerequisite for faithful image creation is the precise understanding of the input text conditions, wherein most existing models borrow the CLIP text encoder to model the language input. Despite its widespread use in the field of T2I generation, the CLIP text encoder has several drawbacks: it can only encode English, and its maximum token length is very limited, i.e., only 77 tokens. Additionally, CLIP is a relatively small model, which restricts its verbal ability. To address this, this paper leverages Large Language Models (LLMs) as the text encoder to improve the language understanding of T2I diffusion models. Text features from LLMs usually support multiple languages, accommodate longer context, and provide superior text expression ability, thereby improving synthesis quality. Considering that training T2I diffusion models with the textual features of LLMs from scratch requires massive computational resources and a considerable amount of text-image data, we develop an innovative three-stage training strategy that effectively and efficiently integrates an existing text-to-image model with a large language model. The key ingredient of our model is a lightweight adapter that enables fast training of T2I diffusion models based on LLM textual representations while preserving the language power of LLMs. Extensive experimental results confirm that our model not only supports multiple languages but also achieves superior image generation performance, evidenced by both automatic metrics and human evaluation. Our code and models will be released.
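
A hypothetical sketch of the kind of lightweight adapter the abstract describes, mapping frozen LLM hidden states to the conditioning width of a diffusion model's cross-attention; the dimensions and layer choices are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LLMTextAdapter(nn.Module):
    """Project frozen LLM token features to the text-conditioning dimension
    expected by a T2I diffusion model's cross-attention layers."""
    def __init__(self, llm_dim=4096, cond_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(llm_dim),
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, llm_hidden):        # (B, T, llm_dim), from a frozen LLM
        return self.proj(llm_hidden)      # (B, T, cond_dim), fed to cross-attention

adapter = LLMTextAdapter()
print(adapter(torch.randn(2, 256, 4096)).shape)  # torch.Size([2, 256, 768])
```

Only the adapter would be trained, which is what keeps the integration cheap while the LLM and diffusion backbone stay frozen during the early stages.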


# 234
Strong Double Blind
Stable Preference: Redefining training paradigm of human preference model for Text-to-Image Synthesis

Hanting Li · Hongjing Niu · Feng Zhao

In recent years, deep generative models have developed rapidly and can generate high-quality images based on input texts. Assessing the quality of generated images in a way consistent with human preferences is critical for both generative model evaluation and preferred image selection. Previous works aligned models with human preferences by training scoring models on image pairs with preference annotations (e.g., ImageReward and HPD). These carefully annotated image pairs describe human preferences for choosing images well. However, the current training paradigm of these preference models is to directly maximize the preferred image score while minimizing the non-preferred image score in each image pair through a cross-entropy loss. This simple and naive training paradigm has two main problems: 1) for image pairs of similar quality, it is unreasonable to blindly minimize the score of non-preferred images, which can easily lead to overfitting; 2) human robustness to small visual perturbations is not taken into account, so the final model is unable to make stable choices. Therefore, we propose Stable Preference, which redefines the training paradigm of human preference models, together with an anti-interference loss that improves robustness to visual disturbances. Our method achieves state-of-the-art performance on two popular text-to-image human preference datasets. Extensive ablation studies and visualizations demonstrate the rationality and effectiveness of our method.
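For context, the "naive" paradigm criticised above amounts to a pairwise cross-entropy (Bradley-Terry-style) objective over preference pairs; a minimal sketch is shown below. The scorer inputs are stand-ins, and the proposed anti-interference loss is not reproduced here.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_pref: torch.Tensor, score_nonpref: torch.Tensor) -> torch.Tensor:
    """Baseline objective: push the preferred image's score above the non-preferred one's.

    Equivalent to cross-entropy over the two-way softmax of the pair, i.e.
    -log sigmoid(s_pref - s_nonpref). The abstract argues that blindly minimising
    the non-preferred score this way can overfit on pairs of similar quality.
    """
    return -F.logsigmoid(score_pref - score_nonpref).mean()

# toy usage with random "scores" from a hypothetical preference model
s_pref, s_nonpref = torch.randn(8), torch.randn(8)
print(pairwise_preference_loss(s_pref, s_nonpref).item())
```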


# 230
SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models

Weilong Chai · Dandan Zheng · Jiajiong Cao · Zhiquan Chen · Changbao Wang · Chenguang Ma

Text-to-image diffusion models such as Stable Diffusion (SD) exhibit significant advancements while requiring extensive computational resources. Existing acceleration methods usually require extensive training and are not universally applicable. LCM-LoRA, trainable once for diverse models, offers universality but rarely considers the consistency of generated content before and after acceleration. This paper proposes SpeedUpNet (SUN), an innovative acceleration module, to address the challenges of universality and consistency. Exploiting the role of cross-attention layers in the U-Net of SD models, we introduce an adapter specifically designed for these layers, quantifying the offset in image generation caused by negative prompts relative to positive prompts. This learned offset demonstrates stability across a range of models, enhancing SUN's universality. To improve output consistency, we propose a Multi-Step Consistent (MSC) loss, which stabilizes the offset and ensures fidelity in accelerated content. Experiments on SD v1.5 show that SUN leads to an overall speedup of more than 10 times compared to the baseline 25-step DPM-solver++, and offers two extra advantages: (1) training-free integration into various fine-tuned Stable Diffusion models and (2) state-of-the-art FIDs of the generated dataset before and after acceleration guided by random combinations of positive and negative prompts.


# 233
Large-scale Reinforcement Learning for Diffusion Models

Yinan Zhang · Eric Tzeng · Yilun Du · Dmitry Kislyuk

Text-to-image diffusion models are a class of deep generative models that have demonstrated an impressive capacity for high-quality image generation. However, these models are susceptible to implicit biases that arise from web-scale text-image training pairs and may inaccurately model aspects of images we care about. This can result in suboptimal samples, model bias, and images that do not align with human ethics and preferences. In this paper, we present an effective scalable algorithm to improve diffusion models using Reinforcement Learning (RL) across a diverse set of reward functions, such as human preference, compositionality, and fairness over millions of images. We illustrate how our approach substantially outperforms existing methods for aligning diffusion models with human preferences. We further illustrate how this substantially improves pretrained Stable Diffusion (SD) models, generating samples that are preferred by humans 80.3% of the time over those from the base SD model while simultaneously improving both the composition and diversity of generated samples.
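A generic reward-weighted policy-gradient update of the kind used to fine-tune diffusion samplers is sketched below; the log-probability and reward inputs are stubs, and this is only the standard REINFORCE-style template such work builds on, not the authors' algorithm.

```python
import torch

def reinforce_step(log_probs: torch.Tensor, rewards: torch.Tensor,
                   optimizer: torch.optim.Optimizer) -> float:
    """One REINFORCE-style update.

    log_probs: (batch,) summed log-probabilities of the sampled denoising trajectories
               under the current model (must require grad).
    rewards:   (batch,) scalar rewards, e.g. human-preference, compositionality, or fairness scores.
    """
    advantages = rewards - rewards.mean()            # simple baseline to reduce variance
    loss = -(advantages.detach() * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage: a single learnable parameter standing in for the policy's log-probs
theta = torch.zeros(4, requires_grad=True)
opt = torch.optim.Adam([theta], lr=1e-2)
print(reinforce_step(theta * 1.0, torch.tensor([1.0, 0.2, 0.5, 0.9]), opt))
```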


# 228
Latent Guard: a Safety Framework for Text-to-image Generation

Runtao Liu · Ashkan Khakzar · Jindong Gu · Qifeng Chen · Philip Torr · Fabio Pizzati

With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, easily circumvented, or harmful content classification, using large datasets for training and offering low flexibility. Here, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where we check the presence of harmful concepts in the input text embeddings. Our framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. Our method is evaluated on three datasets and against four baselines.
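The check described can be thought of as measuring similarity between an input prompt's embedding and a bank of learned harmful-concept embeddings; the hedged sketch below uses cosine similarity and a fixed threshold as placeholders for the learned latent space and decision rule.

```python
import torch
import torch.nn.functional as F

def is_flagged(prompt_emb: torch.Tensor, concept_bank: torch.Tensor, thr: float = 0.75) -> bool:
    """Flag a prompt if its embedding lies close to any blacklisted-concept embedding.

    prompt_emb:   (d,) embedding of the input text (e.g. from the T2I text encoder,
                  mapped into the learned latent space).
    concept_bank: (k, d) embeddings of harmful concepts learned contrastively.
    """
    sims = F.cosine_similarity(prompt_emb.unsqueeze(0), concept_bank, dim=-1)
    return bool((sims > thr).any())

# toy usage with random embeddings standing in for the learned space
bank = F.normalize(torch.randn(32, 512), dim=-1)
prompt = F.normalize(torch.randn(512), dim=-1)
print(is_flagged(prompt, bank))
```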


# 229
Arc2Face: A Foundation Model for ID-Consistent Human Faces

Foivos Paraperas Papantoniou · Alexandros Lattas · Stylianos Moschoglou · Jiankang Deng · Bernhard Kainz · Stefanos Zafeiriou

This paper presents Arc2Face, an identity-conditioned face foundation model, which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a degree of face similarity unmatched by existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.


# 226
Strong Double Blind
GAMMA-FACE: GAussian Mixture Models Amend Diffusion Models for Bias Mitigation in Face Images

Basudha Pal · Arunkumar Kannan · Ram Prabhakar Kathirvel · Alice O'Toole · Rama Chellappa

Significant advancements have been achieved in the domain of face generation with the adoption of diffusion models. However, diffusion models tend to amplify biases during the generative process, resulting in an uneven distribution of sensitive facial attributes such as age, gender, and race. In this paper, we introduce a novel approach to address this issue by debiasing the attributes in generated images. Our approach involves disentangling facial attributes by localizing the means within the latent space of the diffusion model using Gaussian mixture models (GMM). This method, leveraging the adaptable latent structure of diffusion models, allows us to localize the subspace responsible for generating specific attributes on-the-fly without the need for retraining. We demonstrate the effectiveness of our technique across various face datasets, resulting in fairer data generation while preserving sample quality. Furthermore, we empirically illustrate its effectiveness in reducing bias in downstream classification tasks without compromising performance by augmenting the original dataset with fairly generated data.
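As a rough picture of the localisation step, the sketch below fits a Gaussian mixture to latent codes and reads off component means, which could then be associated with attribute modes. The number of components, the use of scikit-learn, and the final rebalancing step are all illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# latents: (n_samples, d) diffusion latent codes collected during generation (random stand-ins here)
rng = np.random.default_rng(0)
latents = np.concatenate([rng.normal(-2, 1, (500, 16)), rng.normal(2, 1, (500, 16))])

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(latents)

means = gmm.means_                    # one mean per localized subspace / attribute mode
assignments = gmm.predict(latents)    # which mode each latent falls into

# one (hypothetical) way to rebalance sampling: nudge latents toward the under-represented mode
target = means[np.bincount(assignments).argmin()]
debiased = latents + 0.3 * (target - latents)
print(means.shape, debiased.shape)
```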


# 237
Closed-Loop Unsupervised Representation Disentanglement with $\beta$-VAE Distillation and Diffusion Probabilistic Feedback

Xin Jin · Bohan Li · Baao Xie · Wenyao Zhang · Jinming Liu · Ziqiang Li · Tao Yang · Wenjun Zeng

Representation disentanglement may help AI fundamentally understand the real world and thus benefit both discrimination and generation tasks. It currently has at least three unresolved core issues: (i) heavy reliance on label annotation and synthetic data, causing poor generalization to natural scenarios; (ii) heuristic/hand-crafted disentangling constraints that make it hard to adaptively achieve an optimal training trade-off; (iii) the lack of a reasonable evaluation metric, especially for real label-free data. To address these challenges, we propose a \textbf{C}losed-\textbf{L}oop unsupervised representation \textbf{Dis}entanglement approach dubbed \textbf{CL-Dis}. Specifically, we use a diffusion-based autoencoder (Diff-AE) as the backbone while resorting to $\beta$-VAE as a co-pilot to extract semantically disentangled representations. The strong generation ability of the diffusion model and the good disentanglement ability of the VAE are complementary. To strengthen disentangling, VAE-latent distillation and diffusion-wise feedback are interconnected in a closed-loop system for further mutual promotion. Then, a self-supervised \textbf{Navigation} strategy is introduced to identify interpretable semantic directions in the disentangled latent space. Finally, a new metric based on content tracking is designed to evaluate the disentanglement effect. Experiments demonstrate the superiority of CL-Dis on applications like real image manipulation and visual analysis.


# 253
Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling

Wonwoong Cho · Hareesh Ravi · Midhun Harikumar · Vinh Khuc · Krishna Kumar Singh · Jingwan Lu · David Iseri Inouye · Ajinkya Kale

As Diffusion Models have shown promising performance, many efforts have been made to improve their controllability. However, how to train Diffusion Models to have disentangled latent spaces and how to naturally incorporate the disentangled conditions during the sampling process have been underexplored. In this paper, we present a training framework for disentangling the latent spaces of Diffusion Models. We further propose two sampling methods that boost the realism of our Diffusion Models and also enhance controllability. Concisely, we train Diffusion Models conditioned on two latent features, a spatial content mask and a flattened style embedding. We rely on the inductive bias of the denoising process of Diffusion Models to encode pose/layout information in the content feature and semantic/style information in the style feature. Regarding the sampling methods, we first extend Composable Diffusion Models by breaking the conditional independence assumption to allow for some dependence between conditional inputs, which is shown to be effective for realistic generation in our experiments. Second, we propose timestep-dependent weight scheduling for the content and style features to further improve performance. We also observe better controllability of our proposed methods compared to existing methods in image manipulation and image translation.
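Timestep-dependent weighting of the two conditions can be illustrated by blending two conditional noise predictions with weights that shift over the denoising trajectory; the linear schedule below (content emphasised early, style late) is an assumed example, not the authors' schedule.

```python
import torch

def blended_eps(eps_content: torch.Tensor, eps_style: torch.Tensor,
                t: int, T: int = 1000) -> torch.Tensor:
    """Combine two conditional noise predictions with timestep-dependent weights.

    Early steps (large t) emphasise the spatial content mask, which fixes layout;
    later steps emphasise the style embedding, which fills in appearance.
    """
    w_content = t / T              # goes from ~1 at the start of sampling to 0 at the end
    w_style = 1.0 - w_content
    return w_content * eps_content + w_style * eps_style

eps_c, eps_s = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
print(blended_eps(eps_c, eps_s, t=800).shape)
```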


# 252
ByteEdit: Boost, Comply and Accelerate Generative Image Editing

YUXI REN · Jie Wu · Yanzuo Lu · Huafeng Kuang · Xin Xia · Xionghui Wang · Qianqian Wang · Yixing Zhu · Pan Xie · Shiyin Wang · xuefeng xiao · Yitong Wang · Min Zheng · Lean FU

Recent advancements in diffusion-based generative image editing have sparked a profound revolution, reshaping the landscape of image outpainting and inpainting tasks. Despite these strides, the field grapples with inherent challenges, including: i) inferior quality; ii) poor consistency; iii) insufficient instruction adherence; iv) suboptimal generation efficiency. To address these obstacles, we present ByteEdit, an innovative feedback learning framework meticulously designed to Boost, Comply, and Accelerate Generative Image Editing tasks. ByteEdit seamlessly integrates image reward models dedicated to enhancing aesthetics and image-text alignment, while also introducing a dense, pixel-level reward model tailored to foster coherence in the output. Furthermore, we propose a pioneering adversarial and progressive feedback learning strategy to expedite the model's inference speed. Through extensive large-scale user evaluations, we demonstrate that ByteEdit surpasses leading generative image editing products, including Adobe, Canva, and MeiTu, in both generation quality and consistency. ByteEdit-Outpainting exhibits a remarkable enhancement of 388% and 135% in quality and consistency, respectively, when compared to the baseline model. Experiments also verified that our accelerated models maintain excellent performance in terms of quality and consistency.


# 255
DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

Jeongsol Kim · Geon Yeong Park · Jong Chul Ye

Reverse sampling and score distillation have emerged as the main workhorses in recent years for image manipulation using latent diffusion models (LDMs). While reverse diffusion sampling often requires adjustments of the LDM architecture or feature engineering, score distillation offers a simple yet powerful model-agnostic approach, but it is often prone to mode collapse. To address these limitations and leverage the strengths of both approaches, we introduce a novel framework called DreamSampler, which seamlessly integrates these two distinct approaches through the lens of regularized latent optimization. Similar to score distillation, DreamSampler is a model-agnostic approach applicable to any LDM architecture, but it allows both distillation and reverse sampling with additional guidance for image editing and reconstruction. Through experiments involving image editing, SVG reconstruction, and other tasks, we demonstrate the competitive performance of DreamSampler compared to existing approaches, while providing new applications.


# 273
Strong Double Blind
Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion

Yu Cao · Shaogang Gong

In the field of few-shot image generation (FSIG) using deep generative models (DGMs), accurately estimating the distribution of the target domain with minimal samples poses a significant challenge. This requires a method that can capture both the broad diversity and the true characteristics of the target domain distribution. We present Conditional Relaxing Diffusion Inversion (CRDI), an innovative 'training-free' approach designed to enhance distribution diversity in synthetic image generation. Distinct from conventional methods, CRDI does not rely on fine-tuning based on only a few samples. Instead, it focuses on reconstructing each target image instance and expanding diversity through few-shot learning. The approach begins by identifying a Sample-wise Guidance Embedding (SGE) for the diffusion model, which serves a purpose analogous to the explicit latent codes in certain generative adversarial network (GAN) models. Subsequently, a scheduler progressively introduces perturbations to the SGE, thereby augmenting diversity. Comprehensive experimental analysis demonstrates that our method surpasses GAN-based reconstruction techniques and equals state-of-the-art (SOTA) FSIG methods in performance. Additionally, it effectively mitigates overfitting and catastrophic forgetting, common drawbacks of fine-tuning approaches.


# 279
Strong Double Blind
Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis

Chirag Vashist · Shichong Peng · Ke Li

An emerging area of research aims to learn deep generative models with limited training data. Implicit Maximum Likelihood Estimation (IMLE), a recent technique, successfully addresses the mode collapse issue of GANs and has been adapted to the few-shot setting, achieving state-of-the-art performance. However, current IMLE-based approaches encounter challenges due to inadequate correspondence between the latent codes selected for training and those drawn during inference. This results in suboptimal test-time performance. To address this issue, we propose RS-IMLE, a novel approach that changes the prior distribution used for training. This leads to substantially higher-quality image generation compared to existing IMLE-based methods, as validated by a theoretical analysis and comprehensive experiments conducted on nine few-shot image datasets.


# 156
FMBoost: Boosting Latent Diffusion with Flow Matching

Johannes Schusterbauer-Fischer · Ming Gui · Pingchuan Ma · Nick Stracke · Stefan Andreas Baumann · Tao Hu · Bjorn Ommer

Visual synthesis has recently seen significant leaps in performance, in particular due to breakthroughs in generative models. Diffusion models have been a key enabler, as they excel in image diversity. This, however, comes at the price of slow training and synthesis, which is only partially alleviated by latent diffusion. To this end, flow matching is an appealing approach due to its complementary characteristics of faster training and inference but less diverse synthesis. We demonstrate that introducing flow matching between a frozen diffusion model and a convolutional decoder enables high-resolution image synthesis at reduced computational cost and model size. A small diffusion model can then provide the necessary visual diversity effectively, while flow matching efficiently enhances resolution and details by mapping from the small latent space to a higher-dimensional one. These latents are then projected to high-resolution images by the subsequent convolutional decoder of the diffusion approach. Combining the diversity of diffusion models, the efficiency of flow matching, and the effectiveness of convolutional decoders achieves state-of-the-art high-resolution image synthesis at $1024^2$ pixels with minimal computational cost. Cascading our model optionally boosts this further to $2048^2$ pixels. Importantly, our approach is orthogonal to recent approximation and speed-up strategies for the underlying model, making it easily integrable into the various diffusion model frameworks.


# 339
AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation

Shengkun Tang · Yaqing Wang · Caiwen Ding · Yi Liang · Yao Li · Dongkuan Xu

Diffusion models achieve great success in generating diverse and high-fidelity images, yet their widespread application, especially in real-time scenarios, is hampered by their inherently slow generation speed. The slow generation stems from the necessity of multi-step network inference. While certain predictions benefit from the full computation of the model in each sampling iteration, not every iteration requires the same amount of computation, potentially leading to inefficient use of computation. Unlike typical adaptive computation challenges that deal with single-step generation problems, diffusion processes with multi-step generation need to dynamically adjust their computational resource allocation based on an ongoing assessment of each step's importance to the final image output, presenting a unique set of challenges. In this work, we propose AdaDiff, an adaptive computational framework that dynamically allocates computation resources in each sampling step to improve the generation efficiency of diffusion models. To assess the effects of changes in computational effort on image quality, we present a timestep-aware uncertainty estimation module (UEM) designed for diffusion models. Integrated at each intermediate layer, the UEM evaluates the predictive uncertainty of that layer. This uncertainty measurement serves as a crucial indicator for determining whether to exit the inference process early. Additionally, we introduce an uncertainty-aware layer-wise loss mechanism aimed at bridging the performance divide between full models and their early-exited counterparts. Utilizing this loss strategy enables our model to achieve results on par with full-layer models. Comprehensive experiments, including class-conditional, unconditional, and text-guided image generation across multiple datasets, demonstrate superior performance and efficiency relative to current early-exiting techniques for diffusion models. Notably, we observe enhanced performance in terms of FID, alongside a reduction in computation of around 45%. Another exciting observation is that adaptive computation can synergize with other efficiency-enhancing methods, such as reduced sampling steps and weight pruning, to further accelerate inference and boost performance. Full code and models are released for reproducibility.
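The step-wise early-exit idea can be sketched as a per-layer loop that stops once an uncertainty estimate drops below a threshold; the uncertainty measure and threshold below are placeholders for the paper's UEM, and the toy backbone is not the authors' network.

```python
import torch
import torch.nn as nn

class EarlyExitBackbone(nn.Module):
    """Toy denoiser whose intermediate layers can terminate inference early."""
    def __init__(self, dim: int = 64, n_layers: int = 8, threshold: float = 0.05):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.threshold = threshold

    def forward(self, x: torch.Tensor):
        pred = None
        for i, (layer, head) in enumerate(zip(self.layers, self.heads)):
            x = torch.relu(layer(x))
            new_pred = head(x)
            # placeholder uncertainty: change between consecutive intermediate predictions
            uncertainty = float("inf") if pred is None else (new_pred - pred).abs().mean().item()
            pred = new_pred
            if uncertainty < self.threshold:   # confident enough -> exit early for this sampling step
                return pred, i + 1
        return pred, len(self.layers)

model = EarlyExitBackbone()
out, used_layers = model(torch.randn(2, 64))
print(out.shape, "layers used:", used_layers)
```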


# 267
Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

Fu-Yun Wang · Xiaoshi Wu · Zhaoyang Huang · Xiaoyu Shi · Dazhong Shen · Guanglu Song · Yu Liu · Hongsheng LI

Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA (\textbf{M}astering Video \textbf{O}utpainting \textbf{T}hrough \textbf{I}nput-Specific \textbf{A}daptation), a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase involves conducting efficient and effective pseudo outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video, as well as bridging the gap between standard generative processes and outpainting. The subsequent phase, pattern-aware outpainting, is dedicated to the generalization of these learned patterns to generate outpainting outcomes. Additional strategies, including spatial-aware insertion and noise travel, are proposed to better leverage the diffusion model's generative prior and the acquired video patterns from source videos. Extensive evaluations underscore MOTIA's superiority, outperforming existing state-of-the-art methods in widely recognized benchmarks. Notably, these advancements are achieved without necessitating extensive, task-specific tuning.


# 320
Strong Double Blind
L-DiffER: Single Image Reflection Removal with Language-based Diffusion Model

Yuchen Hong · Haofeng Zhong · Shuchen Weng · Jinxiu Liang · Boxin Shi

In this paper, we introduce L-DiffER, a language-based diffusion model designed for the ill-posed single image reflection removal task. Although having shown impressive performance for image generation, existing language-based diffusion models struggle with precise control and faithfulness in image restoration. To overcome these limitations, we propose an iterative condition refinement strategy to resolve the problem of inaccurate control conditions. A multi-condition constraint mechanism is employed to ensure the recovery faithfulness of image color and structure while retaining the generation capability to handle low-transmitted reflections. We demonstrate the superiority of the proposed method through extensive experiments, showcasing both quantitative and qualitative improvements over existing methods.


# 328
Strong Double Blind
LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement

Ye Yu · Fengxin Chen · Jun Yu · Zhen Kan

While recent low-light image enhancement (LLIE) methods have made significant advancements, they still face challenges in terms of low visual quality and weak generalization ability when applied to complex scenarios. To address these issues, we propose a semi-supervised method based on latent mean-teacher and Gaussian process, named LMT-GP. We first design a latent mean-teacher framework that integrates both labeled and unlabeled data, as well as their latent vectors, into model training. Meanwhile, we use a mean-teacher-assisted Gaussian process learning strategy to establish a connection between the latent and pseudo-latent vectors obtained from the labeled and unlabeled data. To guide the learning process, we utilize an assisted Gaussian process regression (GPR) loss function. Furthermore, we design a pseudo-label adaptation module (PAM) to ensure the reliability of the network learning. To demonstrate our method's generalization ability and effectiveness, we apply it to multiple LLIE datasets and high-level vision tasks. Experiment results demonstrate that our method achieves high generalization performance and image quality. We will make our code publicly available.


# 316
Strong Double Blind
Depth-Aware Blind Image Decomposition for Real-World Adverse Weather Recovery

Chao Wang · Zhedong Zheng · Ruijie Quan · Yi Yang

In this paper, we delve into Blind Image Decomposition (BID) tailored for real-world scenarios, aiming to uniformly recover images from diverse, unknown weather combinations and intensities. Our investigation uncovers one inherent gap between the controlled lab settings and the complex real-world environments. In particular, existing BID methods and datasets usually overlook the physical property that adverse weather varies with scene depth rather than a uniform depth, thus constraining their efficiency on real-world photos. To address this limitation, we design an end-to-end Depth-aware Blind Network, namely DeBNet, to explicitly learn the depth-aware transmissivity maps, and further predict the depth-guided noise residual to jointly produce the restored output. Moreover, we employ neural architecture search to adaptively find optimal architectures within our specified search space, considering significant shape and structure differences between multiple degradations. To verify the effectiveness, we further introduce two new BID datasets, namely BID-CityScapes and BID-GTAV, which simulate depth-aware degradations on real-world and synthetic outdoor images, respectively. Extensive experiments on both existing and proposed benchmarks show the superiority of our method over state-of-the-art approaches.


# 319
Strong Double Blind
Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

Yeying Jin · Xin Li · Jiadong Wang · Yan Zhan · Malu Zhang

Existing raindrop removal datasets have two shortcomings. First, they consist of images captured by cameras focused on the background, leading to the presence of blurry raindrops. To our knowledge, none of these datasets include images where the focus is specifically on raindrops, which results in a blurry background. Second, these datasets predominantly consist of daytime images, thereby lacking nighttime raindrop scenarios. Consequently, algorithms trained on these datasets may struggle to perform effectively in raindrop-focused or nighttime scenarios. The absence of datasets specifically designed for raindrop-focused and nighttime raindrops constrains research in this area. In this paper, we introduce a large-scale, real-world raindrop removal dataset called Raindrop Clarity. Raindrop Clarity comprises 15,186 high-quality pairs/triplets (raindrops, blur, and background) of images with raindrops and the corresponding clear background images. There are 5,442 daytime raindrop images and 9,744 nighttime raindrop images. Specifically, the 5,442 daytime images include 3,606 raindrop-focused and 1,836 background-focused images, while the 9,744 nighttime images contain 4,834 raindrop-focused and 4,906 background-focused images. Our dataset will enable the community to explore background-focused and raindrop-focused images, including challenges unique to daytime and nighttime conditions.


# 334
XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution

Yunpeng Qu · Kun Yuan · Kai Zhao · Qizhi Xie · Jinhua Hao · Ming Sun · Chao Zhou

Diffusion-based methods, endowed with a formidable generative prior, have received increasing attention in Image Super-Resolution (ISR) recently. However, as low-resolution (LR) images often undergo severe degradation, it is challenging for ISR models to perceive the semantic and degradation information, resulting in restored images with incorrect content or unrealistic artifacts. To address these issues, we propose a \textit{Cross-modal Priors for Super-Resolution (XPSR)} framework. Within XPSR, cutting-edge Multimodal Large Language Models (MLLMs) are utilized to acquire precise and comprehensive semantic conditions for the diffusion model. To facilitate better fusion of cross-modal priors, a \textit{Semantic-Fusion Attention} is introduced. To distill semantic-preserved information instead of undesired degradations, a \textit{Degradation-Free Constraint} is attached between the LR image and its high-resolution (HR) counterpart. Quantitative and qualitative results show that XPSR is capable of generating high-fidelity and high-realism images across synthetic and real-world datasets. The model and code will be made publicly available.


# 332
Strong Double Blind
AdaDiffSR: Adaptive Region-aware Dynamic acceleration Diffusion Model for Real-World Image Super-Resolution

Yuanting Fan · Chengxu Liu · Nengzhong Yin · Changlong Gao · Xueming Qian

Diffusion models (DMs) have shown promising results on single-image super-resolution and other image-to-image translation tasks. Benefiting from more computational resources and longer inference times, they are able to yield more realistic images. Existing DMs-based super-resolution methods try to achieve an overall average recovery over all regions via iterative refinement, ignoring the consideration that different input image regions require different timesteps to reconstruct. In this work, we notice that previous DMs-based super-resolution methods suffer from wasting computational resources to reconstruct invisible details. To further improve the utilization of computational resources, we propose AdaDiffSR, a DMs-based SR pipeline with dynamic timesteps sampling strategy (DTSS). Specifically, by introducing the multi-metrics latent entropy module (MMLE), we can achieve dynamic perception of the latent spatial information gain during the denoising process, thereby guiding the dynamic selection of the timesteps. In addition, we adopt a progressive feature injection module (PFJ), which dynamically injects the original image features into the denoising process based on the current information gain, so as to generate images with both fidelity and realism. Experiments show that our AdaDiffSR achieves comparable performance over current state-of-the-art DMs-based SR methods while consuming less computational resources and inference time on both synthetic and real-world datasets.


# 325
Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration

shihao zhou · Jinshan Pan · Jinglei Shi · Duosheng Chen · Lishen Qu · Jufeng Yang

Exploring useful features from images as prompts to guide deep image restoration models is an effective way to approach image restoration. In contrast to mining spatial relations within images as the prompt, which neglects the characteristics of different frequencies and leaves subtle or undetectable artifacts in the restored image, we develop a Frequency Prompting image restoration method, dubbed FPro, which effectively provides prompt components from a frequency perspective to guide the restoration model in addressing these differences. Specifically, we first decompose input features into separate frequency parts via dynamically learned filters, where we introduce a gating mechanism for suppressing the less informative elements within the kernels. To propagate useful frequency information as the prompt, we then propose a dual prompt block, consisting of a low-frequency prompt modulator (LPM) and a high-frequency prompt modulator (HPM), to handle signals from different bands respectively. Each modulator contains a generation process to incorporate prompting components into the extracted frequency maps, and a modulation part that modifies the prompt feature with the guidance of the decoder features. Experimental results on commonly used benchmarks demonstrate the favorable performance of our pipeline against SOTA methods on 5 image restoration tasks, including deraining, deraindrop, demoireing, deblurring, and dehazing. The source code is provided in the supplementary materials.
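The low/high-frequency split that such a method builds on can be illustrated with a simple FFT-based decomposition of feature maps; the fixed radial cutoff below is an assumed stand-in for the paper's dynamically learned, gated filters.

```python
import torch

def frequency_split(feat: torch.Tensor, cutoff: float = 0.25):
    """Split a feature map into low- and high-frequency parts with a radial FFT mask.

    feat: (B, C, H, W). An FPro-style module would learn the filters and gate
    uninformative kernel elements; here the cutoff is a fixed illustrative choice.
    """
    B, C, H, W = feat.shape
    fy = torch.fft.fftfreq(H).reshape(H, 1)
    fx = torch.fft.fftfreq(W).reshape(1, W)
    radius = torch.sqrt(fy ** 2 + fx ** 2)
    low_mask = (radius <= cutoff).to(feat.dtype)

    spec = torch.fft.fft2(feat)
    low = torch.fft.ifft2(spec * low_mask).real   # keep only low-frequency content
    high = feat - low                             # residual carries the high-frequency detail
    return low, high

low, high = frequency_split(torch.randn(1, 8, 32, 32))
print(low.shape, high.shape)
```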


# 330
Strong Double Blind
Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

chen rao · Guangyuan Li · Zehua Lan · Jiakai Sun · Junsheng Luan · Wei Xing · Lei Zhao · Huaizhong Lin · Jianfeng Dong · Dalong Zhang

Current video deblurring methods have limitations in recovering high-frequency detail, since regression losses are conservative with high-frequency details. Since Diffusion Models (DMs) have strong capabilities in generating high-frequency details, we consider introducing DMs into the video deblurring task. However, we found that directly applying DMs to video deblurring has the following problems: (1) DMs require many iteration steps to generate videos from Gaussian noise, which consumes substantial computational resources. (2) DMs are easily misled by blurry artifacts in the video, resulting in irrational content and distortion of the deblurred video. To address the above issues, we propose a novel video deblurring framework, VD-Diff, that integrates the diffusion model into a Wavelet-Aware Dynamic Transformer (WADT). Specifically, we run the diffusion model in a highly compact latent space to generate prior features containing high-frequency information that conforms to the ground-truth distribution. We design the WADT to preserve and recover the low-frequency global information in the video while utilizing the high-frequency information generated by the diffusion model. Extensive experiments show that our proposed VD-Diff outperforms SOTA methods on the GoPro, DVD, BSD, and Real-World Video datasets.


# 327
Strong Double Blind
BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow

EungGu Kang · Byeonghun Lee · Sunghoon Im · Kyong Hwan Jin

Learning-based multi-frame super-resolution (MFSR) achieves higher performance than single-image super-resolution (SISR), because MFSR leverages abundant information from multiple frames. Recent MFSR approaches adapt the deformable convolution network (DCN) to align the frames. However, existing MFSR suffers from misalignments between the reference and source frames due to the limitations of DCN, such as small receptive fields and the predefined number of kernels. Owing to these problems, existing MFSR approaches struggle to represent high-frequency information. To this end, we propose Deep Burst Multi-scale SR using Fourier Space with Optical Flow (BurstM). The proposed method estimates the optical flow offset for accurate alignment and predicts the continuous Fourier coefficients of each frame to represent high-frequency textures. In addition, we enhance the network's flexibility by supporting various super-resolution (SR) scale factors with a single model. We demonstrate that our method achieves higher performance and greater flexibility than existing MFSR methods.


# 323
Strong Double Blind
DualDn: Dual-domain Denoising via Differentiable ISP

Ruikang Li · Yujin Wang · Shiqi Chen · Fan Zhang · Jinwei Gu · Tianfan Xue

Image denoising is a critical step in the Image Signal Processor (ISP) of a camera. There are two typical ways to inject a denoiser into the ISP pipeline: a raw domain denoiser that is directly applied to captured raw frames, and an sRGB domain denoiser that is applied to the sRGB image output by the ISP. However, both approaches have their limitations. The residual noise from the raw-domain denoising will be amplified by the ISP pipeline, and the sRGB domain cannot handle spatially varying noise as it only sees noise distorted by ISP processing. As a result, most raw-domain or sRGB-domain denoising works only for specific noise distributions and ISP configurations. To address these challenges, we propose DualDn, a novel learning-based dual-domain denoising. Unlike previous single-domain denoising, DualDn consists of two denoising networks, one in the raw domain and one in the sRGB domain. The raw domain denoising can adapt to spatially varying noise levels, and the sRGB domain denoising can remove the residual noise amplified by the ISP. Both denoising networks are connected with a differentiable ISP, which is trained end-to-end and discarded during the inference stage. With this design, DualDn achieves greater generalizability compared to most learning-based denoising, as it can adapt to different unseen noises, ISP parameters, and even novel ISP pipelines. Experiments show that DualDn achieves state-of-the-art performance and can adapt to different denoising network architectures. Moreover, DualDn can be used as a plug-and-play denoising module with real cameras without retraining, and still demonstrate better performance than commercial on-camera denoising, further showing its generalization ability.
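A minimal end-to-end sketch of the dual-domain idea, a raw denoiser and an sRGB denoiser trained jointly through a differentiable ISP, is given below; the tiny networks and the gamma-only "ISP" are placeholders, not the paper's pipeline.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, ch, 3, padding=1))
    def forward(self, x):
        return x - self.net(x)          # predict and subtract a noise residual

def toy_isp(raw: torch.Tensor) -> torch.Tensor:
    """Differentiable stand-in for the ISP: gamma correction only."""
    return raw.clamp(min=1e-6) ** (1 / 2.2)

raw_denoiser, srgb_denoiser = TinyDenoiser(), TinyDenoiser()
clean_raw = torch.rand(2, 3, 32, 32)
noisy_raw = clean_raw + 0.1 * torch.randn(2, 3, 32, 32)
clean_srgb = toy_isp(clean_raw)

pred = srgb_denoiser(toy_isp(raw_denoiser(noisy_raw)))   # raw domain -> ISP -> sRGB domain
loss = nn.functional.mse_loss(pred, clean_srgb)          # end-to-end supervision in sRGB space
loss.backward()                                          # gradients reach both denoisers through the ISP
print(loss.item())
```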


# 329
Strong Double Blind
Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

Ping Wang · Yulun Zhang · Lishun Wang · Xin Yuan

Recently, deep learning models have achieved impressive success in solving the inverse problem of Snapshot Compressive Imaging (SCI) for video, \ie, reconstructing multiple high-fidelity frames from a single-shot observation. However, existing works lack insight into the mixed degradation of spatial masking and temporal aliasing, and empirically follow the designs of successful plain video restoration (\eg, denoising, deblurring) models, limiting the overall performance. In this work, we tailor a network architecture with a Hierarchical Separable Video Transformer (HiSViT) as the building block, composed of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and a Gated Self-Modulated Feed-Forward Network (GSM-FFN). CSS-MSA decomposes spatio-temporal similarity calculations into spatial and temporal dimensions but attends to all spatio-temporal tokens at a controllable scale within a single attention layer. GSM-FFN is designed to bring locality to CSS-MSA via a gating mechanism and space-time separable convolutions. HiSViT is built from multiple groups of CSS-MSA plus GSM-FFN, each of which focuses on a different scale, enabling multi-scale interaction and long-range modeling. Extensive experiments demonstrate that our model achieves state-of-the-art performance.


# 209
Strong Double Blind
Image Compression for Machine and Human Vision With Spatial-Frequency Adaptation

han li · Shaohui Li · Shuangrui Ding · Wenrui Dai · Maida Cao · Chenglin Li · Junni Zou · Hongkai Xiong

Image compression for machine and human vision (ICMH) has gained increasing attention in recent years. Existing ICMH methods are limited by high training and storage overheads due to heavy design of task-specific networks. To address this issue, in this paper, we develop a novel lightweight adapter-based tuning framework for ICMH, named Adapt-ICMH, that better balances task performance and bitrates with reduced overheads. We propose a spatial-frequency modulation adapter (SFMA) that simultaneously eliminates non-semantic redundancy with a spatial modulation adapter, and enhances task-relevant frequency components and suppresses task-irrelevant frequency components with a frequency modulation adapter. The proposed adapter is plug-and-play and compatible with almost all existing learned image compression models without compromising the performance of pre-trained models. Experiments demonstrate that Adapt-ICMH consistently outperforms existing ICMH frameworks on various machine vision tasks with fewer fine-tuned parameters and reduced computational complexity.


# 326
Strong Double Blind
Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery

Jian-Li Wang · Xi-Le Zhao

Recently, the transform-based low-rank tensor factorization (t-LRTF) has emerged as a promising tool for multi-dimensional data recovery. However, the discrete transforms along the third (i.e., temporal/spectral) dimension are dominating in existing t-LRTF methods, which hinders their performance in addressing temporal/spectral degeneration scenarios, e.g., video frame interpolation and multispectral image (MSI) spectral super-resolution. To break this barrier, we propose a novel Functional Transform-based Low-Rank Tensor Factorization (FLRTF), where the learnable functional transform is expressed by the implicit neural representation with positional encodings. The continuity brought by this function allows FLRTF to capture the smoothness of data in the third dimension, which will benefit the recovery of temporal/spectral degeneration problems. To examine the effectiveness of FLRTF, we establish a general FLRTF-based multi-dimensional data recovery model. Experimental results, including video frame interpolation/extrapolation, MSI band interpolation, and MSI spectral super-resolution tasks, substantiate that FLRTF has superior performance as compared with representative data recovery methods.


# 162
Strong Double Blind
Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

Sojin Lee · Dogyun Park · Inho Kong · Hyunwoo J. Kim

Recent studies on inverse problems have proposed posterior samplers that leverage the pre-trained diffusion models as a powerful prior. The attempts have paved the way for using diffusion models in a wide range of inverse problems. However, the existing methods entail computationally demanding iterative sampling procedures and optimize a separate solution for each measurement, which leads to limited scalability and lack of generalization capability across unseen samples. To address these limitations, we propose a novel approach, Diffusion prior-based Amortized Variational Inference (DAVI) that solves inverse problems with a diffusion prior from an amortized variational inference perspective. Specifically, instead of the separate measurement-wise optimization, our amortized inference learns a function that directly maps measurements to the implicit posterior distributions of corresponding clean data, enabling a single-step posterior sampling even for unseen measurements. The proposed method learns the function by minimizing the Kullback-Leibler divergence between the implicit distributions and the true posterior distributions with multiple measurements using objectives derived based on variational inference. Extensive experiments across three image restoration tasks, e.g., Gaussian deblur, 4x super-resolution, and box inpainting with two benchmark datasets, demonstrate our superior performance over strong diffusion model-based methods.


# 341
Strong Double Blind
Imaging with Confidence: Uncertainty Quantification for High-dimensional Undersampled MR Images

Frederik Hoppe · Claudio Mayrink Verdun · Hannah Sophie Laus · Sebastian Endt · Marion Irene Menzel · Felix Krahmer · Holger Rauhut

Establishing certified uncertainty quantification (UQ) in image processing applications continues to pose a significant challenge. In particular, such a goal is crucial for accurate and reliable medical imaging if one aims for precise diagnostics and appropriate intervention. In the case of magnetic resonance imaging, one of the essential tools of modern medicine, enormous advancements in fast image acquisition became possible after the introduction of compressive sensing and, more recently, deep learning methods. Still, as of now, there is no UQ method that is both fully rigorous and scalable. This work takes a step towards closing this gap by proposing a total variation minimization-based method for pixel-wise sharp confidence intervals for undersampled MRI. We demonstrate that our method empirically achieves the predicted confidence levels. We expect that our approach will also have implications for other imaging modalities as well as deep learning applications in computer vision.


# 340
Strong Double Blind
Energy-induced Explicit quantification for Multi-modality MRI fusion

Xiaoming Qi · Yuan Zhang · Tong Wang · Guanyu Yang · Yueming Jin · Shuo Li

Multi-modality magnetic resonance imaging (MRI) is crucial for accurate disease diagnosis and surgical planning, relying on the comprehensive analysis of multi-modality information fusion. This fusion is characterized by unique patterns of information aggregation for each disease across modalities, influenced by distinct inter-dependencies and shifts in information flow. Existing fusion methods implicitly identify distinct aggregation patterns for various tasks, indicating the potential for developing a unified and explicit aggregation pattern. In this study, we propose a novel aggregation pattern, Energy-induced Explicit Propagation and Alignment (E2PA), to explicitly quantify and optimize the properties of multi-modality MRI fusion to adapt to different scenarios. In E2PA, (1) an energy-guided hierarchical fusion (EHF) quantifies and optimizes the propagation of inter-dependencies among modalities by leveraging hierarchical energy consistency across patients; (2) an energy-regularized space alignment (ESA) measures the consistency of information flow in multi-modality aggregation through alignment based on space factorization and energy minimization. Through extensive experiments on three public multi-modality MRI datasets (with different modality combinations and tasks), the superiority of E2PA is demonstrated by comparison with state-of-the-art methods.


# 210
Strong Double Blind
WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model

Haisheng Fu · Jie Liang · Zhenman Fang · Jingning Han · Feng Liang · Guohe Zhang

Recently, learned image compression (LIC) has achieved great progress and even outperformed traditional approaches. However, LIC mainly reduces spatial redundancy in the autoencoder networks and entropy coding, and has not fully removed the frequency-domain correlation explicitly via a linear transform (such as the DCT or wavelet transform), which is the cornerstone of traditional methods. To address this critical limitation, in this paper we propose a surprisingly simple but efficient framework, which introduces the discrete wavelet transform (DWT) to both the convolution layers and the entropy coding of LIC. First, in both the core and hyperprior autoencoder networks, we propose a Wavelet-domain Convolution (WeConv) module at selected layers to reduce the frequency-domain correlation explicitly and make the signal sparser. Experimental results show that by using the simplest Haar wavelet transform, WeConv can already achieve a 0.2-0.25 dB gain in rate-distortion (R-D) performance with negligible change in model size and running time. We also perform entropy coding and quantization in the wavelet domain, and propose a Wavelet-domain Channel-wise Auto-Regressive entropy Model (WeChARM), where the latent representations are quantized and entropy coded in the wavelet domain instead of the spatial domain. Moreover, the entropy coding is split into two steps. We first encode and decode all the low-frequency wavelet transform coefficients, and then use them as prior information to encode and decode the high-frequency coefficients. Channel-wise entropy coding is further used in each step. WeChARM can further improve the R-D performance by 0.25-0.3 dB, with a moderate increase in model size and running time. By combining WeConv and WeChARM, the proposed WeConvene scheme achieves superior R-D performance compared to other state-of-the-art LIC methods as well as the latest H.266/VVC. In particular, it achieves BD-rate gains of 9.11\%, 9.46\%, and 9.20\% over H.266/VVC on the Kodak, Tecnick, and CLIC datasets, respectively. Better performance can be achieved by using more advanced wavelet transforms. The proposed convolution-based system is also easier to train and has lower GPU requirements than transformer-based schemes.
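A single-level 2D Haar DWT of the kind a WeConv-style module would insert before convolution can be written as four fixed stride-2 depthwise filters; the sketch below shows only the transform itself, not the full WeConv module or entropy model, and the tensor layout is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def haar_dwt2(x: torch.Tensor) -> torch.Tensor:
    """Single-level 2D Haar DWT applied depthwise to (B, C, H, W) features.

    Returns (B, 4*C, H/2, W/2); for each input channel, its LL, LH, HL, HH
    subbands appear consecutively along the channel axis.
    """
    B, C, H, W = x.shape
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)     # (4, 1, 2, 2)
    kernels = kernels.repeat(C, 1, 1, 1).to(x.dtype)         # depthwise: 4 filters per channel
    return F.conv2d(x, kernels, stride=2, groups=C)

subbands = haar_dwt2(torch.randn(1, 8, 64, 64))
print(subbands.shape)   # torch.Size([1, 32, 32, 32])
```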


# 126
Strong Double Blind
Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

Rining Wu · Feixiang Zhou · Ziwei Yin · Jian Liu

Our brains represent the ever-changing environment with neurons in a highly dynamic fashion. The temporal features of visual pixels in dynamic natural scenes are entangled in the retinal neuronal coding patterns, where effectively establishing their intrinsic temporal relationships is crucial. Recent foundation vision models have paved an advanced way of understanding image pixels. Yet, how neuronal coding in the brain aligns with these pixel representations remains poorly understood. Most previous studies employ static images, or artificial videos derived from static images to emulate more realistic and complicated stimuli. Although these simple scenarios effectively help to separate key factors influencing visual coding, complex temporal relationships receive no consideration. To decompose the temporal features of visual coding in natural scenes, here we propose Vi-ST, a spatiotemporal convolutional neural network fed with a self-supervised Vision Transformer (ViT) prior, aimed at unraveling the temporal-based encoding patterns of retinal neuronal populations. The model demonstrates robust predictive performance in generalisation tests. Additionally, through detailed ablation experiments, we demonstrate the significance of each temporal module. Furthermore, we introduce a visual coding evaluation metric designed to integrate temporal considerations and compare the impact of different numbers of neuronal populations on complementary coding. In conclusion, our proposed Vi-ST demonstrates a novel modelling framework for neuronal coding of dynamic visual scenes in the brain, effectively aligning our brain representation of video with neuronal activity.


# 264
Strong Double Blind
GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields

Xiufeng HUANG · Ka Chun Cheung · Simon See · Renjie Wan

Remarkable advancements in the recolorization of Neural Radiance Fields (NeRF) have simplified the process of modifying NeRF's color attributes. Yet, with the potential of NeRF to serve as shareable digital assets, there's a concern that malicious users might alter the color of NeRF models and falsely claim the recolorized version as their own. To safeguard against such breaches of ownership, enabling original NeRF creators to establish rights over recolorized NeRF is crucial. While approaches like CopyRNeRF have been introduced to embed binary messages into NeRF models as digital signatures for copyright protection, the process of recolorization can remove these binary messages. In our paper, we present GeometrySticker, a method for seamlessly integrating binary messages into the geometry components of radiance fields, akin to applying a sticker. GeometrySticker can embed binary messages into NeRF models while preserving the effectiveness of these messages against recolorization. Our comprehensive studies demonstrate that GeometrySticker is adaptable to prevalent NeRF architectures and maintains a commendable level of robustness against various distortions. We will release the codes once the paper is accepted.


# 223
Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification

Hai Ci · Pei Yang · Yiren Song · Mike Zheng Shou

We revisit Tree-Ring Watermarking, a recent diffusion model watermarking method that demonstrates great robustness to various attacks. We conduct an in-depth study of its framework and reveal that the distribution shift unintentionally introduced by the watermarking process, apart from watermark pattern matching, contributes to its exceptional robustness. Our investigation further exposes inherent flaws in the original design, particularly in its ability to identify multiple distinct watermark keys, where distribution shift offers no advantage. Based on the preceding analysis, we propose RingID. It consists of a novel multi-channel heterogeneous watermarking approach designed to seamlessly amalgamate distinctive advantages from diverse watermarks. Moreover, coupled with a series of suggested enhancements, RingID exhibits substantial advancements in both robustness and capacity for multi-key identification.
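For readers unfamiliar with the underlying scheme, Tree-Ring-style watermarking plants a key in a ring-shaped region of the Fourier spectrum of the initial noise before sampling; below is a hedged, simplified sketch of that embedding step (single channel, fixed radii, constant key), not RingID's multi-channel heterogeneous design.

```python
import torch

def embed_ring_key(noise: torch.Tensor, key_value: complex = 2.0 + 0.0j,
                   r_in: int = 8, r_out: int = 12) -> torch.Tensor:
    """Write a constant key into an annulus of the (centred) Fourier spectrum of initial noise.

    noise: (H, W) Gaussian latent noise for one channel. A detector would later invert
    the generated image back to noise, FFT it, and compare the annulus against the key.
    """
    H, W = noise.shape
    spec = torch.fft.fftshift(torch.fft.fft2(noise))
    yy, xx = torch.meshgrid(torch.arange(H) - H // 2, torch.arange(W) - W // 2, indexing="ij")
    ring = (yy ** 2 + xx ** 2 >= r_in ** 2) & (yy ** 2 + xx ** 2 <= r_out ** 2)
    spec[ring] = torch.tensor(key_value, dtype=spec.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(spec)).real

marked = embed_ring_key(torch.randn(64, 64))
print(marked.shape)
```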


# 225
Strong Double Blind
Enhancing Tampered Text Detection through Frequency Feature Fusion and Decomposition

Zhongxi Chen · Shen Chen · Taiping Yao · Ke Sun · Shouhong Ding · Xianming Lin · liujuan cao · Rongrong Ji

Document image tampering poses a grave risk to the veracity of information, with potential consequences ranging from misinformation dissemination to financial and identity fraud. While current detection methods utilize frequency information to uncover tampering invisible to the naked eye, they often fall short in precisely integrating this information and enhancing the high-frequency components vital for detecting subtle tampering. Addressing these gaps, we introduce the Feature Fusion and Decomposition Network (FFDN), a novel approach for Document Image Tampering Detection (DITD). Our method synergizes a Visual Enhancement Module (VEM) with a Wavelet-like Frequency Enhancement (WFE) module to improve the detection of subtle tampering traces. Specifically, the VEM enhances the detection of subtle tampering traces while maintaining the integrity of the original RGB detection capabilities, and the WFE further decomposes features into high-frequency and low-frequency components, placing emphasis on minuscule yet critical tampering details. Rigorous testing on the DocTamper dataset confirms FFDN's preeminence, significantly outperforming existing state-of-the-art methods in detecting tampering.


# 224
Strong Double Blind
T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

Zhongqi Wang · Jie Zhang · Shiguang Shan · Xilin CHEN

While text-to-image diffusion models demonstrate impressive generation capabilities, they also exhibit vulnerability to backdoor attacks, which involve the manipulation of model outputs through malicious triggers. In this paper, for the first time, we propose a comprehensive defense method named T2IShield to detect, localize, and mitigate such attacks. Specifically, we find an "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger. Based on this key insight, we propose two effective backdoor detection methods: Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis. Besides, we introduce a binary-search approach to localize the trigger within a backdoor sample and assess the efficacy of existing concept editing methods in mitigating backdoor attacks. Empirical evaluations on two advanced backdoor attack scenarios show the effectiveness of our proposed defense method. For backdoor sample detection, T2IShield achieves a detection F1 score of 91.3% with low computational cost. Furthermore, T2IShield achieves a localization F1 score of 86.4% and invalidates 99% of poisoned samples. Code will be made public soon.
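The first detection rule named above, Frobenius Norm Threshold Truncation, can be pictured as computing a Frobenius-norm statistic over the cross-attention maps and flagging samples whose statistic falls past a threshold. The aggregation, the direction of the comparison, and the threshold in the sketch below are placeholder assumptions rather than the paper's exact rule.

```python
import torch

def frobenius_flag(attn_maps: torch.Tensor, threshold: float) -> torch.Tensor:
    """Flag samples whose token-wise attention maps have collapsed toward one another.

    attn_maps: (B, T, H, W) cross-attention maps over T text tokens. Under the
    "Assimilation Phenomenon", a backdoor trigger makes the per-token maps look
    alike, which shows up as a small Frobenius norm of their deviation from the mean.
    """
    mean_map = attn_maps.mean(dim=1)                          # aggregate over tokens
    residual = attn_maps - mean_map.unsqueeze(1)              # per-token deviation
    norms = torch.linalg.norm(residual.flatten(1), dim=1)     # Frobenius norm per sample
    return norms < threshold                                  # collapsed maps -> likely backdoored

maps = torch.rand(4, 77, 16, 16)
print(frobenius_flag(maps, threshold=10.0))
```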


# 221
Strong Double Blind
Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing

Guanghao Zheng · Yuchen Liu · Wenrui Dai · Chenglin Li · Junni Zou · Hongkai Xiong

The effectiveness of Vision Transformers (ViTs) diminishes considerably in multi-modal face anti-spoofing (FAS) under missing-modality scenarios. Existing approaches rely on modality-invariant features to alleviate this issue but ignore modality-specific features. To solve this issue, we propose a Missing Modality Adapter framework for Face Anti-Spoofing (MMA-FAS), which leverages modality-disentangle adapters and an LBP-guided contrastive loss for the explicit combination of modality-invariant and modality-specific features. Modality-disentangle adapters disentangle features into modality-invariant and -specific features from the view of frequency decomposition. The LBP-guided contrastive loss, together with batch-level and sample-level modality masking strategies, forces the model to cluster samples according to attack types and modal combinations, which further enhances modality-invariant and modality-specific features. Moreover, we propose an adaptive modal combination sampling strategy, which dynamically adjusts the sample probability in the masking strategies to balance the training process of different modal combinations. Extensive experiments demonstrate that our proposed method achieves state-of-the-art intra-dataset and cross-dataset performance in all the missing-modality scenarios.


# 222
Strong Double Blind
Personalized Privacy Protection Mask Against Unauthorized Facial Recognition

Ka Ho Chow · Sihao Hu · Tiansheng Huang · Ling Liu

Face recognition (FR) can be misused for privacy intrusion. Governments, private companies, or even individual attackers can collect facial images by web scraping to build an FR system identifying human faces without their consent. This paper introduces Chameleon, which learns to generate a user-centric personalized privacy protection mask, coined as P3-Mask, to protect facial images against unauthorized FR with three salient features. First, we use a cross-image optimization to generate one P3-Mask for each user instead of tailoring facial perturbation for each facial image of a user. It enables efficient and instant protection even for users with limited computing resources. Second, we incorporate a perceptibility optimization to preserve the visual quality of the protected facial images. Third, we strengthen the robustness of P3-Mask against unknown FR models by integrating focal diversity-optimized ensemble learning into the mask generation process. Extensive experiments on two benchmark datasets show that Chameleon outperforms three state-of-the-art methods with instant protection and minimal degradation of image quality. Furthermore, Chameleon enables cost-effective FR authorization using the P3-Mask as a personalized de-obfuscation key, and it demonstrates high resilience against adaptive adversaries.
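
The cross-image idea, one shared mask optimized over all of a user's photos, can be sketched as below; `fr_model` (any differentiable face-embedding network) and the simple cosine-similarity objective are assumptions for illustration, not Chameleon's actual loss.

```python
# Minimal sketch of cross-image mask optimization (assumptions: fr_model is a
# differentiable face-embedding network; user_images is a batch of one user's faces).
import torch

def learn_p3_mask(fr_model, user_images, eps=8 / 255, steps=200, lr=0.01):
    mask = torch.zeros_like(user_images[:1], requires_grad=True)  # one shared mask
    opt = torch.optim.Adam([mask], lr=lr)
    with torch.no_grad():
        clean_emb = fr_model(user_images)          # embeddings to push away from
    for _ in range(steps):
        protected = (user_images + mask).clamp(0, 1)
        emb = fr_model(protected)
        # minimize similarity to the user's clean embeddings across ALL images at once
        loss = torch.nn.functional.cosine_similarity(emb, clean_emb).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            mask.clamp_(-eps, eps)                 # keep the mask imperceptible
    return mask.detach()
```

The perceptibility and ensemble-robustness terms mentioned in the abstract would be added to this loss; they are omitted here.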


# 104
Strong Double Blind
GRAPE: Generalizable and Robust Multi-view Facial Capture

Jing Li · Di Kang · Zhenyu He

Deep learning-based multi-view facial capture methods have shown impressive accuracy while being several orders of magnitude faster than a traditional mesh registration pipeline. However, the existing systems (e.g. TEMPEH) are strictly restricted to inference on the data captured by the same camera array that is used to capture their training data. In this study, we aim to improve the generalization ability so that a trained model can be readily used for inference (i.e. capture new data) on a different camera array. To this end, we propose a more generalizable initialization module to extract camera array-agnostic 3D features, including a visual hull-based head localization and a visibility-aware 3D feature aggregation module enabled by the visual hull. In addition, we propose an "update-by-disagreement" learning strategy to better handle data noise (e.g. inaccurate registration, scan noise) by discarding potentially inaccurate supervision signals during training. The resultant generalizable and robust topologically consistent multi-view facial capture system (GRAPE) can be readily used to capture data on a different camera array, greatly reducing the effort required for data collection and processing. Experiments on the FaMoS and FaceScape datasets demonstrate the effectiveness of the proposed method.


# 198
Strong Double Blind
Seeing Faces in Things: A Model and Dataset for Pareidolia

Mark T Hamilton · Simon Stent · Vasha G DuTell · Anne Harrington · Jennifer E Corbett · Ruth Rosenholtz · William Freeman

The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. "Face pareidolia" describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective. We present an image dataset of "Faces in Things", consisting of five thousand images from the web with human-annotated pareidolic faces. Using this dataset, we examine the extent to which a state-of-the-art human face detector exhibits pareidolia, and find a significant behavioral gap between humans and machines. We explore a variety of different strategies to close this gap and discover that the evolutionary need for humans to detect animal faces, as well as human faces, explains some of this gap. Finally, we propose a simple statistical model of pareidolia in images. Through studies on human subjects and our pareidolic face detectors we confirm a key prediction of our model regarding what image conditions are most likely to induce pareidolia.


# 84
Strong Double Blind
Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation

Linlong Fan · Ye Huang · Yanqi Ge · Wen Li · Lixin Duan

Existing view-based methods excel at recognizing 3D objects from predefined viewpoints, but their exploration of recognition under arbitrary views is limited. This is a challenging and realistic setting because each object has different viewpoint positions and quantities, and their poses are not aligned. However, most view-based methods, which aggregate multiple view features to obtain a global feature representation, struggle to address 3D object recognition under arbitrary views. Due to the unaligned inputs from arbitrary views, it is challenging to robustly aggregate features, leading to performance degradation. In this paper, we introduce a novel Part-aware Network (PANet), which uses a part-based representation, to address these issues. This part-based representation aims to localize and understand different parts of 3D objects, such as airplane wings and tails. It has properties such as viewpoint invariance and rotation robustness, which give it an advantage in addressing the 3D object recognition problem under arbitrary views. Our results on benchmark datasets clearly demonstrate that our proposed method outperforms existing view-based aggregation baselines for the task of 3D object recognition under arbitrary views, even surpassing most fixed viewpoint methods.


# 331
Strong Double Blind
An Optimal Control View of LoRA and Binary Controller Design for Vision Transformers

CHI Zhang · Jingpu Cheng · Qianxiao Li

While recent advancements in model fine-tuning predominantly emphasize the utilization of low-rank adaptation (LoRA), we propose an alternative approach centered on reducing the precision of adaptation matrices. In particular, we depart from the common viewpoint that considers adaptation matrices solely as weight differences, and reinterpret them as "control variables'' to perturb pre-trained ViT systems. This new perspective enables the establishment of a control-oriented framework, facilitating the exploration of optimal controls guided by the Pontryagin Maximum Principle. Furthermore, we demonstrate that for bounded control sets such as hypercubes, the optimal controls often take on boundary values, leading naturally to a binary controller design. Theoretical analysis reveals that employing a binary control strategy achieves the same reachable state as its full-precision counterpart in the continuous idealisation of deep residual structures, a finding corroborated by later empirical investigations. Our studies further indicate that the controller's rank holds greater significance than its precision. As such, opting for low-precision yet high-rank controls is demonstrated to obtain better performance for practical vision tasks.
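
One way to picture the binary controller is a LoRA-style adapter whose factors are binarized (sign times a learned scale) in the forward pass with a straight-through estimator; this parameterization is an illustrative assumption, not necessarily the authors' exact design.

```python
# Sketch: a low-rank adapter with boundary-valued ("binary") controls; gradients
# flow through a straight-through estimator. Names and scales are assumptions.
import torch
import torch.nn as nn

class BinaryLowRankAdapter(nn.Module):
    def __init__(self, dim_in, dim_out, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim_in) * 0.02)
        self.B = nn.Parameter(torch.randn(dim_out, rank) * 0.02)
        self.scale = nn.Parameter(torch.tensor(1e-3))

    def forward(self, x):
        # binary values in the forward pass, full-precision gradients in the backward pass
        A_bin = self.A + (torch.sign(self.A) - self.A).detach()
        B_bin = self.B + (torch.sign(self.B) - self.B).detach()
        delta_w = self.scale * (B_bin @ A_bin)     # boundary-valued "control" perturbation
        return x @ delta_w.t()

print(BinaryLowRankAdapter(768, 768)(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```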


# 103
Strong Double Blind
OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers

Qitai Wang · Jiawei He · Yuntao Chen · Zhaoxiang Zhang

Existing end-to-end trackers for vision-based 3D perception suffer from performance degradation due to the conflict between detection and tracking tasks. In this work, we get to the bottom of this conflict, which was previously attributed only vaguely to incompatible task-specific object features. We find that the conflict between the two tasks lies in their partially conflicted classification gradients, which stems from their subtle difference in positive sample assignments. Based on this observation, we propose to coordinate those conflicted gradients by accurately identifying object queries with contradicted positivity in the two tasks. We also dynamically mask all attention between contradicted object queries and modify the tracking classification loss to suppress inaccurate predictions. To this end, we propose OneTrack, the first one-stage joint detection and tracking model that bridges the gap between detection and tracking under a unified object feature representation. On the nuScenes camera-based object tracking benchmark, OneTrack outperforms previous works by 6.9% AMOTA on the validation set and by 3.3% AMOTA on the test set. The code will be released.


# 110
DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

Narek Tumanyan · Assaf Singer · Shai Bagon · Tali Dekel

We present DINO-Tracker -- a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adapts DINO's features to fit the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses, and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-Tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.


# 121
Strong Double Blind
Upper-body Hierarchical Graph for Skeleton Based Emotion Recognition in Assistive Driving

Jiehui Wu · Jiansheng Chen · Qifeng Luo · Siqi Liu · Youze Xue · Huimin Ma

Emotion recognition plays a crucial role in enhancing the safety and enjoyment of assisted driving experiences. By enabling intelligent systems to perceive and understand human emotions, we can significantly improve human-machine interactions. Current research in emotion recognition emphasizes facial expressions, speech and physiological signals, often overlooking body movement's expressive potential. Most existing methods, reliant on full-body poses and graph convolutional networks with predetermined adjacency matrices, face challenges in driving scenarios, including limited visibility, restricted movement and imbalanced data distribution, which affect model generalization and accuracy. To overcome these limitations, we introduce an innovative emotion recognition method tailored for assisted driving. Our method leverages upper-body skeleton sequences, overcoming the constraints of full-body pose capture in driving contexts. Our architecture employs an upper-body hierarchical graph (UbH-Graph) to dynamically capture upper-body movement and emotional state relationships. We uniquely incorporate class-specific variations during training, balancing feature distribution and enhancing emotion recognition. Our method outperforms existing multimodal approaches on the assistive driving dataset and demonstrates robustness and adaptability on the daily action dataset.


# 135
Strong Double Blind
SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders

Sheng-Wei Li · Zi-Xiang Wei · Wei-Jie Jack Chen · Yi-Hsin Yu · Chih-Yuan Yang · Jane Yung-jen Hsu

Existing zero-shot skeleton-based action recognition methods utilize projection networks to learn a shared latent space of skeleton features and semantic embeddings. The inherent imbalance in action recognition datasets, characterized by variable skeleton sequences yet constant class labels, presents significant challenges for alignment. To address the imbalance, we propose SA-DVAE---Semantic Alignment via Disentangled Variational Autoencoders, a method that first adopts feature disentanglement to separate skeleton features into two independent parts---one semantic-related and the other irrelevant---to better align skeleton and semantic features. We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correlation penalty. We conduct experiments on three benchmark datasets: NTU RGB+D, NTU RGB+D 120 and PKU-MMD, and our experimental results show that SA-DVAE produces improved performance over existing methods.


# 128
Strong Double Blind
Context-Aware Action Recognition: Introducing a Comprehensive Dataset for Behavior Contrast

Tatsuya Sasaki · Yoshiki Ito · Satoshi Kondo

While datasets on everyday actions, sports, and cooking are abundant, there's a significant scarcity in datasets focused on industrial domain activities, especially for distinguishing between proper and improper actions. This shortage poses a unique challenge, necessitating highly precise, context-sensitive feature extraction due to the subtle class distinctions, which are more nuanced than in general action recognition. To address this gap, we introduce a dataset featuring contrasting pairs of proper and improper actions, aimed at exploring these specific challenges, assessing the limitations of current methods, and establishing a new standard. Our dataset not only encompasses traditional industrial tasks, such as working at heights, but also extends to everyday situations like basketball, underscoring the task's broad relevance. By evaluating leading techniques on this dataset, we aim to unearth valuable insights, pushing the boundaries of action understanding in both industrial and everyday contexts.


# 136
Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition

Muhammad Adi Nugroho · Sangmin Woo · Sumin Lee · Jinyoung Park · Yooseung Wang · Donguk Kim · Changick Kim

Weakly-Supervised Group Activity Recognition (WSGAR) aims to understand the activity performed together by a group of individuals with the video-level label and without actor-level labels. We propose Flow-Assisted Motion Learning Network (Flaming-Net) for WSGAR, which consists of the motion-aware actor encoder to extract actor features and the two-pathways relation module to infer the interaction among actors and their activity. Flaming-Net leverages an additional optical flow modality in the training stage to enhance its motion awareness when finding locally active actors. The first pathway of the relation module, the actor-centric path, initially captures the temporal dynamics of individual actors and then constructs inter-actor relationships. In parallel, the group-centric path starts by building spatial connections between actors within the same timeframe and then captures simultaneous spatio-temporal dynamics among them. We demonstrate that Flaming-Net achieves new state-of-the-art WSGAR results on two benchmarks, including a 2.8%p higher MPCA score on the NBA dataset. Importantly, we use the optical flow modality only for training and not for inference.


# 130
Strong Double Blind
Semi-Supervised Teacher-Reference-Student Architecture for Action Quality Assessment

Wulian Yun · Mengshi Qi · Fei Peng · Huadong Ma

Existing action quality assessment (AQA) methods often require a large number of label annotations for fully supervised learning, which are laborious and expensive. In practice, the labeled data are difficult to obtain because the AQA annotation process requires domain-specific expertise. In this paper, we propose a novel semi-supervised method, which enables better assessment on the AQA task by exploiting a large amount of unlabeled data and a small portion of labeled data. Differing from the traditional teacher-student network, we propose a teacher-reference-student architecture to learn from both unlabeled and labeled data, where the teacher network and the reference network are used to generate pseudo-labels for unlabeled data to supervise the student network. Specifically, the teacher predicts pseudo-labels by capturing high-level features of unlabeled data. The reference network provides more adequate supervision of the student network by referring to additional action information. Moreover, we introduce a confidence memory to improve the reliability of pseudo-labels by storing the most accurate outputs produced so far by the teacher and reference networks. To validate our method, we conduct extensive experiments on three AQA benchmark datasets. Experimental results show that our method achieves significant improvements and outperforms existing semi-supervised AQA methods. We will release our code.


# 152
Strong Double Blind
Classification Matters: Improving Video Action Detection with Class-Specific Attention

Jinsung Lee · Taeoh Kim · Inwoong Lee · Minho Shim · Dongyoon Wee · Minsu Cho · Suha Kwak

Video action detection (VAD) aims to detect actors and classify their actions in a video. We find that VAD suffers more from classification than from localization of actors. Hence, we analyze how prevailing methods form features for classification and find that they prioritize actor regions for classification, yet often overlook the essential contextual information necessary for accurate classification. Accordingly, we propose to reduce the model's bias toward the actor itself and encourage it to pay attention to the context that is relevant to each action class. By assigning a class-dedicated query to each action class, the model can dynamically determine where to focus for effective classification. The proposed method demonstrates superior performance on three challenging benchmarks while using significantly fewer parameters and less computation.
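
A minimal sketch of the class-dedicated-query idea follows; the module names and the single cross-attention layer are assumptions chosen for brevity, not the paper's full architecture.

```python
# Sketch: one learnable query per action class cross-attends to video tokens, so
# each class can gather its own relevant context before classification.
import torch
import torch.nn as nn

class ClassQueryHead(nn.Module):
    def __init__(self, num_classes, dim):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, 1)        # one score per class query

    def forward(self, context):                    # context: (B, N, dim) video tokens
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        attended, _ = self.attn(q, context, context)
        return self.classifier(attended).squeeze(-1)  # (B, num_classes) logits

logits = ClassQueryHead(num_classes=80, dim=256)(torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 80])
```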


# 141
Strong Double Blind
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization

Sakib Reza · Yuexi Zhang · Mohsen Moghaddam · Octavia Camps

Online video understanding often relies on individual frames, leading to frame-by-frame predictions. Recent advancements such as Online Temporal Action Localization (OnTAL) extend this approach to instance-level predictions. However, existing methods mainly focus on short-term context, neglecting historical information. To address this, we introduce the History-Augmented Anchor Transformer (HAT) framework for OnTAL. By integrating historical context, our framework enhances the synergy between long-term and short-term information, improving the quality of anchor features crucial for classification and localization. We evaluate our model on both procedural egocentric (PREGO) datasets (EGTEA and EPIC) and standard non-PREGO OnTAL datasets (THUMOS and MUSES). Results show that our model outperforms state-of-the-art approaches significantly on PREGO datasets and achieves comparable or slightly superior performance on non-PREGO datasets, underscoring the importance of leveraging long-term history, especially in procedural and egocentric action scenarios. Code is available at: https://github.com/sakibreza/ECCV24-HAT/.


# 113
Appearance-based Refinement for Object-Centric Motion Segmentation

Junyu Xie · Weidi Xie · Andrew ZISSERMAN

The goal of this paper is to discover, segment, and track independently moving objects in complex visual scenes. Previous approaches have explored the use of optical flow for motion segmentation, leading to imperfect predictions due to partial motion, background distraction, and object articulations and interactions. To address this issue, we introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars, and an object-centric architecture that refines problematic masks based on exemplar information. The model is pre-trained on synthetic data and then adapted to real-world videos in a self-supervised manner, eliminating the need for human annotations. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTubeVOS, SegTrackv2, and FBMS-59. We achieve competitive performance on single-object segmentation, while significantly outperforming existing models on the more challenging problem of multi-object segmentation. Finally, we investigate the benefits of using our model as a prompt for the per-frame Segment Anything Model.


# 112
Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Shuangrui Ding · Rui Qian · Haohang Xu · Dahua Lin · Hongkai Xiong

In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Previous self-supervised VOS techniques mainly resort to auxiliary modalities or utilize iterative slot attention to assist in object discovery, which restricts their general applicability. To deal with these challenges, we develop a simplified architecture that capitalizes on the emerging objectness from DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Specifically, we first introduce a single spatio-temporal Transformer block to process the frame-wise DINO features and establish spatio-temporal dependencies in the form of self-attention. Subsequently, utilizing these attention maps, we implement hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method demonstrates state-of-the-art performance across three multi-object video segmentation tasks. Specifically, we achieve over 5 points of improvement in terms of FG-ARI on complex real-world DAVIS-17-Unsupervised and YouTube-VIS-19 compared to the previous best result.
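
The "simple clustering on the correspondence cue" step can be pictured with the rough sketch below, which clusters normalized patch features across frames; the agglomerative-clustering choice and the toy shapes are assumptions, not the paper's exact procedure.

```python
# Rough sketch: hierarchically cluster DINO-style patch features across frames and
# reshape the cluster ids into per-frame label maps (assumed stand-in procedure).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_tokens(tokens, n_objects=3, h=14, w=14):
    """tokens: (T, h*w, d) patch features for T frames."""
    T, N, d = tokens.shape
    flat = tokens.reshape(T * N, d)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    # clustering all frames jointly lets the same cluster id follow an object over time
    labels = AgglomerativeClustering(
        n_clusters=n_objects, metric="cosine", linkage="average"
    ).fit_predict(flat)
    return labels.reshape(T, h, w)                 # per-frame label maps

masks = cluster_tokens(np.random.randn(4, 196, 64))
print(masks.shape)  # (4, 14, 14)
```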


# 116
Strong Double Blind
Fine-grained Dynamic Network for Generic Event Boundary Detection

Ziwei Zheng · Lijun He · Le Yang · Fan Li

Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, playing a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect various boundaries by the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more intelligent and reasonable way is to adaptively detect boundaries by considering their special properties. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. Besides, a multi-order difference detector is also proposed to ensure generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks, leading to obvious improvements in both performance and efficiency compared to the current state-of-the-art. The code is available at https://github.com/anonymous.


# 254
Strong Double Blind
Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model

Shoma Iwai · Atsuki Osanai · Shunsuke Kitada · Shinichiro Omachi

Layout generation is a task to synthesize a harmonious layout with elements characterized by attributes such as category, position, and size. Human designers experiment with the placement and modification of elements to create aesthetic layouts; however, we observed that current discrete diffusion models (DDMs) struggle to correct inharmonious layouts after they have been generated. In this paper, we first provide novel insights into the layout sticking phenomenon in DDMs and then propose a simple yet effective layout-assessment module, Layout-Corrector, which works in conjunction with existing DDMs to address the layout sticking problem. We present a learning-based module capable of identifying inharmonious elements within layouts, considering overall layout harmony characterized by complex composition. During the generation process, Layout-Corrector evaluates the correctness of each token in the generated layout, reinitializing those with low scores to the ungenerated state. The DDM then uses the high-scored tokens as clues to regenerate the harmonized tokens. Layout-Corrector, tested on common benchmarks, consistently boosts layout-generation performance when used in conjunction with various state-of-the-art DDMs. Furthermore, our extensive analysis demonstrates that the Layout-Corrector (1) successfully identifies erroneous tokens, (2) facilitates control over the fidelity-diversity trade-off, and (3) significantly mitigates the performance drop associated with fast sampling.
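
The token-reinitialization step described above can be sketched as follows; `MASK_ID`, the threshold, and `score_fn` are placeholders, and the surrounding sampling loop is only indicated in comments.

```python
# Illustrative corrector step for a discrete diffusion sampler: score every layout
# token and reset low-confidence ones to the ungenerated [MASK] state.
import torch

MASK_ID = 0  # assumed id of the "ungenerated" token state

def corrector_step(tokens, score_fn, threshold=0.3):
    """tokens: (B, L) discrete layout tokens; score_fn returns (B, L) scores in [0, 1]."""
    scores = score_fn(tokens)
    corrected = tokens.clone()
    corrected[scores < threshold] = MASK_ID        # re-open low-scoring positions
    return corrected

# Usage inside a (hypothetical) discrete-diffusion sampling loop:
# for t in reversed(range(T)):
#     tokens = ddm_step(tokens, t)                 # one denoising step
#     tokens = corrector_step(tokens, score_fn)    # let the DDM regenerate masked slots
```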


# 140
Strong Double Blind
Self-supervised visual learning from interactions with objects

Arthur Aubret · Céline Teulière · Jochen Triesch

Self-supervised learning (SSL) has revolutionized visual representation learning, but has not achieved the robustness of human vision. A reason for this could be that SSL does not leverage all the data available to humans during learning. When learning about an object, humans often purposefully turn or move around objects, and research suggests that these interactions can substantially enhance their learning. Here we explore whether such object-related actions can boost SSL. For this, we extract the actions performed to change from one ego-centric view of an object to another in four video datasets. We then introduce a new loss function to learn visual and action embeddings by aligning the performed action with the representations of two images extracted from the same clip. This permits the performed actions to structure the latent visual representation. Our experiments show that our method outperforms previous methods on downstream category recognition. In contrast to previous findings, our analysis suggests that the exact trade-off between viewpoint sensitivity and invariance is of modest importance here. We rather find that the observed improvement is associated with a better viewpoint-wise alignment of different objects from the same category. Overall, our work demonstrates that embodied interactions with objects can improve SSL of object categories.
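
One plausible reading of "aligning the performed action with the representations of two images" is an InfoNCE-style loss that matches the change between the two view embeddings to the embedding of the action that caused it; the formulation below is a hedged sketch, not the paper's exact loss.

```python
# Sketch: contrast each clip's view change (z2 - z1) against the batch of action
# embeddings, so the correct action is pulled toward its own clip's change.
import torch
import torch.nn.functional as F

def action_alignment_loss(z1, z2, a, temperature=0.1):
    """z1, z2: (B, D) embeddings of two views from the same clip; a: (B, D) action embeddings."""
    rel = F.normalize(z2 - z1, dim=1)              # how the view changed
    act = F.normalize(a, dim=1)
    logits = rel @ act.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)        # match each change to its own action

loss = action_alignment_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```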


# 143
Strong Double Blind
Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning

Cong Wu · Xiao-Jun Wu · Linze Li · Tianyang Xu · Zhenhua Feng · Josef Kittler

The integration with CLIP (Contrastive Vision-Language Pre-training) has significantly refreshed the accuracy leaderboard of FSAR (Few-shot Action Recognition). However, the training overhead of ensuring domain alignment between CLIP and FSAR is often unbearable. To mitigate this issue, we present an Efficient Multi-Level Post-Reasoning Network, namely EMP-Net. By design, a post-reasoning mechanism is proposed for domain adaptation, which avoids most gradient backpropagation and improves efficiency; meanwhile, a multi-level representation is utilised during the reasoning and matching processes to improve discriminability and ensure effectiveness. Specifically, the proposed EMP-Net starts with a skip-fusion involving cached multi-stage features extracted by CLIP. After that, the current features are decoupled into multi-level representations, including global-level, patch-level, and frame-level. The ensuing spatiotemporal reasoning module operates on these multi-level representations to generate discriminative features. As for matching, the multi-level contrasts between text-visual and support-query are integrated to provide comprehensive guidance. The experimental results demonstrate that EMP-Net can unlock the potential performance of CLIP in a more efficient manner. Please find our code in the supplementary materials.


# 144
Strong Double Blind
Sequential Representation Learning via Static-Dynamic Conditional Disentanglement

Mathieu Simon · Pascal Frossard · Christophe De Vleeschouwer

This paper explores self-supervised disentangled representation learning within sequential data, focusing on untangling time-independent and time-varying factors in videos. We propose a new model that explicitly accounts for the causal relationship between the static and dynamic variables and improves model expressivity through additional Normalizing Flows. A formal definition of the factors is proposed. This formalism leads to the derivation of sufficient conditions under which the ground truth factors can be identified, and to the introduction of a novel, theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into the framework. The experiments show that the proposed approach outperforms previous SOTA techniques, which generalize poorly in more realistic scenarios where the dynamics of a scene are influenced by the content.


# 131
Strong Double Blind
Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

Yuan Tian · Guo Lu · Guangtao Zhai

Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to the single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from VFMs. Specifically, we introduce a VFMs-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually-enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bitcost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.


# 137
Strong Double Blind
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval

Thomas Hummel · Shyamgopal Karthik · Mariana-Iuliana Georgescu · Zeynep Akata

In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR. Our code and benchmark are freely available at https://github.com/ExplainableML/EgoCVR.


# 139
Video Question Answering with Procedural Programs

Rohan Choudhury · Koichiro Niinuma · Kris Kitani · Laszlo Jeni

We propose to answer questions about videos by generating short procedural programs that solve visual subtasks to obtain a final answer. We present Procedural Video Querying (ProViQ) which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but cannot effectively or efficiently answer questions about videos due to their image-centric modules and lack of temporal reasoning ability. We address this by providing ProViQ with novel modules intended for video understanding, allowing it to generalize to a wide variety of videos with no additional training. As a result, ProViQ can efficiently find relevant moments in long videos, do causal and temporal reasoning, and summarize videos over long time horizons in order to answer complex questions. This code generation framework additionally enables ProViQ to perform other video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, multiple-choice and multimodal video question-answering datasets. We include video demos and code in the supplement.
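
The overall recipe (prompt an LLM with a module API, get back a short program, execute it on the video) can be sketched as below; `call_llm` and the three module names are assumed placeholders, not ProViQ's actual API.

```python
# Hedged sketch of procedural video QA: the LLM writes a program against a small
# video-module API, and we execute that program to get the answer.
API_DOC = """
def retrieve_moment(video, query: str): ...   # find the clip relevant to a query
def detect_objects(clip) -> list: ...         # list object names visible in a clip
def caption_clip(clip) -> str: ...            # describe a clip in one sentence
"""

PROMPT = """Write a Python function answer(video) that answers the question
using ONLY the API below, and returns a string.

Question: {question}
API:
{api}
"""

def answer_video_question(video, question, call_llm, modules):
    program = call_llm(PROMPT.format(question=question, api=API_DOC))
    namespace = dict(modules)        # expose the video modules to the generated program
    exec(program, namespace)         # defines answer(video) inside `namespace`
    return namespace["answer"](video)
```

Because the answer is produced by running code, the same machinery can be pointed at other tasks (e.g., tracking) simply by changing the prompt, which is what the abstract alludes to.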


# 274
ViLA: Efficient Video-Language Alignment for Video Question Answering

Xijun Wang · Junbang Liang · Chun-Kai Wang · Kenan Deng · Yu Lou · Ming C Lin · Shan Yang

In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting a pre-trained large image-language model to video-language alignment remains a major challenge. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical content, thus improving the video-language alignment accuracy while reducing the inference latency (+3.3% on NExT-QA Temporal with a 3.0X speed-up). Overall, our ViLA network outperforms the state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0X speed-up, and our 2-frame model outperforms the 4-frame SeViLA on the VLEP dataset with a 4.2X speed-up.


# 275
ST-LLM: Large Language Models Are Effective Temporal Learners

Ruyang Liu · Chen Li · Haoran Tang · Yixiao Ge · Ying Shan · Ge Li

Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains to be solved. In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs? Surprisingly, this simple approach yields significant improvements in video understanding. Based upon this, we propose ST-LLM, an effective video-LLM baseline with Spatial-Temporal sequence modeling inside LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within LLMs, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we have also designed a global-local input module to balance efficiency and effectiveness. Consequently, we harness LLM for proficient spatial-temporal modeling, while upholding efficiency and stability. Extensive experimental results attest to the effectiveness of our method. Through a more concise model and training pipeline, ST-LLM establishes a new state-of-the-art result on VideoChatGPT-Bench and MVBench.


# 138
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Ali Zare · Yulei Niu · Hammad Ayyubi · Shih-Fu Chang

Procedure Planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite the rapid progress in this task, there remain several critical challenges to be solved: (1) Adaptive procedures: Prior works hold an unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable models in real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the step temporal relation knowledge is essential in producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamp) or sequence-level labels (i.e., action category) is demanding and labor-intensive, limiting its generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines the conclusion of actions using an auto-regressive model architecture. For temporal relation, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revises the generated procedures. To tackle the high annotation cost, RAP utilizes a weakly-supervised learning scheme to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on CrossTask and COIN benchmarks show the superiority of RAP over traditional fixed-length models, establishing it as a strong baseline solution for adaptive procedure planning.


# 186
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

KILICHBEK HAYDAROV · Xiaoqian Shen · Avinash Madasu · Mahmoud Salem · Li-Jia Li · Gamaleldin F Elsayed · Mohamed Elhoseiny

We introduce Affective Visual Dialog, an emotion explanation and reasoning task as a testbed for research on understanding constructed emotions in response to visually grounded conversations. The task involves three skills: (1) Dialog-based Question Answering, (2) Dialog-based Emotion Prediction, and (3) Affective explanation generation based on the dialog. Our key contribution is the collection of a large-scale dataset, dubbed AffectVisDial, consisting of 50K 10-turn visually grounded dialogs as well as concluding emotion attributions and dialog-informed textual emotion explanations, resulting in a total of 27,180 working hours. Notably, the dataset spans a broad range of visual stimuli, covering human heritage and contemporary life, with an average per-turn answer length of about 12 words — 5 times that of the VisDial dataset — and explanations exceeding 28 words on average. We explain the key design decisions behind the dataset collection, including the data inclusion and exclusion criteria applied, starting from over 100K dialogs, for quality control, and introduce the questioner and answerer tasks associated with the participants in the conversation. We propose and demonstrate solid Affective Visual Dialog baselines adapted from state-of-the-art multimodal models. Remarkably, the responses generated by our models show promising emotional reasoning abilities in response to visually grounded conversations. The dataset and code will be publicly available.


# 115
Strong Double Blind
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Yaoting Wang · Peiwen Sun · Dongzhan Zhou · Guangyao Li · Honggang Zhang · Di Hu

Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions.


# 133
Strong Double Blind
Nonverbal Interaction Detection

Jianan Wei · Tianfei Zhou · Yi Yang · Wenguan Wang

This work addresses a new challenge of understanding human nonverbal interaction in social contexts. Nonverbal signals pervade virtually every communicative act. Our gestures, facial expressions, postures, gaze, and even physical appearance all convey messages without anything being said. Despite their critical role in social life, nonverbal signals receive very limited attention compared to their linguistic counterparts, and existing solutions typically examine nonverbal cues in isolation. Our study marks the first systematic effort to enhance the interpretation of multifaceted nonverbal signals. First, we contribute a novel large-scale dataset, called NVI, which is meticulously annotated to include bounding boxes for humans and corresponding social groups, along with 22 atomic-level nonverbal behaviors under five broad interaction types. Second, we establish a new task, NVI-DET, for nonverbal interaction detection, which is formalized as identifying nonverbal interaction triplets from images. Third, we propose a nonverbal interaction detection hypergraph (NVI-DEHR), a new approach that explicitly models high-order nonverbal interactions using hypergraphs. Central to the model is a dual multi-scale hypergraph that adeptly addresses individual-to-individual and group-to-group correlations across varying scales, facilitating interactional feature learning and eventually improving interaction prediction. Extensive experiments on NVI show that NVI-DEHR improves various baselines significantly in NVI-DET. It also exhibits leading performance on HOI-DET, confirming its versatility in supporting related tasks and strong generalization ability. We hope that our study will offer the community new avenues to explore nonverbal signals in more depth.


# 180
Strong Double Blind
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Tongkun Guan · Chengyu Lin · Wei Shen · Xiaokang Yang

Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences of expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX, which may fail to describe the position and hierarchical relationship between symbols due to complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models the mathematical expression as a forest structure and parses the relative position relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms the state-of-the-art methods with gains of 2.03%/1.22%/2.00%, 1.83%, and 4.62% on the single-line CROHME 2014/2016/2019, multi-line M$^{2}$E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer.


# 51
Human-in-the-Loop Visual Re-ID for Population Size Estimation

Gustavo Perez · Daniel Sheldon · Grant Van Horn · Subhransu Maji

Computer vision-based re-identification (Re-ID) systems are increasingly being deployed for estimating population size in large image collections. However, the estimated size can be significantly inaccurate when the task is challenging or when deployed on data from new distributions. We propose a human-in-the-loop approach for estimating population size driven by a pairwise similarity derived from an off-the-shelf Re-ID system. Our approach, based on nested importance sampling, selects pairs of images for human vetting driven by the pairwise similarity, and produces asymptotically unbiased population size estimates with associated confidence intervals. We perform experiments on various animal Re-ID datasets and demonstrate that our method outperforms strong baselines and active clustering approaches. In many cases, we are able to reduce the error rates of the estimated size from around 80% using CV alone to less than 20% by vetting a fraction (often less than 0.002%) of the total pairs. The cost of vetting decreases as accuracy increases, providing a practical approach for population size estimation within a desired tolerance when deploying Re-ID systems.
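
The core sampling idea can be sketched independently of the paper's full nested scheme: draw pairs with probability proportional to Re-ID similarity, have a human vet each drawn pair, and reweight the verdicts to get an unbiased estimate of the number of true matches. Converting that count into a population size follows the paper's estimator, which is not reproduced here.

```python
# Hedged sketch: importance-sampled human vetting of pairs, with a standard
# importance-weighted estimate of the total number of matching pairs.
import numpy as np

def estimate_true_matches(similarity, vet_pair, budget=100, rng=None):
    """similarity: (P,) non-negative scores for all candidate pairs;
    vet_pair(i) -> bool is the human verdict for pair i (assumed oracle)."""
    rng = rng or np.random.default_rng(0)
    probs = similarity / similarity.sum()          # proposal driven by Re-ID scores
    idx = rng.choice(len(similarity), size=budget, p=probs)
    labels = np.array([float(vet_pair(i)) for i in idx])
    values = labels / probs[idx]                   # each draw is an unbiased estimate
    return values.mean(), values.std(ddof=1) / np.sqrt(budget)
```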


# 142
Strong Double Blind
PreLAR: World Model Pre-training with Learnable Action Representation

Lixuan Zhang · Meina Kan · Shiguang Shan · Xilin CHEN

The recent technique of Model-Based Reinforcement Learning learns to make decisions by building a world model about the dynamics of the environment. World model learning requires extensive interactions with the real environment. Therefore, several innovative approaches such as APV propose to pre-train the world model from large-scale videos in an unsupervised manner, allowing fewer interactions to fine-tune the world model. However, these methods only pre-train the world model as a video predictive model without action conditions, while the final world model is action-conditional. This gap limits the effectiveness of unsupervised pre-training in enhancing the world model's capabilities. To further release the potential of unsupervised pre-training, we introduce an approach that Pre-trains the world model from action-free videos but with a Learnable Action Representation (PreLAR). Specifically, the observations of two adjacent time steps are encoded as an implicit action representation, with which the world model is pre-trained as action-conditional. To make the implicit action representation closer to the real action, an action-state consistency loss is designed to self-supervise its optimization. During fine-tuning, the real actions are encoded as the action representation to train the overall world model for downstream tasks. The proposed method is evaluated on various visual control tasks from the Meta-world simulation environment. The results show that the proposed PreLAR significantly improves the sample efficiency in world model learning, demonstrating the necessity of incorporating action in world model pre-training.


# 89
Strong Double Blind
Learning to Build by Building Your Own Instructions

Aaron Walsman · Muru Zhang · Adam Fishman · Ali Farhadi · Dieter Fox

Structural understanding of complex visual objects is an important unsolved component of artificial intelligence. To study this, we develop a new technique for the recently proposed Break-and-Make problem in LTRON where an agent must learn to build a previously unseen LEGO assembly using a single interactive session to gather information about its components and their structure. We attack this problem by building an agent that we call InstructioNet that is able to make its own visual instruction book. By disassembling an unseen assembly and periodically saving images of it, the agent is able to create a set of instructions so that it has the information necessary to rebuild it. These instructions form an explicit memory that allows the model to reason about the assembly process one step at a time, avoiding the need for long-term implicit memory. This in turn allows us to train on much larger LEGO assemblies than has been possible in the past. To demonstrate the power of this model, we release a new dataset of procedurally built LEGO vehicles that contain an average of 31 bricks each and require over one hundred steps to disassemble and reassemble. We train these models using online imitation learning which allows the model to learn from its own mistakes. Finally, we also provide some small improvements to LTRON and the Break-and-Make problem that simplify the learning environment and improve usability.


# 174
Strong Double Blind
Situated Instruction Following

So Yeon Min · Xavier Puig · Devendra Singh Chaplot · Tsung-Yen Yang · Priyam Parashar · Akshara Rai · Ruslan Salakhutdinov · Yonatan Bisk · Roozbeh Mottaghi

Language is never spoken in a vacuum. It is expressed, comprehended, and contextualized within the holistic backdrop of the speaker's history, actions, and environment. Since humans are used to communicating efficiently with situated language, the practicality of robotic assistants hinges on their ability to understand and act upon implicit and situated instructions. In traditional instruction following paradigms, the agent acts alone in an empty house, leading to language use that is both simplified and artificially "complete." In contrast, we propose situated instruction following, which embraces the inherent underspecification and ambiguity of real-world communication with the physical presence of a human speaker. The meaning of situated instructions naturally unfolds through the past actions and the expected future behaviors of the human involved. Specifically, within our setting we have instructions that (1) are ambiguously specified, (2) have temporally evolving intent, and (3) can be interpreted more precisely with the agent's dynamic actions. Our experiments indicate that state-of-the-art Embodied Instruction Following (EIF) models lack a holistic understanding of situated human intention.


# 196
Where am I? Scene Retrieval with Language

Jiaqi Chen · Daniel Barath · Iro Armeni · Marc Pollefeys · Hermann Blum

Natural language interfaces to embodied AI are becoming more ubiquitous in our daily lives. This opens further opportunities for language-based interaction with embodied agents, such as a user instructing an agent to execute some task in a specific location. For example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene retrieval", which is closely related to "coarse localization", as we are instead searching for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. Therefore, we present Text2SceneGraphMatcher, a "scene retrieval" pipeline that learns joint embeddings between text descriptions and scene graphs to determine if they are matched.


# 185
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Zekun Qi · Runpei Dong · Shaochen Zhang · Haoran Geng · Chunrui Han · Zheng Ge · Li Yi · Kaisheng Ma

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language–unified 3D interaction tasks, such as embodied visual grounding.


# 69
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin · Xidong Peng · peishan cong · Ge Zheng · Yujing Sun · Yuenan HOU · Xinge Zhu · Sibei Yang · Yuexin Ma

We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural linguistic descriptions and online captured multi-modal visual data, including 2D images and 3D LiDAR point clouds. We present a novel method, dubbed WildRefer, for this task by fully utilizing the rich appearance information in images, the position and geometric clues in point clouds, as well as the semantic knowledge of language descriptions. Besides, we propose two novel datasets, i.e., STRefer and LifeRefer, which focus on large-scale human-centric daily-life scenarios accompanied by abundant 3D object and natural language annotations. Our datasets are significant for research on 3D visual grounding in the wild and have huge potential to boost the development of autonomous driving and service robots. Extensive experiments and ablation studies demonstrate that our method achieves state-of-the-art performance on the proposed benchmarks. Code and dataset will be released when the paper is published.


# 173
Strong Double Blind
SegPoint: Segment Any Point Cloud via Large Language Model

Shuting He · Henghui Ding · Xudong Jiang · Bihan Wen

Despite significant progress in 3D point cloud segmentation, existing methods primarily address specific tasks and depend on explicit instructions to identify targets, lacking the capability to infer and understand implicit user intentions in a unified framework. In this work, we propose a model, called SegPoint, that leverages the reasoning capabilities of a multi-modal Large Language Model (LLM) to produce point-wise segmentation masks across a diverse range of tasks: 1) 3D instruction segmentation, 2) 3D referring segmentation, 3) 3D semantic segmentation, and 4) 3D open-vocabulary semantic segmentation. To advance 3D instruction research, we introduce a new benchmark, Instruct3D, designed to evaluate segmentation performance from complex and implicit instructional texts, featuring 2,565 point cloud-instruction pairs. Our experimental results demonstrate that SegPoint achieves competitive performance on established benchmarks such as ScanRefer for referring segmentation and ScanNet for semantic segmentation, while delivering outstanding outcomes on the Instruct3D dataset. To our knowledge, SegPoint is the first model to address these varied segmentation tasks within a single framework, achieving satisfactory performance.


# 177
Strong Double Blind
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

Jin Gao · Lei Gan · Yuankai Li · Yixin Ye · Dequan Wang

Large multimodal models (LMMs) excel at adhering to human instructions. However, self-contradictory instructions may arise due to the increasing trend of multimodal interaction and longer context lengths, which is especially challenging for language beginners and vulnerable populations. We introduce the Self-Contradictory Instructions benchmark to evaluate the capability of LMMs to recognize conflicting commands. It comprises 20,000 conflicts, evenly distributed between language and vision paradigms, and is constructed by a novel automatic dataset creation framework that expedites the process and enables us to cover a wide range of instruction forms. Our comprehensive evaluation reveals that current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness. Hence, we propose Cognitive Awakening Prompting, which injects cognition from external sources and largely enhances dissonance detection. Our website, dataset, and code are available.


# 178
Strong Double Blind
GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering

Yifeng Zhang · Ming Jiang · Qi Zhao

Large language models (LLMs) exhibit exceptional reasoning capabilities and have played significant roles in knowledge-based visual question-answering (VQA) systems. By conditioning on in-context examples and task-specific prompts, they comprehensively understand input questions and provide answers relevant to the context. However, due to the reliance on in-context examples, LLMs are susceptible to inheriting dataset biases in context descriptions and the provided examples. Innovative methods are required to ensure that LLMs can deliver unbiased yet contextually relevant responses. To tackle this challenge, we present GRAph-based Contextual DEbiasing (GRACE), a novel graph-based method for debiasing knowledge-based VQA models. This approach consists of two novel and generally applicable components. First, we propose an unsupervised context graph learning method that combats biases by explicitly creating a balanced context graph under the guidance of fairness constraints. Second, building upon the context graph, we consider both semantic features and reasoning processes to enhance prompting with more relevant and diverse in-context examples. Through extensive experimentation on both in-distribution (OK-VQA) and out-of-distribution (VQA-CP, GQA-OOD) datasets, we demonstrate the effectiveness of GRACE in mitigating biases and achieving generalization. Additionally, analyses of the model performance across gender groups demonstrate GRACE's potential impacts on social equity. Our source code is publicly available at https://shorturl.at/cdmnw.


# 206
LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images

Zonghao Guo · Ruyi Xu · Yuan Yao · Junbo Cui · Zanlin Ni · Chunjiang Ge · Tat-Seng Chua · Zhiyuan Liu · Gao Huang

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images at fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address these challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images of any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller, variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema that organizes slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6x larger (i.e., 672x1088) resolution images using only 94% of the computation, and achieves a 6.4-point accuracy improvement on TextVQA. All data and code will be publicly available to facilitate future research.
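As a rough illustration of the modularization idea described above, the sketch below picks a slice grid whose aspect ratio best matches a native-resolution image and crops fixed-size slices; the 336-pixel patch size, slice budget, and log-ratio scoring are assumptions made for the example, not LLaVA-UHD's published recipe.

```python
import math
from PIL import Image

def choose_slice_grid(width, height, max_slices=6):
    """Pick the (cols, rows) grid, within a slice budget, whose aspect ratio
    is closest (in log space) to the image's native aspect ratio."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_slices + 1):
        for rows in range(1, max_slices + 1):
            if cols * rows > max_slices:
                continue
            err = abs(math.log((cols / rows) / (width / height)))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def slice_image(img: Image.Image, patch=336, max_slices=6):
    """Resize the image to the chosen grid and crop encoder-sized slices."""
    cols, rows = choose_slice_grid(*img.size, max_slices=max_slices)
    resized = img.resize((cols * patch, rows * patch))
    return [resized.crop((c * patch, r * patch, (c + 1) * patch, (r + 1) * patch))
            for r in range(rows) for c in range(cols)]
```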


# 189
BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu · Yushi Hu · Bangzheng Li · Yu Feng · Haoyu Wang · Xudong Lin · Dan Roth · Noah A Smith · Wei-Chiu Ma · Ranjay Krishna

We introduce BLINK, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most BLINK tasks can be solved by humans "within a blink" (e.g., depth estimation, correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. BLINK reformats 14 classic computer vision tasks into 3,978 multiple-choice questions, paired with single or multiple images and visual prompting. While humans achieve 95.70% accuracy on average, BLINK is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of only 51.32% and 45.46%, just 13.23% and 7.47% above random guessing, indicating that such perception abilities have not yet "emerged" in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe BLINK will stimulate the community to help multimodal LLMs catch up with human-level perception.


# 170
Strong Double Blind
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Jinrui Zhang · Teng Wang · Haigang Zhang · Ping Lu · Feng Zheng

Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. However, they remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. While various mitigation strategies have been proposed, they often neglect a key contributor to hallucinations: the lack of fine-grained reasoning supervision during training. Without intermediate reasoning steps, models may establish superficial shortcuts between instructions and responses, failing to internalize the inherent reasoning logic. To address this challenge, we propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning. Unlike previous methods that learn from responses only, our approach has the model predict rationales justifying why responses are correct or incorrect. This fosters a deeper engagement with the fine-grained reasoning underlying each response, thus enhancing the model's reasoning proficiency. To facilitate this approach, we propose REVERIE, the first large-scale instruction-tuning dataset with ReflEctiVE RatIonalE annotations. REVERIE comprises 115k machine-generated reasoning instructions, each meticulously annotated with a corresponding pair of correct and confusing responses, alongside comprehensive rationales elucidating the justification behind the correctness or erroneousness of each response. Experimental results on multiple LVLM benchmarks reveal that reflective instruction tuning with the REVERIE dataset yields noticeable performance gains over the baseline model, demonstrating the effectiveness of reflecting on the rationales. The project page is at https://zjr2000.github.io/projects/reverie


# 195
Strong Double Blind
Teach CLIP to Develop a Number Sense for Ordinal Regression

Yao DU · Qiang Zhai · Weihang Dai · Xiaomeng Li

Ordinal regression is a fundamental problem in computer vision, typically addressed with customised, well-trained models for specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less exploration. In this study, we first investigate CLIP's potential for ordinal regression, expecting the model to generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation in encapsulating compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We decompose the exact image-to-number-specific-text matching problem into coarse classification and fine prediction stages. We discretize and phrase each numerical bin with common language concepts to better leverage the available pre-trained alignment in CLIP. To account for the inherent continuity of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to preserve both semantic and ordinal alignment in CLIP's feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvements on the historical image dating and image aesthetics assessment tasks, respectively.


# 176
Common Sense Reasoning for Deep Fake Detection

Yue Zhang · Ben Colman · Xiao Guo · Ali Shahriyari · Gaurav Bharaj

State-of-the-art deepfake detection approaches rely on image-based features extracted via neural networks. While these approaches, trained in a supervised manner, extract likely fake features, they may fall short in representing unnatural 'non-physical' semantic facial attributes -- blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin shading. However, such facial attributes are easily perceived by humans and used to discern the authenticity of an image based on common sense. Furthermore, image-based feature extraction methods that provide visual explanations via saliency maps can be hard for humans to interpret. To address these challenges, we frame deepfake detection as a Deepfake Detection VQA (DD-VQA) task and model human intuition by providing textual explanations that describe common sense reasons for labeling an image as real or fake. We introduce a new annotated dataset and propose a Vision and Language Transformer-based framework for the DD-VQA task. We also incorporate a text- and image-aware feature alignment formulation to enhance multi-modal representation learning. As a result, we improve upon existing deepfake detection models by integrating our learned vision representations, which reason over common sense knowledge from the DD-VQA task. We evaluate our method on both deepfake detection performance and the quality of the generated explanations. Our empirical results show that incorporating textual explanations benefits detection performance, generalization ability, and the language-based interpretability of deepfake detection.


# 181
Strong Double Blind
Efficient Inference of Vision Instruction-Following Models with Elastic Cache

ZUYAN LIU · Benlin Liu · Jiahui Wang · Yuhao Dong · Guangyi Chen · Yongming Rao · Ranjay Krishna · Jiwen Lu

In the field of instruction-following large language models (LLMs), especially those extended to multimodal settings, efficient deployment faces challenges, notably due to the high memory demands of their key-value (KV) caches. Conventional cache management strategies for LLMs focus on cache eviction, which often fails to address the specific needs of multimodal instruction-following models. Recognizing this gap, we introduce Elastic Cache, a novel approach that applies distinct acceleration methods to the instruction encoding and output generation stages. We investigate the relevant importance metrics in each stage and propose an 'importance-driven cache merging' strategy to prune redundant caches. Instead of discarding less important caches, our strategy identifies important key/value vectors as anchor points; surrounding less important caches are then merged with these anchors, preserving contextual information in the KV caches while yielding an arbitrary acceleration ratio. For instruction encoding, we use frequency to evaluate the importance of caches; for output generation, we prioritize tokens based on their 'distance' with an offset, so that both the initial and most recent tokens are retained. Our approach has been validated on a range of vision instruction-following models. Results demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation across various tasks and models. Code and model weights will be made public upon acceptance.
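A minimal sketch of the anchor-and-merge idea, assuming a per-position importance score is already available (e.g., attention frequency during encoding, or a distance-with-offset rule during generation); the nearest-anchor assignment and plain averaging are illustrative simplifications, not the paper's exact merging rule.

```python
import torch

def merge_kv_cache(keys, values, importance, keep_ratio=0.5):
    """Keep the most important positions as anchors and average every pruned
    position's key/value into its nearest anchor (by sequence distance).
    keys, values: [T, ...]; importance: [T]."""
    T = keys.shape[0]
    k = max(1, int(T * keep_ratio))
    anchors = importance.topk(k).indices.sort().values            # [k]
    mask = torch.ones(T, dtype=torch.bool)
    mask[anchors] = False
    pruned = torch.nonzero(mask, as_tuple=True)[0]                 # [T - k]

    merged_k, merged_v = keys[anchors].clone(), values[anchors].clone()
    counts = torch.ones(k, *([1] * (keys.dim() - 1)))
    if pruned.numel() > 0:
        # assign each pruned position to the closest anchor by index distance
        dist = (pruned[:, None] - anchors[None, :]).abs()          # [T - k, k]
        target = dist.argmin(dim=1)                                # [T - k]
        merged_k.index_add_(0, target, keys[pruned])
        merged_v.index_add_(0, target, values[pruned])
        counts.index_add_(0, target,
                          torch.ones(pruned.numel(), *([1] * (keys.dim() - 1))))
    return merged_k / counts, merged_v / counts                    # [k, ...] each
```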


# 200
Strong Double Blind
SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Yang Zhou · Yongjian Wu · Jiya Saiyin · Bingzheng Wei · Maode Lai · Eric I Chang · Yan Xu

Prompt tuning methods have achieved remarkable success in parameter-efficient fine-tuning on large pre-trained models. However, their application to dual-modal fusion-based visual-language pre-trained models (VLPMs), such as GLIP, has encountered issues. Existing prompt tuning methods have not effectively addressed the modal mapping and aligning problem for tokens in different modalities, leading to poor transfer generalization. To address this issue, we propose Synchronous Dual Prompt Tuning (SDPT). SDPT initializes a single set of learnable unified prototype tokens in the established modal aligning space to represent the aligned semantics of text and image modalities for downstream tasks. Furthermore, SDPT establishes inverse linear projections that require no training to embed the information of unified prototype tokens into the input space of different modalities. The inverse linear projections allow the unified prototype token to synchronously represent the two modalities and enable SDPT to share the unified semantics of text and image for downstream tasks across different modal prompts. Experimental results demonstrate that SDPT assists fusion-based VLPMs to achieve superior outcomes with only 0.04% of model parameters for training across various scenarios, outperforming other single- or dual-modal methods.


# 193
Strong Double Blind
Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples

Chengen Lai · Shengli Song · Sitong Yan · Guangneng Hu

Vision and Language (VL) models have achieved remarkable performance on a variety of multimodal learning tasks. The success of these models is attributed to learning a joint and aligned representation space of vision and text. However, recent popular VL models still struggle with concept understanding beyond a bag of objects in images and texts, suffering in compositional reasoning about relationships between objects and attributes and about word order. To address these issues, we create a synthetic multimodal counterfactual dataset (COCO-CF) and propose a novel contrastive learning framework (COMO). COCO-CF is automatically generated from MS-COCO by injecting concepts from off-the-shelf language models and diffusion models to reduce the bag-of-objects bias. COMO effectively leverages COCO-CF by treating the counterfactual samples as hard negatives and reweighting their importance during contrastive learning. Extensive experiments and ablations show that COMO achieves a significant improvement in VL concept understanding on the VL-Checklist and Winoground benchmarks over five strong VL baselines in their zero-shot setting evaluations.


# 175
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Haoran Wei · Lingyu Kong · Jinyue Chen · liang zhao · Zheng Ge · Jinrong Yang · Jianjian Sun · Chunrui Han · Xiangyu Zhang

Most Large Vision-Language Models (LVLMs) share the same vision vocabulary, i.e., CLIP, for common vision tasks. However, for special tasks that need dense and fine-grained perception, the CLIP-style vocabulary may be inefficient at tokenizing the corresponding vision knowledge and can even suffer from out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and productive method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two stages: the generation and the integration of a new vision vocabulary. In the first stage, we devise a vocabulary network along with a tiny decoder-only transformer to compress rich vision signals. In the second, we scale up the vanilla vision vocabulary by merging the new vocabulary with the original one (CLIP), enabling LVLMs to effectively acquire new features. We present frameworks of two sizes: Vary-base (7B) and Vary-toy (1.8B), both of which enjoy excellent fine-grained perception while maintaining strong general ability.


# 187
Strong Double Blind
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Yassine Ouali · Adrian Bulat · Brais Martinez · Georgios Tzimiropoulos

We present CLIP-DPO, a preference optimization method that leverages pretrained vision-language (V-L) embedding models, such as CLIP, for DPO-based optimization of vision LLMs. Starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are then ranked by their CLIP image-text similarities to obtain positive and negative pairs for DPO-based training. We show that this simple approach offers notable performance gains across a diverse set of benchmarks and vision-language tasks.
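A minimal sketch of the pair-construction step, assuming CLIP image and caption embeddings have already been computed for one image and its candidate model outputs; picking the highest- and lowest-similarity candidates is an illustrative choice, not necessarily the paper's exact pairing rule.

```python
import torch
import torch.nn.functional as F

def build_dpo_pair(image_emb, caption_embs, captions):
    """Rank candidate captions by CLIP image-text cosine similarity and
    return a (chosen, rejected) preference pair for DPO-style training.
    image_emb: [D]; caption_embs: [N, D]; captions: list of N strings."""
    sims = F.cosine_similarity(image_emb[None, :], caption_embs, dim=-1)  # [N]
    order = sims.argsort(descending=True)
    i_best, i_worst = order[0].item(), order[-1].item()
    return {"chosen": captions[i_best],
            "rejected": captions[i_worst],
            "margin": (sims[i_best] - sims[i_worst]).item()}
```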


# 205
Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin · Deepak Pathak · Baiqi Li · Jiayao Li · Xide Xia · Graham Neubig · Pengchuan Zhang · Deva Ramanan

Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely used CLIPScore measures the alignment between a (generated) image and a text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the moon is over the cow" with "the cow is over the moon". To address this, we introduce VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show {text}?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model (which we will release) that follows best practices in the literature. For example, we find it useful to use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. Towards this end, we introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and higher-order reasoning such as comparison and logic. GenAI-Bench also collects over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, Midjourney, and Gen2. We will open-source our code, data, and model.
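The scoring rule itself is simple enough to sketch; here `yes_logprob_fn` is a hypothetical hook standing in for any off-the-shelf VQA model that can return the log-probability of answering "Yes", and the question template is paraphrased from the abstract.

```python
import math

def vqa_score(image, text, yes_logprob_fn):
    """Alignment score = P("Yes" | image, templated question), following the
    VQAScore idea. `yes_logprob_fn(image, question)` is an assumed interface
    to whichever VQA model is used."""
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    return math.exp(yes_logprob_fn(image, question))

# Usage sketch: rank several generated images for one prompt.
# scores = [vqa_score(img, prompt, yes_logprob_fn) for img in generated_images]
```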


# 208
DOCCI: Descriptions of Connected and Contrasting Images

Yasumasa Onoe · Sunayana Rane · Zachary E Berger · Yonatan Bitton · Jaemin Cho · Roopal Garg · Alexander Ku · Zarana Pareshkumar Parekh · Jordi Pont-Tuset · Garrett Tanzer · Su Wang · Jason M Baldridge

Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.


# 207
Strong Double Blind
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Mu Cai · Haotian Liu · Yuheng Li · Yijun Li · Eli Shechtman · Zhe Lin · Yong Jae Lee · Krishna Kumar Singh

In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.


# 203
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Yulin Luo · Ruichuan An · Bocheng Zou · Yiming Tang · Jiaming Liu · Shanghang Zhang

A subpopulation structure is a set of hierarchical relations among subpopulations determined by certain criteria. Discovering such structure provides a comprehensive understanding of a dataset, which is beneficial to many downstream tasks, such as handling subpopulation shifts and slice discovery. Despite its importance, we find no prior work that systematically explores the subpopulation structure of datasets. Considering that solving this task requires a broad understanding of various aspects of the datasets, in this work we leverage the world knowledge, summarization, and instruction-following capabilities of Large Language Models (LLMs) to explore the latent subpopulation structure of image datasets. Specifically, we propose a novel approach named Subpopulation Structure Discovery with Large Language Models (SSD-LLM), whose core idea is to generate and analyze informative image captions and then use the LLM to summarize the structural characteristics of the dataset based on this analysis. SSD-LLM consists of two novel prompt-engineering components, Criteria Initialization and Criteria Self-Refinement, which ensure a token-efficient and reliable discovery process. SSD-LLM offers a unified paradigm that addresses multiple downstream tasks with simple task-specific prompt tuning, including dataset organization, long-tail attribute identification, slice discovery, and our proposed slice prediction. We validate the effectiveness of SSD-LLM through these subpopulation-related tasks and hope to inspire the community to explore the potential of LLMs as dataset analysts.


# 215
Strong Double Blind
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

Yunbin Tu · Liang Li · Li Su · Chenggang Yan · Qingming Huang

Change captioning aims to succinctly describe the semantic change between a pair of similar images while being immune to distractors (illumination and viewpoint changes). Under these distractors, unchanged objects often exhibit pseudo changes in location and scale, and certain objects may overlap others, resulting in perturbed and discrimination-degraded features between the two images. However, most existing methods directly capture the difference between the images, which risks obtaining error-prone difference features. In this paper, we propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations and decorrelates different ones in a self-supervised manner, thus attaining a pair of stable image representations under distractors. The model can then better relate the two representations to capture reliable difference features for caption generation. To generate words based on the most relevant difference features, we further design a cross-modal contrastive regularization, which regularizes cross-modal alignment by maximizing the contrastive alignment between the attended difference features and the generated words. Extensive experiments show that our method outperforms the state-of-the-art methods on four public datasets.
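The correlate-matching/decorrelate-mismatching objective can be illustrated with a Barlow Twins-style cross-correlation loss over pooled features of the two images; this is a sketch of the general idea under assumed inputs, not the paper's exact formulation.

```python
import torch

def channel_decorrelation_loss(feat_a, feat_b, lam=5e-3):
    """Correlate corresponding channels of the two image representations and
    decorrelate different ones. feat_a, feat_b: [N, C] pooled features of the
    "before"/"after" images over a batch of N pairs."""
    a = (feat_a - feat_a.mean(0)) / (feat_a.std(0) + 1e-6)
    b = (feat_b - feat_b.mean(0)) / (feat_b.std(0) + 1e-6)
    c = (a.T @ b) / feat_a.shape[0]                              # [C, C] cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # pull matching channels together
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # suppress cross-channel correlation
    return on_diag + lam * off_diag
```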


# 220
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

Zhen Wang · Xinyun Jiang · Jun Xiao · Tao Chen · Long Chen

Explicit Caption Editing (ECE) -- refining reference image captions through a sequence of explicit edit operations (e.g., KEEP, DETELE) -- has raised significant attention due to its explainable and human-like nature. After training with carefully designed reference and ground-truth caption pairs, state-of-the-art ECE models exhibit limited generalization ability beyond the original training data distribution, i.e., they are tailored to refine content details only in in-domain samples but fail to correct errors in out-of-domain samples. To this end, we propose a new Diffusion-based Explicit Caption editing method: DECap. Specifically, we reformulate the ECE task as a denoising process under the diffusion mechanism, and introduce innovative edit-based noising and denoising processes. Thanks to this design, the noising process can help to eliminate the need for meticulous paired data selection by directly introducing word-level noises for training, learning diverse distribution over input reference caption. The denoising process involves the explicit predictions of edit operations and corresponding content words, refining reference captions through iterative step-wise editing. To further efficiently implement our diffusion process and improve the inference speed, DECap discards the prevalent multi-stage design and directly generates edit operations and content words simultaneously. Extensive ablations have demonstrated the strong generalization ability of DECap in various scenarios. More interestingly, it even shows great potential in improving the quality and controllability of caption generation.


# 192
Strong Double Blind
Conceptual Codebook Learning for Vision-Language Models

Yi Zhang · Ke Yu · Siqi Wu · Zhihai He

In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this interesting finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder's outputs and the text encoder's inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform the classification. Additionally, we incorporate a handcrafted concept cache as a regularization to alleviate the overfitting issues in low-shot scenarios. We observe that this conceptual codebook learning method is able to achieve enhanced alignment between visual and linguistic modalities. Extensive experimental results demonstrate that our CoCoLe method remarkably outperforms the existing state-of-the-art methods across various evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization tasks. Detailed ablation studies further confirm the efficacy of each component in CoCoLe.
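A toy sketch of how a conceptual-codebook lookup could work: an image feature queries the concept keys and retrieves a weighted mix of the associated conceptual prompts. The top-k retrieval and softmax weighting are assumptions made for the example, not CoCoLe's exact design.

```python
import torch
import torch.nn.functional as F

def retrieve_conceptual_prompts(image_feat, keys, values, topk=3):
    """Match an image feature against concept keys and return a
    similarity-weighted mix of the associated prompt values.
    image_feat: [D]; keys: [K, D]; values: [K, P, D] learnable prompt tokens."""
    sims = F.cosine_similarity(image_feat[None, :], keys, dim=-1)    # [K]
    top = sims.topk(topk)
    weights = top.values.softmax(dim=0)                              # [topk]
    prompts = (weights[:, None, None] * values[top.indices]).sum(0)  # [P, D]
    return prompts  # fed to the text encoder alongside class embeddings
```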


# 197
Strong Double Blind
Do Generalised Classifiers really work on Human Drawn Sketches?

Hmrishav Bandyopadhyay · Pinaki Nath Chowdhury · Aneeshan Sain · Subhadeep Koley · Tao Xiang · Ayan Kumar Bhunia · Yi-Zhe Song

This paper, for the first time, marries large foundation models with human sketch understanding. We demonstrate what this brings -- a paradigm shift in terms of generalised sketch representation learning (e.g., classification). This generalisation happens on two fronts: (i) generalisation across unknown categories (i.e., open-set), and (ii) generalisation traversing abstraction levels (i.e., good and bad sketches), both being timely challenges that remain unsolved in the sketch literature. Our design is intuitive and centred around transferring the already stellar generalisation ability of CLIP to benefit generalised learning for sketches. We first "condition" the vanilla CLIP model by learning sketch-specific prompts using a novel auxiliary head of raster to vector sketch conversion. This importantly makes CLIP "sketch-aware". We then make CLIP acute to the inherently different sketch abstraction levels. This is achieved by learning a codebook of abstraction-specific prompt biases, a weighted combination of which facilitates the representation of sketches across abstraction levels -- low abstract edge-maps, medium abstract sketches in TU-Berlin, and highly abstract doodles in QuickDraw. Our framework surpasses popular sketch representation learning algorithms in both zero-shot and few-shot setups and in novel settings across different abstraction boundaries.


# 100
3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views

Evangelos Ververas · Polydefkis Gkagkos · Jiankang Deng · Michail C Doukas · Jia Guo · Stefanos Zafeiriou

Developing gaze estimation models that generalize well to unseen domains and in-the-wild conditions remains a challenge with no known best solution. This is mostly due to the difficulty of acquiring ground truth data that cover the distribution of faces, head poses, and environments that exist in the real world. Most recent methods attempt to close the gap between specific source and target domains using domain adaptation. In this work, we propose to train general gaze estimation models which can be directly employed in novel environments without adaptation. To do so, we leverage the observation that head, body, and hand pose estimation benefit from revising them as dense 3D coordinate prediction, and similarly express gaze estimation as regression of dense 3D eye meshes. To close the gap between image domains, we create a large-scale dataset of diverse faces with gaze pseudo-annotations, which we extract based on the 3D geometry of the scene, and design a multi-view supervision framework to balance their effect during training. We test our method in the task of gaze generalization, in which we demonstrate improvement of up to 30% compared to state-of-the-art when no ground truth data are available, and up to 10% when they are. The project material will become available for research purposes.


# 194
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Muhammad Jehanzeb Mirza · Leonid Karlinsky · Wei Lin · Sivan Doveh · Jakub Micorek · Mateusz Kozinski · Hilde Kuehne · Horst Possegger

Prompt ensembling of Large Language Model (LLM) generated, category-specific prompts has emerged as an effective method to enhance the zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, present methods rely on hand-crafting the prompts given to the LLM for generating VLM prompts for the downstream task. However, this requires manually composing these task-specific prompts, and even then they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of a short natural language description and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts, resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks from widely different domains when tested with multiple LLMs and VLMs. For example, MPVR improves zero-shot recognition over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.
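The downstream use of the generated prompts is standard prompt ensembling, sketched below; `generate_prompts` is a hypothetical hook that queries an LLM with a meta-prompt built from the task description, and `encode_text` is any VLM text encoder, so neither name comes from the paper.

```python
import torch
import torch.nn.functional as F

def build_zero_shot_classifier(class_names, generate_prompts, encode_text):
    """Ensemble LLM-generated, category-specific prompts into one classifier
    weight per class. `generate_prompts(name)` -> list of prompt strings;
    `encode_text(list_of_strings)` -> [P, D] text embeddings."""
    weights = []
    for name in class_names:
        embs = F.normalize(encode_text(generate_prompts(name)), dim=-1)  # [P, D]
        weights.append(F.normalize(embs.mean(0), dim=-1))                # prompt ensemble
    return torch.stack(weights)                                          # [C, D]

def classify(image_emb, classifier):
    """image_emb: [D] normalized VLM image embedding."""
    return (classifier @ image_emb).argmax().item()
```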


# 213
Strong Double Blind
PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery

Jicheol Park · Dongwon Kim · Boseung Jeong · Suha Kwak

Text-based person search, employing free-form text queries to identify individuals within a vast image collection, presents a unique challenge in aligning visual and textual representations, particularly at the human part level. Existing methods often struggle with part feature extraction and alignment due to the lack of direct part-level supervision and reliance on heuristic features. We propose a novel framework that leverages a part discovery module based on slot attention to autonomously identify and align distinctive parts across modalities, enhancing interpretability and retrieval accuracy without explicit part-level correspondence supervision. Additionally, text-based dynamic part attention adjusts the importance of each part, further improving retrieval outcomes. Our method is evaluated on three public benchmarks, significantly outperforming existing methods.


# 188
Discovering Unwritten Visual Classifiers with Large Language Models

Mia Chiquier · Utkarsh Mall · Carl Vondrick

Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance. However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, risk for bias, and inability to discover new visual concepts not written down. Moreover, in practical settings, the vocabulary for class names and attributes of specialized concepts will not be known, preventing these methods from performing well on images uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. We introduce an evolutionary search algorithm that utilizes a large language model and its in-context learning abilities to iteratively mutate a concept bottleneck of attributes for classification. Our method produces state-of-the-art, interpretable fine-grained classifiers. We outperform the latest baselines by 18.4% on five fine-grained iNaturalist datasets and by 22.2% on two KikiBouba datasets, despite the baselines having access to privileged information about class names.
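The attribute-bottleneck scoring that such an evolutionary search optimizes might look like the sketch below, where each class logit is the mean similarity between the image embedding and that class's attribute embeddings; the mean aggregation and surrounding setup are assumptions, and the LLM-driven mutation loop is only indicated in the docstring.

```python
import torch
import torch.nn.functional as F

def attribute_bottleneck_logits(image_embs, class_attributes, encode_text):
    """Score images with an interpretable attribute bottleneck (assumed setup,
    not the paper's exact pipeline). An outer evolutionary loop would ask an
    LLM to mutate the attribute lists of poorly performing classes and keep
    mutations that improve held-out accuracy.
    image_embs: [N, D] normalized; class_attributes: list (per class) of
    lists of attribute strings; encode_text: returns [A, D] embeddings."""
    logits = []
    for attrs in class_attributes:
        a = F.normalize(encode_text(attrs), dim=-1)   # [A, D]
        logits.append((image_embs @ a.T).mean(-1))    # [N] mean image-attribute similarity
    return torch.stack(logits, dim=-1)                # [N, C]
```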


# 48
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Yixuan Wu · Yizhou Wang · Shixiang Tang · Wenhao Wu · Tong He · Wanli Ouyang · Philip Torr · Jian Wu

We present DetToolChain, a novel prompting paradigm to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new chain-of-thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measurement standards (e.g., overlaying rulers and compasses), and infer from contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on the MS COCO novel class set for open-vocabulary detection, +24.23% Acc on the RefCOCO val set for referring expression comprehension, and +14.5% AP on the D-cube describe object detection FULL setting. The code will be released upon acceptance.


# 46
Strong Double Blind
LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Penghui Du · Yu Wang · Yifan Sun · Luting Wang · Yue Liao · gang zhang · Errui Ding · Yan Wang · Jingdong Wang · Si Liu

Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs), such as CLIP. However, two main challenges emerge: (1) A deficiency in concept representation, where the category names in CLIP's text space lack textual and visual knowledge. (2) An overfitting tendency towards base categories, with the open vocabulary knowledge biased towards base categories during the transfer from VLMs to detectors. To address these challenges, we propose the Language Model Instruction (LaMI) strategy, which leverages the relationships between visual concepts and applies them within a simple yet effective DETR-like detector, termed LaMI-DETR. LaMI utilizes GPT to construct visual concepts and employs T5 to investigate visual similarities across categories. These inter-category relationships refine concept representation and avoid overfitting to base categories. Comprehensive experiments validate our approach's superior performance over existing methods in the same rigorous setting without reliance on external training resources. LaMI-DETR achieves a rare box AP of 43.4 on OV-LVIS, surpassing the previous best by 7.8 rare box AP.


# 54
Strong Double Blind
Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction

Yansheng Li · Tingzhu Wang · Kang Wu · Linlin Wang · Xin Guo · Wenbin Wang

Scene Graph Generation (SGG) aims to explore the relationships between objects in images and obtain scene summary graphs, thereby better serving downstream tasks. However, the long-tailed problem has adversely affected the scene graph's quality. The predictions are dominated by coarse-grained relationships, lacking more informative fine-grained ones. The union region of one object pair (i.e., one sample) contains rich and dedicated contextual information, enabling the prediction of the sample-specific bias for refining the original relationship prediction. Therefore, we propose a novel Sample-Level Bias Prediction (SBP) method for fine-grained SGG (SBG). Firstly, we train a classic SGG model and construct a correction bias set by calculating the margin between the ground truth label and the predicted label with the trained classic SGG model. Then, we devise a Bias-Oriented Generative Adversarial Network (BGAN) that learns to predict the constructed correction biases, which can be utilized to correct the original predictions from coarse-grained relationships to fine-grained ones. The extensive experiments on VG and GQA datasets demonstrate that our SBG outperforms the state-of-the-art methods in terms of Average@K across three mainstream SGG models: Motif, VCtree, and Transformer. Compared to dataset-level correction methods, our SBG shows a significant average improvement of 5.6%, 3.9%, and 3.2% on Average@K for tasks PredCls, SGCls, and SGDet, respectively. The code will be available publicly.


# 52
OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

Zhenyu Wang · Ya-Li Li · TAICHI LIU · Hengshuang ZHAO · Shengjin Wang

In the current state of 3D object detection research, the severe scarcity of annotated 3D data, substantial disparities across different data modalities, and the absence of a unified architecture, have impeded the progress towards the goal of universality. In this paper, we propose OV-Uni3DETR, a unified open-vocabulary 3D detector via cycle-modality propagation. Compared with existing 3D detectors, OV-Uni3DETR offers distinct advantages: 1) Open-vocabulary 3D detection: During training, it leverages various accessible data, especially extensive 2D detection images, to boost training diversity. During inference, it can detect both seen and unseen classes. 2) Modality unifying: It seamlessly accommodates input data from any given modality, effectively addressing scenarios involving disparate modalities or missing sensor information, thereby supporting test-time modality switching. 3) Scene unifying: It provides a unified multi-modal model architecture for diverse scenes collected by distinct sensors. Specifically, we propose the cycle-modality propagation, aimed at propagating knowledge bridging 2D and 3D modalities, to support the aforementioned functionalities. 2D semantic knowledge from large-vocabulary learning guides novel class discovery in the 3D domain, and 3D geometric knowledge provides localization supervision for 2D detection images. OV-Uni3DETR achieves the state-of-the-art performance on various scenarios, surpassing existing methods by more than 6% on average. Its performance using only RGB images is on par with or even surpasses that of previous point cloud based methods. Code and pre-trained models will be released later.


# 165
Rotary Position Embedding for Vision Transformer

Byeongho Heo · Song Park · Dongyoon Han · Sangdoo Yun

Rotary Position Embedding (RoPE) performs remarkably well on language models, especially for length extrapolation of Transformers. However, the impact of RoPE on computer vision domains has been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It ultimately leads to performance improvements on ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines for applying RoPE to ViTs, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models will be publicly released.
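For concreteness, here is one common way to extend RoPE to 2D vision tokens, rotating half of the head dimension by the row index and the other half by the column index; this axial variant is an illustration of the general mechanism, not necessarily the paper's exact 2D formulation, and the frequency base is a placeholder.

```python
import torch

def rope_1d(x, pos, base=100.0):
    """Apply 1D rotary embedding to the last dim of x using positions `pos`.
    x: [..., N, D] with D even; pos: [N] integer positions."""
    D = x.shape[-1]
    freqs = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)  # [D/2]
    angles = pos[:, None].float() * freqs[None, :]                     # [N, D/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(q, grid_h, grid_w):
    """Axial 2D RoPE for ViT tokens: rotate the first half of the head dim by
    the row index and the second half by the column index. In practice this is
    applied to both queries and keys. q: [..., H*W, D] with D divisible by 4."""
    D = q.shape[-1]
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()                                # [H*W]
    return torch.cat([rope_1d(q[..., : D // 2], ys),
                      rope_1d(q[..., D // 2 :], xs)], dim=-1)
```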


# 171
Strong Double Blind
Multi-branch Collaborative Learning Network for 3D Visual Grounding

Zhipeng Qian · Yiwei Ma · Zhekai Lin · Jiayi Ji · Xiawu Zheng · Xiaoshuai Sun · Rongrong Ji

3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating the potential for collaboration between them. However, existing collaborative approaches predominantly depend on the predictions of one task to make predictions for the other, limiting effective collaboration. We argue that employing separate branches for 3DREC and 3DRES tasks enhances the model's capacity to learn specific information for each task, enabling them to acquire complementary knowledge. Thus, we propose the MCLN framework, which includes independent branches for 3DREC and 3DRES tasks. This enables dedicated exploration of each task and effective coordination between the branches. Furthermore, to facilitate mutual reinforcement between these branches, we introduce a Relative Superpoint Aggregation (RSA) module and an Adaptive Soft Alignment (ASA) module. These modules significantly contribute to the precise alignment of prediction results from the two branches, directing the module to allocate increased attention to key positions. Comprehensive experimental evaluation demonstrates that our proposed method achieves state-of-the-art performance on both the 3DREC and 3DRES tasks, with an increase of 3.27% in Acc@0.5 for 3DREC and 5.22% in mIOU for 3DRES.


# 204
SILC: Improving Vision Language Pretraining with Self-Distillation

Muhammad Ferjad Naeem · Yongqin Xian · Xiaohua Zhai · Lukas Hoyer · Luc Van Gool · Federico Tombari

Image-text pretraining on web-scale image caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective used by these models only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we introduce SILC, a novel framework for vision-language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation, while also providing improvements on image-level tasks such as classification and retrieval. SILC models set a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation. We further show that SILC features greatly benefit open-vocabulary detection, captioning, and visual question answering.
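The local-to-global self-distillation component follows a familiar EMA-teacher recipe; the sketch below shows the generic form with placeholder temperatures and momentum, not SILC's exact projection head or hyperparameters.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Update the EMA teacher towards the student (architectures assumed identical)."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps.detach(), alpha=1 - momentum)

def local_to_global_loss(student_local_logits, teacher_global_logits,
                         temp_s=0.1, temp_t=0.04):
    """Train the student, which sees local crops, to match the EMA teacher's
    distribution computed from the global view (projected logits assumed)."""
    t = F.softmax(teacher_global_logits / temp_t, dim=-1).detach()
    s = F.log_softmax(student_local_logits / temp_s, dim=-1)
    return -(t * s).sum(-1).mean()
```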


# 184
Strong Double Blind
LiteSAM is Actually what you Need for segment Everything

Jianhai Fu · Yuanjie Yu · Ningchuan Li · Yi Zhang · Qichao Chen · Jianping Xiong · Jun Yin · Zhiyu Xiang

The Segment Anything model (SAM) has brought significant changes to the segmentation field with its superior performance, but its extensive computational resource requirements remain a limiting factor. Many works, such as MobileSAM, Edge-SAM, and MobileSAM-v2, have explored lightweight solutions. However, their use of traditional grid-search sampling strategies or two-stage concatenation methods, which do not allow end-to-end training, severely limits their segment-everything (SegEvery) performance. This paper introduces Lite-SAM, an efficient end-to-end solution for the SegEvery task designed to reduce computational cost and redundancy. Lite-SAM is composed of four main components: a streamlined CNN-Transformer hybrid encoder (LiteViT), an automated prompt proposal network (AutoPPN), a standard prompt encoder, and a mask decoder, all integrated within the SAM framework. Our LiteViT, a high-performance lightweight backbone, has only 1.16M parameters, a 23% reduction compared to the lightest existing backbone, ShuffleNet. We also introduce AutoPPN, an innovative end-to-end method for prompt proposal generation. It improves over traditional grid-search sampling, and its design allows easy integration into any SAM-series algorithm, extending its usability. We have thoroughly benchmarked Lite-SAM across a range of public and private datasets, evaluating a broad spectrum of metrics including the number of parameters, SegEvery execution time, and accuracy. The results reveal that Lite-SAM, with a lean 4.2M parameters, significantly outpaces its counterparts, demonstrating performance improvements of 43x, 31x, 20x, 21x, and 1.6x over SAM, MobileSAM, Edge-SAM, EfficientViT-SAM, and MobileSAM-v2 respectively, while maintaining competitive accuracy. This underscores Lite-SAM's ability to strike an optimal balance between performance and precision, setting a new state-of-the-art (SOTA) benchmark in the domain.


# 214
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

Sanghyun Jo · Soohyun Ryu · Sungyub Kim · Eunho Yang · Kyungsu Kim

We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP embeddings prioritizing one specific tag in image-text relationships. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. We first extract all image-relevant tags from text based on their similarity to the nearest pixels. Then, we distill a combined mask containing the extracted tags' content to a text-derived mask. This approach ensures the unbiased image-text alignment of the CLIP-based models using only image-text pairs without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code and data are available at https://github.com/shjo-april/TTD.
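The first step, keeping only the tags whose text embedding is close to some pixel, can be sketched as below; the similarity threshold and normalization choices are assumptions made for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def extract_relevant_tags(pixel_embs, tag_embs, tags, threshold=0.25):
    """Select image-relevant tags parsed from the caption: a tag is kept if
    its text embedding is sufficiently similar to its nearest pixel embedding.
    pixel_embs: [H*W, D]; tag_embs: [T, D]; tags: list of T strings."""
    p = F.normalize(pixel_embs, dim=-1)
    t = F.normalize(tag_embs, dim=-1)
    nearest = (t @ p.T).max(dim=-1).values          # [T] best pixel similarity per tag
    return [tag for tag, s in zip(tags, nearest.tolist()) if s >= threshold]
```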


# 45
Strong Double Blind
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

Dahyun Kang · Minsu Cho

Vision-and-language foundation models have shown impressive results on zero-shot image classification, where the target classes are represented by text descriptions with no labeled image examples. Recent work extends such powerful image-text correspondence to open-vocabulary segmentation, i.e., predicting pixel-text correspondence without pixel-level supervision on the unseen target classes. Much of the previous art casts this task as pixel-to-text classification without the goal of comprehending objects within an image. We believe segmentation is a visual understanding task and advocate decoupling segmentation from visual grounding. To this end, we introduce Lazy Visual Grounding for zero-shot open-vocabulary segmentation. Lazy visual grounding first discovers distinguishable visual units as object masks with iterative graph cuts and then assigns text to the discovered visual objects in a late-interaction manner. Our model is training-free yet shows strong performance on four public datasets: Pascal VOC, COCO-Object, COCO-Stuff, and ADE20K, and in particular produces visually appealing segmentation results, indicating the model's capability to comprehend visual objectness. Code and data will be released once accepted.
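The late-interaction assignment step, labeling each discovered object mask after the fact, reduces to a mask-pooled similarity lookup; the sketch below assumes pixel embeddings and class text embeddings from a CLIP-like model and non-empty masks, and is not the paper's full pipeline.

```python
import torch
import torch.nn.functional as F

def assign_text_to_masks(pixel_embs, masks, class_text_embs):
    """Average pixel embeddings inside each discovered object mask, then label
    the mask with the most similar class text embedding.
    pixel_embs: [H*W, D]; masks: [M, H*W] boolean; class_text_embs: [C, D]."""
    text = F.normalize(class_text_embs, dim=-1)
    labels = []
    for m in masks:
        feat = F.normalize(pixel_embs[m].mean(0), dim=-1)  # [D] mask-pooled feature
        labels.append((text @ feat).argmax().item())       # best-matching class index
    return labels
```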


# 212
Strong Double Blind
CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings

Cristina Mata · Kanchana N Ranasinghe · Michael S Ryoo

Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel covariance-based pixel-text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are combined and fed to a frozen CLIP model. In experiments on GTA→Cityscapes and Synthia→Cityscapes, we show that a model trained using CoPT achieves new state-of-the-art performance on UDA for segmentation.


# 42
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

Lingchen Meng · Shiyi Lan · Hengduo Li · Jose M Alvarez · Zuxuan Wu · Yu-Gang Jiang

In-context segmentation aims at segmenting novel images using a few labeled example images, termed "in-context examples", exploiting content similarities between the examples and the target. The resulting models can be generalized seamlessly to novel segmentation tasks, significantly reducing labeling and training costs compared with conventional pipelines. However, in-context segmentation is more challenging than classic segmentation, as it requires the model to learn segmentation rules conditioned on a few samples. Unlike previous work with ad-hoc or non-end-to-end designs, we propose SEGIC, an end-to-end segment-in-context framework built upon a single vision foundation model (VFM). In particular, SEGIC leverages the emergent correspondence within the VFM to capture dense relationships between target images and in-context samples. Information from in-context samples is then extracted into three types of instructions, i.e., geometric, visual, and meta instructions, which serve as explicit conditions for the final mask prediction. SEGIC is a straightforward yet effective approach that yields state-of-the-art performance on one-shot segmentation benchmarks. Notably, SEGIC can be easily generalized to diverse tasks, including video object segmentation and open-vocabulary segmentation.


# 38
Strong Double Blind
Click Prompt Learning with Optimal Transport for Interactive Segmentation

Jie Liu · haochen wang · Wenzhe Yin · Jan-Jakob Sonke · Efstratios Gavves

Click-based interactive segmentation aims to segment target objects conditioned on user-provided clicks. Existing methods typically interpret user intention by learning multiple click prompts to generate corresponding prompt-activated masks and selecting one of these masks. However, directly matching each prompt to the same visual feature often leads to homogeneous prompt-activated masks, as it pushes the click prompts to converge to one point. To address this problem, we propose Click Prompt Learning with Optimal Transport (CPlot), which leverages optimal transport theory to capture diverse user intentions with multiple click prompts. Specifically, we first introduce a prompt-pixel alignment module (PPAM), which aligns click prompts with visual features in the same feature space using plain transformer blocks. In this way, PPAM enables all click prompts to encode more general knowledge about regions of interest, indicating a consistent user intention. To capture diverse user intentions, we further propose the click prompt optimal transport module (CPOT) to match click prompts and visual features. CPOT is designed to learn an optimal mapping between click prompts and visual features. This unique mapping facilitates click prompts in focusing on distinct visual regions, which reflect the underlying diverse user intentions. Furthermore, CPlot learns click prompts with a two-stage optimization strategy: the inner loop optimizes the optimal transport distance to align visual features with click prompts through the Sinkhorn algorithm, while the outer loop adjusts the click prompts using the supervised data. Extensive experiments on eight interactive segmentation benchmarks demonstrate the superiority of our method for interactive segmentation.
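The inner-loop Sinkhorn iterations referenced above are standard entropic optimal transport; a minimal version with uniform marginals is sketched below, where the cost matrix could be, for example, one minus the prompt-feature cosine similarity (the marginals and hyperparameters are assumptions, not the paper's settings).

```python
import torch

def sinkhorn(cost, eps=0.05, iters=50):
    """Compute an entropic optimal-transport plan between click prompts and
    visual features via Sinkhorn iterations. cost: [P, V]; returns plan [P, V]."""
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    a = torch.full((cost.shape[0],), 1.0 / cost.shape[0])   # uniform prompt marginal
    b = torch.full((cost.shape[1],), 1.0 / cost.shape[1])   # uniform feature marginal
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]               # transport plan
```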


# 50
3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Zihao Xiao · Longlong Jing · Shangxuan Wu · Alex Zihao Zhu · Jingwei Ji · Chiyu Max Jiang · Wei-Chih Hung · Thomas Funkhouser · Weicheng Kuo · Anelia Angelova · Yin Zhou · Shiwei Sheng

3D panoptic segmentation is a challenging perception task, especially in autonomous driving. It aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing these approaches to unseen things and unseen stuff categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation has achieved promising results that solely rely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not guarantee good performance due to poor per-mask classification quality, especially for novel stuff categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features, using a single classification head to make predictions for both base and novel classes. To further improve the classification performance on novel classes and leverage the CLIP model, we propose two novel loss functions: object-level distillation loss and voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms the strong baseline by a large margin.
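A voxel-level distillation term of the kind described could be as simple as a cosine-distance loss between learnable LiDAR voxel features and the frozen 2D CLIP features projected onto the same voxels; the sketch below is illustrative only, not the paper's exact objective, and assumes the 2D-to-3D projection has already been computed.

```python
import torch
import torch.nn.functional as F

def voxel_distillation_loss(lidar_feats, clip_feats, valid):
    """Pull learnable LiDAR voxel features towards frozen CLIP features.
    lidar_feats, clip_feats: [V, D]; valid: [V] boolean mask of voxels that
    project onto at least one camera pixel."""
    s = F.normalize(lidar_feats[valid], dim=-1)
    t = F.normalize(clip_feats[valid], dim=-1).detach()  # teacher stays frozen
    return (1 - (s * t).sum(-1)).mean()                  # cosine-distance distillation
```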


# 40
Segment and Recognize Anything at Any Granularity

Feng Li · Hao Zhang · Peize Sun · Xueyan Zou · Shilong Liu · Chunyuan Li · Jianwei Yang · Lei Zhang · Jianfeng Gao

In this work, we introduce Semantic-SAM, an augmented image segmentation foundation model for segmenting and recognizing anything at desired granularities. Compared to the foundational segmentation model SAM, our model has two unique advantages: (i) granularity-controllability in that the model can produce segmentation masks at any desired granularities, from objects to parts to both; (ii) semantic-awareness in that the model simultaneously predicts semantic labels for masks at different granularities. To enable multi-granularity capabilities, we propose a multichoice learning scheme, where each click point generates a set of masks at multiple levels of granularity, corresponding to a set of ground-truth masks. To achieve semantic awareness, we consolidate multiple datasets of different levels of granularity and train our model using decoupled object- and part-based tasks to facilitate knowledge sharing and transfer among different tasks. To the best of our knowledge, this work is the first attempt to jointly train a model on SA-1B, instance-level, and part-level segmentation datasets. Experimental results and visualizations demonstrate that our model successfully achieves the desired goals. Furthermore, we show that multi-task training using the segmentation task defined on SA-1B and other segmentation tasks (e.g., panoptic and part segmentation) leads to performance gains on all segmentation tasks. In particular, we achieve a new state of the art of 60.2 PQ on COCO panoptic segmentation by adding SAM data.


# 39
Strong Double Blind
SOS: Segment Object System for Open-World Instance Segmentation With Object Priors

Christian Wilms · Tim Rolff · Maris N Hillemann · Robert Johanson · Simone Frintrop

We propose an approach for Open-World Instance Segmentation (OWIS), a task that aims to segment arbitrary unknown objects in images by generalizing from a limited set of object classes during training. Our Segment Object System (SOS) explicitly addresses the generalization ability and the low precision of state-of-the-art systems, which often generate background detections. To this end, we generate high-quality pseudo annotations based on the recent foundation model SAM. We thoroughly study various object priors to generate prompts for SAM, explicitly focusing the foundation model on objects. The strongest object priors were obtained by self-attention maps from self-supervised Vision Transformers, which we utilize for prompting SAM. Finally, the post-processed segments from SAM are used as pseudo annotations to train a standard instance segmentation system. Our approach shows strong generalization capabilities on COCO, LVIS, and ADE20k datasets and improves on the precision of the results by up to 81.6% compared to the state-of-the-art.


# 41
Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images

Ruiqi Wang · Akshay Gadi Patil · Fenggen Yu · Hao Richard Zhang

We introduce the first active learning (AL) model for high-accuracy instance segmentation of moveable parts from RGB images of real indoor scenes. Specifically, our goal is to obtain fully validated segmentation results by humans while minimizing manual effort. To this end, we employ a transformer that utilizes a masked-attention mechanism to supervise the active segmentation. To enhance the network tailored to moveable parts, we introduce a coarse-to-fine AL approach which first uses an object-aware masked attention and then a pose-aware one, leveraging the hierarchical nature of the problem and a correlation between moveable parts and object poses and interaction directions. When applying our AL model to 2,000 real images, we obtain fully validated moveable part segmentations with semantic labels, by only needing to manually annotate 11.45% of the images. This translates to a significant (60%) time saving over the manual effort required by the best non-AL model to attain the same segmentation accuracy. Finally, we contribute a dataset of 2,550 real images with annotated moveable parts, demonstrating its superior quality and diversity over the best alternatives.


# 43
Strong Double Blind
Phase Concentration and Shortcut Suppression for Weakly Supervised Semantic Segmentation

Hoyong Kwon · Jaeseok Jeong · Sung-Hoon Yoon · Kuk-Jin Yoon

Weakly Supervised Semantic Segmentation (WSSS) with image-level supervision typically acquires object localization information from Class Activation Maps (CAMs). While Vision Transformers (ViTs) in WSSS have been increasingly explored for their superior performance in understanding global context, CAMs from ViTs still show imprecise localization in boundary areas and false-positive activations. This paper proposes a novel WSSS framework that targets these issues based on information from the frequency domain. In our framework, we introduce the Magnitude-mixing Aided Phase Accentuation (MAPA) module, which guides the classifier to prioritize phase information containing high-level semantic details. By perturbing and mixing the magnitude, MAPA guides the classifier to accentuate and concentrate on the shape information in the phase, thereby leading to finer distinctions in CAM boundary regions. Additionally, inspired by empirical observations that the classification "shortcut" in the frequency domain can induce false positives in CAMs, we introduce a Frequency Shortcut Deterrent (FSD) module. This module aims to discourage the formation of such shortcuts, thereby mitigating false positives. The effectiveness of our approach is demonstrated by achieving new state-of-the-art performance on both the PASCAL VOC 2012 and MS COCO 2014 datasets. The code will be released.


# 35
AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation

Jiannan Ge · Lingxi Xie · Hongtao Xie · Pandeng Li · Xiaopeng Zhang · Yongdong Zhang · Qi Tian

A serious issue that harms the performance of zero-shot visual recognition is objective misalignment, i.e., the learning objective prioritizes improving the recognition accuracy of seen classes rather than unseen classes, while the latter is the true target to pursue. This issue becomes more significant in zero-shot image segmentation because the stronger (i.e., pixel-level) supervision brings a larger gap between seen and unseen classes. To mitigate it, we propose a novel architecture named AlignZeg, which embodies a comprehensive improvement of the segmentation pipeline, including proposal extraction, classification, and correction, to better fit the goal of zero-shot segmentation. (1) Mutually-Refined Proposal Extraction. AlignZeg harnesses a mutual interaction between mask queries and visual features, facilitating detailed class-agnostic mask proposal extraction. (2) Generalization-Enhanced Proposal Classification. AlignZeg introduces synthetic data and incorporates multiple background prototypes to allocate a more generalizable feature space. (3) Predictive Bias Correction. During the inference stage, AlignZeg uses a class indicator to find potential unseen class proposals followed by a prediction post-process to correct the prediction bias. Experiments demonstrate that AlignZeg markedly enhances zero-shot semantic segmentation, as shown by an average 3.5% increase in hIoU, largely attributed to a 7.1% improvement in identifying unseen classes, and we further validate that the improvement comes from alleviating the objective misalignment issue.


# 33
Strong Double Blind
Weighting Pseudo-Labels via High-Activation Feature Index Similarity and Object Detection for Semi-Supervised Segmentation

Prantik Howlader · Hieu Le · Dimitris Samaras

Semi-supervised semantic segmentation methods leverage unlabeled data by pseudo-labeling them. Thus, the success of these methods hinges on the reliability of the pseudo-labels. Existing methods mostly choose high-confidence pixels in an effort to avoid erroneous pseudo-labels. However, high confidence does not guarantee correct pseudo-labels, especially in the initial training iterations. In this paper, we propose a novel approach to reliably learn from pseudo-labels. First, we unify the predictions from a trained object detector and a semantic segmentation model to identify reliable pseudo-label pixels. Second, we assign different learning weights to pseudo-labeled pixels to avoid noisy training signals. To determine these weights, we first use the reliable pseudo-label pixels identified in the first step and labeled pixels to construct a prototype for each class. Then, the per-pixel weight is the structural similarity between the pixel and the prototype, measured via rank-statistics similarity. This metric is robust to noise, making it better suited for comparing features from unlabeled images, particularly in the initial training phases where wrong pseudo-labels are prone to occur. We show that our method can be easily integrated into four semi-supervised semantic segmentation frameworks, and improves them on both the Cityscapes and Pascal VOC datasets.
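
The weighting step can be made concrete with a small sketch: build a class prototype from trusted pixels, then weight every pseudo-labeled pixel by a rank-based similarity to that prototype. The Spearman-style correlation below is our own stand-in for the paper's rank-statistics metric, and all shapes and names are illustrative assumptions:

```python
# Hypothetical prototype-based pixel weighting with a rank-based similarity.
import torch

def rank_similarity(x, proto):
    """x: (P, D) per-pixel features, proto: (D,) class prototype -> weights in [0, 1]."""
    rx = x.argsort(dim=-1).argsort(dim=-1).float()        # per-pixel feature ranks
    rp = proto.argsort().argsort().float()                # prototype feature ranks
    rx = (rx - rx.mean(dim=-1, keepdim=True)) / (rx.std(dim=-1, keepdim=True) + 1e-6)
    rp = (rp - rp.mean()) / (rp.std() + 1e-6)
    corr = (rx * rp).mean(dim=-1)                         # correlation over rank vectors
    return (corr + 1.0) / 2.0                             # map [-1, 1] to [0, 1]

feats = torch.randn(4096, 256)                            # pseudo-labeled pixel features
prototype = feats[:100].mean(dim=0)                       # prototype from reliable pixels
weights = rank_similarity(feats, prototype)               # per-pixel loss weights
print(weights.shape, float(weights.min()), float(weights.max()))
```

Because ranks ignore absolute feature magnitudes, a weight of this kind is less sensitive to the noisy activations that dominate early training.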


# 55
SAM-guided Graph Cut for 3D Instance Segmentation

Haoyu Guo · He Zhu · Sida Peng · Yuang Wang · Yujun Shen · Ruizhen Hu · Xiaowei Zhou

This paper addresses the challenge of 3D instance segmentation by simultaneously leveraging 3D geometric and multi-view image information. Many previous works have applied deep learning techniques to 3D point clouds for instance segmentation. However, these methods often failed to generalize to various types of scenes due to the scarcity and low diversity of labeled 3D point cloud data. Some recent works have attempted to lift 2D instance segmentations to 3D within a bottom-up framework. The inconsistency in 2D instance segmentations among views can substantially degrade the performance of 3D segmentation. In this work, we introduce a novel 3D-to-2D query framework to effectively exploit 2D segmentation models for 3D instance segmentation. Specifically, we pre-segment the scene into several superpoints in 3D, formulating the task as a graph cut problem. The superpoint graph is constructed based on 2D segmentation models, enabling great segmentation performance on various types of scenes. We employ a GNN to further improve the robustness, which can be trained using pseudo 3D labels generated from 2D segmentation models. Experimental results on the ScanNet200, ScanNet++ and KITTI-360 datasets demonstrate that our method achieves state-of-the-art segmentation performance. Code will be made publicly available for reproducibility.


# 56
Strong Double Blind
Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation

Jiawei Han · Kaiqi Liu · Wei Li · Guangzhi Chen

Point cloud semantic segmentation can significantly enhance the perception of an intelligent agent. Nevertheless, the discriminative capability of the segmentation network is influenced by the quantity of samples available for different categories. To mitigate the cognitive bias induced by class imbalance, this paper introduces a novel method, namely subspace prototype guidance (SPG), to guide the training of the segmentation network. Specifically, the point cloud is initially separated into independent point sets by category to provide initial conditions for the generation of feature subspaces. The auxiliary branch, which consists of an encoder and a projection head, maps these point sets into separate feature subspaces. Subsequently, the feature prototypes, which are extracted from the current separate subspaces and then combined with prototypes of historical subspaces, guide the feature space of the main branch to enhance the discriminability of features of minority categories. The prototypes derived from the feature space of the main branch are also employed to guide the training of the auxiliary branch, forming a supervisory loop to maintain consistent convergence of the entire network. Experiments conducted on large public benchmarks (i.e., S3DIS, ScanNet v2, ScanNet200, Toronto-3D) and collected real-world data illustrate that the proposed method significantly improves segmentation performance and surpasses the state-of-the-art method.


# 53
Strong Double Blind
Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection

Jiacheng Deng · Jiahao Lu · Tianzhu Zhang

3D object detection is essential for understanding 3D scenes. Contemporary techniques often require extensive annotated training data, yet obtaining point-wise annotations for point clouds is time-consuming and laborious. Recent developments in semi-supervised methods seek to mitigate this problem by employing a teacher-student framework to generate pseudo-labels for unlabeled point clouds. However, these pseudo-labels frequently suffer from insufficient diversity and inferior quality. To overcome these hurdles, we introduce an Agent-based Diffusion Model for Semi-supervised 3D Object Detection (Diff3DETR). Specifically, an agent-based object query generator is designed to produce object queries that effectively adapt to dynamic scenes while striking a balance between sampling locations and content embedding. Additionally, a box-aware denoising module utilizes the DDIM denoising process and the long-range attention in the transformer decoder to refine bounding boxes incrementally. Extensive experiments on ScanNet and SUN RGB-D datasets demonstrate that Diff3DETR outperforms state-of-the-art semi-supervised 3D object detection methods.


# 47
Shifted Autoencoders for Point Annotation Restoration in Object Counting

Yuda Zou · Xin Xiao · Peilin Zhou · Zhichao Sun · Bo Du · Yongchao Xu

Object counting typically uses 2D point annotations. The complexity of object shapes and the subjectivity of annotators may lead to annotation inconsistency, potentially confusing counting model training. Some sophisticated noise-resistant counting methods have been proposed to alleviate this issue. Differently, we aim to directly refine the initial point annotations before training counting models. To this end, we propose the Shifted Autoencoders (SAE), which enhance annotation consistency. Specifically, SAE applies random shifts to initial point annotations and employs a UNet to restore them to their original positions. Similar to MAE reconstruction, the trained SAE captures general position knowledge and ignores specific manual offset noise. This makes it possible to restore the initial point annotations to more general and thus more consistent positions. Extensive experiments show that using such refined, consistent annotations to train several advanced (including noise-resistant) object counting models steadily and significantly boosts their performance. Remarkably, the proposed SAE helps to set new records on nine datasets. We will make the code and refined point annotations available.
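
A toy version of this "shift, then restore" recipe is easy to write down. The snippet below uses a tiny convolutional stand-in for the UNet and renders points as Gaussian maps; the architecture, map size, and shift range are all illustrative assumptions rather than the paper's settings:

```python
# Toy shifted-autoencoder loop: jitter point annotations, learn to move them back.
import torch
import torch.nn as nn
import torch.nn.functional as F

def point_map(points, size=64, sigma=2.0):
    """Render (x, y) points into a soft density map of shape (1, size, size)."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    canvas = torch.zeros(size, size)
    for x, y in points:
        canvas += torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return canvas.unsqueeze(0)

points = torch.randint(8, 56, (20, 2)).float()                 # original annotations
shifted = points + torch.randint(-3, 4, points.shape).float()  # randomly shifted copies

net = nn.Sequential(                                           # stand-in for the UNet
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(10):                                            # learn shifted -> original
    pred = net(point_map(shifted).unsqueeze(0))
    loss = F.mse_loss(pred, point_map(points).unsqueeze(0))
    opt.zero_grad(); loss.backward(); opt.step()
print("restoration loss:", loss.item())
```

At refinement time, the trained network would be applied to the original noisy annotations and the restored maps converted back to points, which is what would yield the more consistent labels used for training the counters.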


# 37
Strong Double Blind
Learning Camouflaged Object Detection from Noisy Pseudo Label

Jin Zhang · Ruiheng Zhang · Yanjiao Shi · Zhe Cao · Nian Liu · Fahad Shahbaz Khan

Existing Camouflaged Object Detection (COD) methods rely heavily on large-scale pixel-annotated training sets, which are both time-consuming and labor-intensive. Although weakly supervised methods offer higher annotation efficiency, their performance falls far behind due to the unclear visual demarcations between foreground and background in camouflaged images. In this paper, we explore the potential of using boxes as prompts in camouflaged scenes and introduce the first weakly semi-supervised COD method, aiming for budget-efficient and high-precision camouflaged object segmentation with an extremely limited number of fully labeled images. Critically, learning from such a limited set inevitably generates pseudo-labels with seriously noisy pixels. To address this, we propose a noise correction loss that facilitates the model's learning of correct pixels in the early learning stage, and corrects the error-risk gradients dominated by noisy pixels in the memorization stage, ultimately achieving accurate segmentation of camouflaged objects from noisy labels. When using only 20% of the fully labeled data, our method shows superior performance over state-of-the-art methods.


# 36
Strong Double Blind
Just a Hint: Point-Supervised Camouflaged Object Detection

Huafeng Chen · Dian SHAO · Guangqian Guo · shan gao

Camouflaged Object Detection (COD) requires models to expeditiously and accurately distinguish objects that conceal themselves seamlessly in the environment. Owing to the subtle differences and ambiguous boundaries, COD is a remarkably challenging task not only for models but also for human annotators, who must spend huge effort to provide pixel-wise annotations. To alleviate this heavy annotation burden, we propose to fulfill the task with only a single point of supervision per object. Specifically, by swiftly clicking on each object, we first adaptively expand the original point-based annotation to a reasonable hint area. Then, to avoid partial localization around discriminative parts, we propose an attention regulator that scatters model attention to the whole object by partially masking labeled regions. Moreover, to address the unstable feature representation of camouflaged objects under point-only annotation, we perform unsupervised contrastive learning based on differently augmented image pairs (e.g., color changes or translations). On three mainstream COD benchmarks, experimental results show that our model outperforms several weakly supervised methods by a large margin across various metrics. The source codes and trained models will be publicly released.


# 34
Rectify the Regression Bias in Long-Tailed Object Detection

Ke Zhu · Minghao Fu · Jie Shao · Tianyu Liu · Jianxin Wu

Long-tailed object detection faces great challenges because of its extremely imbalanced class distribution. Recent methods mainly focus on the classification bias and its loss function design, while ignoring the subtle influence of the regression branch. This paper shows that the regression bias exists and does adversely and seriously impact detection accuracy. While existing methods fail to handle the regression bias, this paper hypothesizes that the class-specific regression head for rare classes is its main cause. Accordingly, three kinds of viable solutions to cater for the rare categories are proposed, including adding a class-agnostic branch, clustering heads, and merging heads. The proposed methods bring consistent and significant improvements over existing long-tailed detection methods, especially for rare and common classes. The proposed method achieves state-of-the-art performance on the large-vocabulary LVIS dataset with different backbones and architectures. It generalizes well to more difficult evaluation metrics, relatively balanced datasets, and the mask branch. This is the first attempt to reveal and rectify the regression bias in long-tailed object detection.


# 25
Strong Double Blind
PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition

Xiao Li · Yining Liu · Na Dong · Sitian Qin · Xiaolin Hu

Deep learning-based object recognition systems can be easily fooled by adversarial examples. One reason for the weak adversarial robustness may be that they do not have a part-based inductive bias like the human recognition process. Motivated by this, several part-based recognition models have been proposed to improve the adversarial robustness of recognition. However, due to the lack of part annotations, the effectiveness of these part-based methods has only been validated on small-scale nonstandard datasets. In this work, we propose PIN++, short for PartImageNet++, a dataset providing high-quality part segmentation annotations for all categories of ImageNet-1K (IN-1K). With these annotations, we build part-based methods directly on the standard IN-1K dataset for robust recognition. Different from previous two-stage part-based models, we propose a Multi-scale Part-supervised Model (MPM) to learn a robust representation with part annotations. Experiments show that MPM yielded better adversarial robustness than strong baselines on the large-scale IN-1K across various attack settings. Furthermore, MPM achieved improved robustness on common corruptions and several out-of-distribution datasets. The dataset, together with these results, enables and encourages researchers to explore the potential of part-based models in more real applications.


# 49
Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Yan Li · Weiwei Guo · Xue Yang · Ning Liao · Dunyun He · Jiaqi Zhou · Wenxian Yu

An increasingly massive number of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond the training categories. In this paper, we aim to develop open-vocabulary object detection (OVD) techniques for aerial images that scale up the object vocabulary beyond the training data. Two fundamental challenges hinder open-vocabulary object detection performance: the quality of class-agnostic region proposals and of pseudo-labels that can generalize well to novel object categories. To simultaneously generate high-quality proposals and pseudo-labels, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework, following the student-teacher self-learning mechanism, employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach boosts not only novel object proposals but also classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo-labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate that CastDet achieves superior open-vocabulary detection performance, e.g., reaching 40.5% mAP, outperforming the previous methods Detic/ViLD by 23.7%/14.9% on the VisDroneZSD dataset. To the best of our knowledge, this is the first work to apply and develop open-vocabulary object detection techniques for aerial images.


# 66
Visible and Clear: Finding Tiny Objects in Difference Map

Bing Cao · Haiyu Yao · Pengfei Zhu · Qinghua Hu

Tiny object detection is one of the key challenges in the field of object detection. The performance of most generic detectors dramatically decreases in tiny object detection tasks. The main challenge lies in extracting effective features of tiny objects. Existing methods usually perform generation-based feature enhancement, which is seriously affected by spurious textures and artifacts, making it difficult to make the tiny-object-specific features visible and clear for detection. To address this issue, we propose a self-reconstructed tiny object detection (SR-TOD) framework. For the first time, we introduce a self-reconstruction mechanism into the detection model and discover its strong correlation with tiny objects. Specifically, we impose a reconstruction head within the neck of a detector, constructing a difference map between the reconstructed image and the input, which shows high sensitivity to tiny objects. This inspires us to enhance the weak representations of tiny objects under the guidance of the difference maps, thus improving their visibility to the detectors. Building on this, we further develop a Difference Map Guided Feature Enhancement (DGFE) module to make the tiny feature representation clearer. In addition, we propose a new multi-instance anti-UAV dataset, called DroneSwarms, which contains a large number of tiny drones with the smallest average object size to date. Extensive experiments on the DroneSwarms dataset and other datasets demonstrate the effectiveness of the proposed method. The code and dataset will be publicly available.
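
The core signal is simple to picture: reconstruct the input from neck features, take the per-pixel residual, and use it to re-weight the features. The module below is a minimal sketch under our own assumptions (the 1x1 heads, sigmoid gating, and channel sizes are illustrative, not the DGFE design):

```python
# Minimal difference-map gating sketch for tiny-object features.
import torch
import torch.nn as nn

class DiffMapGate(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        self.recon = nn.Conv2d(feat_ch, 3, 1)               # tiny reconstruction head
        self.gate = nn.Conv2d(1, feat_ch, 1)                # residual -> per-channel gain

    def forward(self, image, feats):
        recon = self.recon(feats)                           # reconstructed image
        diff = (image - recon).abs().mean(1, keepdim=True)  # (B, 1, H, W) difference map
        return feats * torch.sigmoid(self.gate(diff))       # amplify features where diff is high

image = torch.randn(2, 3, 128, 128)
feats = torch.randn(2, 64, 128, 128)                        # neck features at image resolution
print(DiffMapGate()(image, feats).shape)                    # torch.Size([2, 64, 128, 128])
```

A reconstruction trained mostly on background statistics tends to miss small, rare structures, so the residual is typically largest exactly where tiny objects sit, which is why it can serve as a guidance map.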


# 216
IRGen: Generative Modeling for Image Retrieval

Yidan Zhang · Ting Zhang · DONG CHEN · Yujing Wang · Qi Chen · Xing Xie · Hao Sun · Weiwei Deng · Qi Zhang · Fan Yang · Mao Yang · Qingmin Liao · Jingdong Wang · Baining Guo

While generative modeling has become prevalent across numerous research fields, its integration into the realm of image retrieval remains largely unexplored and underjustified. In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling and employing a sequence-to-sequence model. This approach is harmoniously aligned with the current trend towards unification in research, presenting a cohesive framework that allows for end-to-end differentiable searching. This, in turn, facilitates superior performance enhancement via direct optimization techniques. The development of our model, dubbed IRGen, addresses the critical technical challenge of converting an image into a concise sequence of semantic units, which is pivotal for enabling efficient and effective search. Extensive experiments demonstrate that our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks as well as two million-scale datasets, yielding significant improvement compared to prior competitive retrieval methods. In addition, the notable surge in precision scores facilitated by generative modeling presents the potential to bypass the reranking phase, which is traditionally indispensable in practical retrieval workflows.


# 344
I-MedSAM: Implicit Medical Image Segmentation with Segment Anything

Xiaobao Wei · Jiajun Cao · Yizhu Jin · Ming Lu · Guangyu Wang · Shanghang Zhang

With the development of Deep Neural Networks (DNNs), many efforts have been made to handle medical image segmentation. Traditional methods such as nnUNet train specific segmentation models on individual datasets. Plenty of recent methods have been proposed to adapt the foundational Segment Anything Model (SAM) to medical image segmentation. However, they still focus on discrete representations to generate pixel-wise predictions, which are spatially inflexible and scale poorly to higher resolutions. In contrast, implicit methods learn continuous representations for segmentation, which is crucial for medical image segmentation. In this paper, we propose I-MedSAM, which leverages the benefits of both continuous representations and SAM to obtain better cross-domain ability and accurate boundary delineation. Since medical image segmentation needs to predict detailed segmentation boundaries, we design a novel adapter to enhance the SAM features with high-frequency information during Parameter-Efficient Fine-Tuning (PEFT). To convert the SAM features and coordinates into continuous segmentation output, we utilize Implicit Neural Representation (INR) to learn an implicit segmentation decoder. We also propose an uncertainty-guided sampling strategy for efficient learning of the INR. Extensive evaluations on 2D medical image segmentation tasks have shown that our proposed method, with only 1.6M trainable parameters, outperforms existing methods including discrete and implicit methods. The code will be released.


# 342
Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation

Mathias Öttl · Frauke Wilm · Jana Steenpass · Jingna Qiu · Matthias Rübner · Arndt Hartmann · Matthias W. Beckmann · Peter Fasching · Andreas K Maier · Ramona Erber · Bernhard Kainz · Katharina Breininger

Deep learning-based image generation has seen significant advancements with diffusion models, notably improving the quality of generated images. Despite these developments, generating images with unseen characteristics beneficial for downstream tasks has received limited attention. To bridge this gap, we propose Style-Extracting Diffusion Models, featuring two conditioning mechanisms. Specifically, we utilize 1) a style conditioning mechanism that allows injecting style information from previously unseen images during image generation and 2) a content conditioning mechanism that can be targeted to a downstream task, e.g., layout for segmentation. We introduce a trainable style encoder to extract style information from images, and an aggregation block that merges style information from multiple style inputs. This architecture enables the generation of images with unseen styles in a zero-shot manner, by leveraging styles from unseen images, resulting in more diverse generations. In this work, we use the image layout as the target condition and first show the capability of our method on a natural image dataset as a proof of concept. We further demonstrate its versatility in histopathology, where we combine prior knowledge about tissue composition and unannotated data to create diverse synthetic images with known layouts. This allows us to generate additional synthetic data to train a segmentation network in a semi-supervised fashion. We verify the added value of the generated images by showing improved segmentation results and lower performance variability between patients when synthetic images are included during segmentation training. Our code will be made publicly available at [LINK].


# 27
Strong Double Blind
Norma: A Noise Robust Memory-Augmented Framework for Whole Slide Image Classification

Yu Bai · Bo Zhang · Zheng Zhang · Shuo Yan · Zibo Ma · Wu Liu · Xiuzhuang Zhou · Xiangyang Gong · Wendong Wang

In recent years, the Whole Slide Image (WSI) classification task has achieved great advancement due to the success of Multiple Instance Learning (MIL). However, MIL-based methods face two limitations: 1) they often select only the top-ranking instances of a WSI, based on different metrics (e.g., attention score), to train the model due to the large resolution of WSIs, which may lead to missing global information; 2) they usually consider all instances within a bag to be unordered, which causes local context information to be missing. To address these limitations, we formulate the WSI classification task as a long-sequence classification problem in a weakly supervised setting. We propose a Noise Robust Memory-augmented (Norma) framework that serializes the WSI into an ordered sequence and caches each segment for future reuse in a sequential manner. With this paradigm, both global and local context information of a WSI can be obtained during training. Furthermore, Norma adopts a Cyclic Training process to eliminate the noise introduced by the WSI-level label. We obtain state-of-the-art results on the CAMELYON-16, TCGA-BRCA and TCGA-LUNG datasets. We will release the code upon acceptance.


# 343
GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes

Ibrahim Ethem Hamamci · Sezgin Er · Anjany Sekuboyina · Enis Simsar · Alperen Tezcan · Ayse Gulnihan Simsek · Sevval Nil Esirgun · Furkan Almas · Irem Dogan · Muhammed Furkan Dasdelen · Chinmay Prabhakar · Hadrien Reynaud · Sarthak Pati · Christian Bluethgen · Mehmet Kemal Ozdemir · Bjoern Menze

Text-conditional medical image generation is vital for radiology, augmenting small datasets, preserving data privacy, and enabling patient-specific modeling. However, its applications in 3D medical imaging, such as CT and MRI, which are crucial for critical care, remain unexplored. In this paper, we introduce GenerateCT, the first approach to generating 3D medical imaging conditioned on free-form medical text prompts. GenerateCT incorporates a text encoder and three key components: a novel causal vision transformer for encoding 3D CT volumes, a text-image transformer for aligning CT and text tokens, and a text-conditional super-resolution diffusion model. Without directly comparable methods in 3D medical imaging, we benchmarked GenerateCT against cutting-edge methods, demonstrating its superiority across all key metrics. Importantly, we explored GenerateCT's clinical applications by evaluating its utility in a multi-abnormality classification task. First, we established a baseline by training a multi-abnormality classifier on our real dataset. To further assess the model's generalization to external datasets and its performance with unseen prompts in a zero-shot scenario, we employed an external dataset to train the classifier, setting an additional benchmark. We conducted two experiments in which we doubled the training datasets by synthesizing an equal number of volumes for each set using GenerateCT. The first experiment demonstrated an 11% improvement in the AP score when training the classifier jointly on real and generated volumes. The second experiment showed a 7% improvement when training on both real and generated volumes based on unseen prompts. Moreover, GenerateCT enables the scaling of synthetic training datasets to arbitrary sizes. As an example, we generated 100,000 3D CT volumes, fivefold the number in our real dataset, and trained the classifier exclusively on these synthetic volumes. Impressively, this classifier surpassed the performance of the one trained on all available real data by a margin of 8%. Lastly, domain experts evaluated the generated volumes, confirming a high degree of alignment with the text prompts. Access our code, model weights, training data, and generated data at https://github.com/ibrahimethemhamamci/GenerateCT


# 44
BugNIST - a Large Volumetric Dataset for Detection under Domain Shift

Patrick Jensen · Vedrana Dahl · Rebecca Engberg · Carsten Gundlach · Hans Martin Kjer · Anders Bjorholm Dahl

Domain shift significantly influences the performance of deep learning algorithms, particularly for object detection within volumetric 3D images. Annotated training data is essential for deep learning-based object detection. However, annotating densely packed objects is time-consuming and costly. Instead, we suggest training models on individually scanned objects, causing a domain shift between training and detection data. To address this challenge, we introduce the BugNIST dataset, comprising 9154 micro-CT volumes of 12 bug types and 388 volumes of tightly packed bug mixtures. This dataset is characterized by having objects with the same appearance in the source and target domain, which is uncommon for other benchmark datasets for domain shift. During training, individual bug volumes labeled by class are utilized, while testing employs mixtures with center point annotations and bug type labels. Together with the dataset, we provide a baseline detection analysis, aiming at advancing the field of 3D object detection methods.


# 32
Strong Double Blind
AD3: Introducing a score for Anomaly Detection Dataset Difficulty assessment using VIADUCT dataset

Jan Lehr · Jan H Philipps · Alik Sargsyan · Martin Pape · Jörg Krüger

The field of visual Industrial Anomaly Detection (IAD) has brought forth many new semi-supervised learning methods in recent years. At the same time, there have been few new datasets for benchmarking these methods. The most popular is the MVTec-AD dataset, because of its diversity of categories and availability of industrial objects. However, many methods already achieve AUROC scores of more than 99% on MVTec-AD; the defects of the categories that the dataset provides appear to be easily detectable. Furthermore, there is no existing approach to statistically describe the defects that need to be found in IAD datasets. This paper presents a new dataset for visual industrial anomaly detection and a novel approach for Anomaly Detection Dataset Difficulty assessment with the AD3 score. The new dataset, named VIADUCT, contains 49 categories and 10,986 high-resolution images from eleven different sectors. Through the support of several manufacturing companies, numerous real inspection problems are represented in the dataset, which contains a large number of different defects with detailed pixel-wise annotations. The VIADUCT dataset is compared with other state-of-the-art datasets to underline its added value; we provide an overview for each dataset regarding the number of categories, images, defect categories, and defects. In addition to these comparisons, the defects of the datasets are described with the AD3 score. This novel score analyzes the size of the defects and the similarity between each defect and its corresponding object. Using seven selected methods from industrial anomaly detection, a benchmark is performed on the new dataset, showing that there is still potential for improvement. We show that VIADUCT is the largest dataset in the field of image-based industrial anomaly detection. In addition to its very small defects, which are hard to recognize, the dataset also offers the greatest variance of possible defects and the most defect classes. Describing the datasets with the AD3 score shows that VIADUCT has the smallest and most inconspicuous defects. With the AD3 score we can create a-priori knowledge for every single defect in IAD datasets. The AD3 score correlates with the results of the IAD method benchmark, showing that it can be used to estimate defect detection difficulty. In the future, new objects can be assessed to see whether defects can be recognized using IAD methods before an energy-intensive benchmark is performed. The simple calculation of the AD3 score generates valuable a-priori knowledge and can save resources.


# 29
GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

hang yao · Ming LIU · Zhicun Yin · Zifei Yan · Xiaopeng Hong · Wangmeng Zuo

Diffusion models have shown superior performance on unsupervised anomaly detection tasks. Since trained with normal data only, diffusion models tend to reconstruct normal counterparts of test images with certain noises added. However, these methods treat all potential anomalies equally, which may cause two main problems. On the one hand, the difficulty of reconstructing images with different anomalies is uneven. For example, adding back a missing element is harder than dealing with a scratch, thus requiring larger denoising steps in diffusion models. Therefore, instead of utilizing the same setting for all samples, we propose to predict a particular denoising step for each sample by evaluating the difference between image contents and the priors extracted from diffusion models. On the other hand, even in the same image, reconstructing abnormal regions differs from normal areas. Theoretically, the diffusion model predicts a noise for each step, typically following a standard Gaussian distribution. However, due to the difference between the anomaly and the potential normal sample, the predicted noise in abnormal regions will inevitably deviate from the standard Gaussian distribution. To this end, we propose introducing synthetic abnormal samples in training to encourage the diffusion models to break through the limitation of standard Gaussian distribution, and an adaptive feature fusion scheme is utilized during inference. With the above modifications, we propose a global and local adaptive diffusion model (abbreviated to GLAD) for unsupervised anomaly detection, which introduces appealing flexibility and achieves anomaly-free reconstruction while retaining as much normal information as possible. Extensive experiments are conducted on two commonly used anomaly detection datasets (MVTec-AD and MPDD), showing the effectiveness of the proposed method. The source code and pre-trained models will be publicly available.


# 31
Strong Double Blind
Unsupervised, Online and On-The-Fly Anomaly Detection For Non-Stationary Image Distributions

Declan GD McIntosh · Alexandra Branzan Albu

We propose Online-InReaCh, the first fully unsupervised online method for detecting and localizing anomalies on-the-fly in image sequences that follow non-stationary distributions. Previous anomaly detection methods are limited to supervised one-class classification or are unsupervised but still pre-compute their nominal model. Online-InReaCh can operate online by dynamically maintaining a nominal model of commonly occurring patches that associate well across image realizations of the underlying nominal distribution, while removing stale, previously nominal patches. Online-InReaCh, while competitive in previous offline benchmarks, also achieves 0.936 and 0.961 image- and pixel-wise AUROC when tested online on MVTecAD, where 23.8% of all randomly sampled images contain anomalies. Online-InReaCh's performance did not correlate with the anomaly proportion even up to 33.5%. We also show that Online-InReaCh can integrate new nominal structures and distinguish anomalies after a single frame, even in the worst-case distribution shift from one training class to a new, previously unseen testing class.
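
A stripped-down version of the online idea can be expressed as a rolling bank of nominal patch features: score each incoming patch by its distance to the bank, add patches that associate well, and drop the oldest entries. The bank size, match threshold, and first-frame bootstrap below are illustrative assumptions, not the paper's update rule:

```python
# Simplified on-the-fly anomaly scoring with a rolling bank of nominal patches.
import torch
import torch.nn.functional as F

class PatchBank:
    def __init__(self, max_size=4096, match_thresh=0.6):
        self.bank, self.max_size, self.match_thresh = None, max_size, match_thresh

    def score_and_update(self, patches):
        """patches: (N, D) features -> per-patch anomaly scores."""
        patches = F.normalize(patches, dim=-1)
        if self.bank is None:                                # bootstrap from the first frame
            self.bank = patches
            return torch.zeros(patches.shape[0])
        best = (patches @ self.bank.T).max(dim=1).values     # similarity to nearest nominal patch
        scores = 1.0 - best                                  # low similarity -> anomalous
        nominal = patches[best > self.match_thresh]          # patches that associate well
        self.bank = torch.cat([self.bank, nominal])[-self.max_size:]  # drop stale entries
        return scores

bank = PatchBank()
for _ in range(3):                                           # a short stream of frames
    scores = bank.score_and_update(torch.randn(256, 128))
print(scores.shape, scores.mean().item())
```

Keeping the bank bounded is what lets such a scheme track a drifting nominal distribution instead of freezing it at training time.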


# 22
Strong Double Blind
Cross-Domain Learning for Video Anomaly Detection with Limited Supervision

Yashika Jain · Ali Dabouei · Min Xu

Video Anomaly Detection (VAD) automates the identification of unusual events, such as security threats in surveillance videos. In real-world applications, VAD models must effectively operate in cross-domain settings, identifying rare anomalies and scenarios not well-represented in the training data. However, existing cross-domain VAD methods focus on unsupervised learning, resulting in performance that falls short of real-world expectations. Since acquiring weak supervision for the source domain is cost-effective, we conjecture that combining it with external unlabeled data has notable potential to enhance cross-domain performance. To this end, we introduce a novel weakly supervised framework for Cross-Domain Learning (CDL) in VAD that incorporates external data during training by estimating its prediction bias and adaptively minimizing that using the predicted uncertainty. We demonstrate the effectiveness of the proposed CDL framework through comprehensive experiments conducted in various configurations on two large-scale VAD datasets: UCF-Crime and XD-Violence. Our method significantly surpasses the state-of-the-art works in cross-domain evaluations, achieving an average absolute improvement of 19.6% on UCF-Crime and 12.87% on XD-Violence.


# 276
Strong Double Blind
Attention Beats Linear for Fast Implicit Neural Representation Generation

Shuyi Zhang · Ke Liu · Jingjun Gu · Xiaoxu Cai · Zhihua Wang · Jiajun Bu · Haishuai Wang

Implicit Neural Representation (INR) has gained increasing popularity as a data representation method, serving as a prerequisite for innovative generation models. Unlike gradient-based methods, which exhibit lower efficiency at inference, the adoption of hyper-networks for generating the parameters of Multi-Layer Perceptrons (MLPs), which are responsible for executing INR functions, has surfaced as a promising and efficient alternative. However, as global continuous functions, MLPs struggle to model highly discontinuous signals, resulting in slow convergence during training and inaccurate reconstruction. Moreover, MLPs require a massive number of representation parameters, implying inefficient data representation. In this paper, we propose a novel Attention-based Localized INR (ANR) composed of a localized attention layer (LAL) and an MLP that integrates coordinate features with data features and converts them to meaningful outputs. Subsequently, we design an instance representation framework that delivers a transformer-like hyper-network to represent data instances as compact representation vectors. With an instance-specific representation vector and instance-agnostic ANR parameters, the target signals are well reconstructed as continuous functions. We further address aliasing artifacts with variational coordinates when obtaining super-resolution inference results. Extensive experimentation across four datasets showcases the notable efficacy of our ANR method, e.g., enhancing the PSNR from 37.95 dB to 47.25 dB on the CelebA dataset.


# 166
Strong Double Blind
OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks

JINGYANG XIANG · Zuohui Chen · Siqi Li · Qing Wu · Yong Liu

Binary Neural Networks (BNNs) have been proven highly effective for deploying deep learning models on mobile and embedded platforms. Most existing works focus on either designing a gradient approximation to alleviate gradient mismatch for BNNs, minimizing the quantization error, or improving representation ability, while leaving the weight flip, a critical factor for achieving powerful BNNs, untouched. In this paper, we investigate the update efficiency of weight signs. We observe that, for vanilla BNNs, over 50% of the weights keep their signs unchanged during training, and these weights are unevenly distributed throughout the network. We refer to these weights as "silent weights", which slow down convergence and lead to significant accuracy degradation. Theoretically, we demonstrate this is due to the gradients of the BNN being independent of their latent weight distribution. To this end, we propose Overcome Silent Weights (OvSW) to address the issue. OvSW first employs Adaptive Gradient Scaling (AGS) to establish the relationship between gradients and the latent weight distribution, thus improving the sign update efficiency of the overall weights. Then, we design Silence Awareness Decaying (SAD) to automatically detect "silent weights" and apply an additional penalty to facilitate their flipping. By efficiently updating weight signs, our method achieves faster convergence and state-of-the-art performance on CIFAR10 and ImageNet1K with various architectures. OvSW obtains 61.4% top-1 accuracy on ImageNet1K using a binarized ResNet18 architecture, exceeding the state of the art by over 0.4%. Codes are anonymously available at https://anonymous.4open.science/r/OvSW-1696.
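
The "silent weight" bookkeeping is easy to illustrate: track how long each latent weight's sign has stayed fixed and apply an extra decay once a window is exceeded. The window length, penalty strength, and decay form below are our own assumptions for illustration; they are not the OvSW/AGS/SAD formulas:

```python
# Illustrative detection of silent weights (signs unchanged for many steps)
# plus a small extra decay nudging them toward a flip.
import torch

class SilentWeightTracker:
    def __init__(self, weight, window=100, penalty=1e-4):
        self.prev_sign = weight.sign()
        self.steps_unchanged = torch.zeros_like(weight)
        self.window, self.penalty = window, penalty

    def step(self, weight):
        """Call after each optimizer step; returns a mask of silent weights."""
        sign = weight.sign()
        flipped = sign != self.prev_sign
        self.steps_unchanged = torch.where(flipped,
                                           torch.zeros_like(self.steps_unchanged),
                                           self.steps_unchanged + 1)
        self.prev_sign = sign
        silent = self.steps_unchanged >= self.window
        weight -= self.penalty * weight * silent              # shrink silent latent weights
        return silent

w = torch.randn(1024)
tracker = SilentWeightTracker(w, window=5)
for _ in range(20):
    w -= 1e-3 * torch.randn_like(w)                           # stand-in for an optimizer step
    silent_mask = tracker.step(w)
print("silent fraction:", silent_mask.float().mean().item())
```

Shrinking a latent weight's magnitude makes its next sign flip cheaper, which is the intuition behind penalizing weights that have not flipped for a long time.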


# 246
Strong Double Blind
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders

Carlos Hinojosa · Shuming Liu · Bernard Ghanem

Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework, offering remarkable performance across a wide range of downstream tasks. To increase the difficulty of the pretext task and learn richer visual representations, existing works have focused on replacing standard random masking with more sophisticated strategies, such as adversarial-guided and teacher-guided masking. However, these strategies depend on the input data, commonly requiring additional computations, networks, or losses to generate the mask pattern. This raises the question: Can we enhance MAE performance beyond random masking without relying on input data or incurring additional computational costs? In this work, we introduce a simple yet effective data-independent method, termed ColorMAE, which generates masks by altering white noise with different filters on its spectrum. Drawing inspiration from color noise in image processing, we explore four types of filters to yield mask patterns with different spatial and semantic priors. ColorMAE requires no additional learnable parameters or computational overhead in the network, yet it significantly enhances the learned representations. We provide a comprehensive empirical evaluation, demonstrating our strategy's superiority in various downstream tasks compared to random masking. Notably, we report an improvement of 2.72 in mIoU in semantic segmentation tasks relative to baseline MAE implementations.
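
A toy version of a data-independent noise mask is shown below: filter white noise over the patch grid to give it spatial structure, then mask the patches where the filtered noise is highest. The box filter and the 75% ratio are illustrative choices on our part, not the specific "color" filters studied in the paper:

```python
# Toy noise-driven MAE mask: low-pass filtered white noise, top-r patches masked.
import torch
import torch.nn.functional as F

def noise_mask(grid=14, mask_ratio=0.75, kernel=5):
    noise = torch.randn(1, 1, grid, grid)                    # white noise over the patch grid
    box = torch.ones(1, 1, kernel, kernel) / kernel ** 2     # simple low-pass filter
    colored = F.conv2d(noise, box, padding=kernel // 2)      # spatially correlated noise
    flat = colored.flatten()
    idx = flat.topk(int(mask_ratio * flat.numel())).indices  # patches with the highest noise
    mask = torch.zeros(grid * grid, dtype=torch.bool)
    mask[idx] = True                                         # True = masked patch
    return mask.view(grid, grid)

mask = noise_mask()
print(mask.shape, mask.float().mean().item())                # ~75% of a 14x14 grid masked
```

Because the mask depends only on the noise and the filter, it costs nothing extra per image and can even be precomputed, which is the appeal of data-independent masking.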


# 287
Strong Double Blind
AttnZero: Efficient Attention Discovery for Vision Transformers

Lujun Li · Zimian Wei · Peijie Dong · Wenhan Luo · Wei Xue · Qifeng Liu · Yike Guo

In this paper, we present AttnZero, the first framework for automatically discovering efficient attention modules tailored for Vision Transformers (ViTs). While traditional self-attention in ViTs suffers from quadratic computation complexity, linear attention offers a more efficient alternative with linear-complexity approximation. However, existing hand-crafted linear attention suffers from performance degradation. To address these issues, AttnZero constructs search spaces and employs evolutionary algorithms to discover potential linear attention formulations. Specifically, our search space consists of six kinds of computation graphs and advanced activation, normalization, and binary operators. To enhance generality, we derive the results of candidate attentions applied to multiple advanced ViTs as the multi-objective for the evolutionary search. To expedite the search process, we utilize program checking and rejection protocols to filter out unpromising candidates swiftly. Additionally, we develop Attn-Bench-101, which provides precomputed performance of 2,000 attentions in the search spaces, enabling us to summarize attention design insights. Experimental results demonstrate that the discovered AttnZero module generalizes well to different tasks and consistently achieves improved performance across various ViTs. For instance, the tiny models of DeiT|PVT|Swin|CSwin trained with AttnZero on ImageNet reach 74.9%|78.1%|82.1%|82.9% top-1 accuracy. Codes are available in the Appendix.


# 337
Strong Double Blind
Isomorphic Pruning for Vision Models

Gongfan Fang · Xinyin Ma · Michael Bi Mi · Xinchao Wang

Structured pruning reduces the computational overhead of deep neural networks by removing redundant sub-structures. However, assessing the relative importance of different sub-structures remains a significant challenge, particularly in advanced vision models featuring novel mechanisms and architectures like self-attention, depth-wise convolutions, or residual connections. These heterogeneous sub-structures usually exhibit divergent parameter scales, weight distributions, and computational topologies, introducing considerable difficulty to importance comparison. To overcome this, we present Isomorphic Pruning, a simple approach that demonstrates effectiveness across a range of network architectures such as Vision Transformers and ConvNeXts, and delivers competitive performance across different model sizes. Isomorphic Pruning originates from the observation that, when evaluated under a pre-defined importance criterion, heterogeneous sub-structures demonstrate significant divergence in their importance distribution, as opposed to isomorphic structures that present similar importance patterns. This inspires us to perform isolated ranking and comparison on different types of sub-structures for more reliable pruning. Our empirical results on ImageNet-1K demonstrate that Isomorphic Pruning surpasses several pruning baselines dedicatedly designed for CNNs or Transformers. For instance, we improve the accuracy of DeiT-Tiny from 74.52% to 77.50% by pruning an off-the-shelf DeiT-Base model. For ConvNeXt-Tiny, we enhance performance from 82.06% to 82.18%, while reducing the number of parameters and memory usage.
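
The "isolated ranking" idea can be sketched in a few lines: score every prunable unit, but compute the pruning cutoff separately inside each structural group rather than over the whole network. The group names, the L1 importance score, and the 30% ratio below are assumptions made for illustration:

```python
# Sketch of per-group ranking: prune the lowest-scoring units within each group.
import torch

def prune_per_group(units, ratio=0.3):
    """units: list of (group_name, weight_tensor) -> indices of units to keep."""
    scores = [(g, w.abs().sum().item()) for g, w in units]        # L1 importance per unit
    cutoffs = {}
    for g in set(g for g, _ in scores):
        group_scores = sorted(s for gg, s in scores if gg == g)
        cutoffs[g] = group_scores[int(ratio * len(group_scores))] # per-group threshold
    return [i for i, (g, s) in enumerate(scores) if s >= cutoffs[g]]

units = [("attention_head", torch.randn(64, 64)) for _ in range(12)] + \
        [("mlp_channel", torch.randn(256)) for _ in range(12)]
kept = prune_per_group(units)
print("kept", len(kept), "of", len(units), "units")
```

Ranking attention heads only against other attention heads (and MLP channels only against MLP channels) avoids the scale mismatch that makes a single global ranking unreliable for heterogeneous architectures.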


# 335
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs

Donghyun Kim · Byeongho Heo · Dongyoon Han

This paper revives Densely Connected Convolutional Networks (DenseNets) and reveals their underrated effectiveness over predominant ResNet-style architectures. We believe DenseNets' potential was overlooked due to untouched training methods and traditional design elements not fully revealing their capabilities. Our pilot study shows that dense connections through concatenation are strong, demonstrating that DenseNets can be revitalized to compete with modern architectures. We methodically refine suboptimal components (architectural adjustments, block redesign, and improved training recipes) toward widening DenseNets and boosting memory efficiency, while keeping concatenation shortcuts. Our models, employing simple architectural elements, ultimately surpass Swin Transformer, ConvNeXt, and DeiT-III, key architectures in the residual learning lineage. Furthermore, our models exhibit near state-of-the-art performance on ImageNet-1K and on downstream tasks, including ADE20k semantic segmentation and COCO object detection/instance segmentation, competing with very recent models. Finally, we provide empirical analyses that uncover the merits of concatenation over additive shortcuts, steering a renewed preference toward DenseNet-style designs. Our code will be publicly available.


# 336
Strong Double Blind
Robustness Tokens: Towards Adversarial Robustness of Transformers

Brian Pulfer · Yury Belousov · Slava Voloshynovskiy

Recently, large pre-trained foundation models have become widely adopted by machine learning practitioners for a multitude of tasks. Given that such models are publicly available, relying on their use as backbone models for downstream tasks might result in high vulnerability to adversarial attacks crafted with the same public model. In this work, we propose Robustness Tokens, a novel approach specific to the transformer architecture that fine-tunes a few additional private tokens with low computational requirements instead of tuning model parameters as done in traditional adversarial training. We show that Robustness Tokens make Vision Transformer models significantly more robust to white-box adversarial attacks while also retaining the original downstream performances.


# 324
Strong Double Blind
Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration

Dongwon Park · Hayeon Kim · Se Young Chun

Recently, pre-trained models and efficient parameter tuning have achieved remarkable success in natural language processing and high-level computer vision with the aid of masked modeling and prompt tuning. In low-level computer vision, however, there have been limited investigations into pre-trained models, and even efficient fine-tuning strategies have not yet been explored despite their importance and benefit in various real-world tasks, such as alleviating the memory inflation issue when integrating new tasks on AI edge devices. Here, we propose a novel efficient parameter tuning approach dubbed contribution-based low-rank adaptation (CoLoRA) for multiple image restorations, along with an effective pre-training method with random-order degradations (PROD). Unlike prior art that tunes all network parameters, our CoLoRA effectively fine-tunes a small number of parameters by leveraging LoRA (low-rank adaptation) for each new vision task, with a contribution-based method that adaptively determines the layer-by-layer capacity for that task to yield performance comparable to full tuning. Furthermore, our PROD strategy allows extending the capability of pre-trained models with improved performance as well as robustness to bridge synthetic pre-training and real-world fine-tuning. Our CoLoRA with PROD has demonstrated superior performance in various image restoration tasks across diverse degradation types on both synthetic and real-world datasets for known and novel tasks.
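
For background, a generic LoRA adapter around a frozen linear layer looks like the sketch below; the contribution-based, per-layer rank allocation that CoLoRA adds on top is not reproduced here, and the rank/alpha values are illustrative:

```python
# Generic LoRA adapter: train only two low-rank factors around a frozen layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                     # freeze pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(128, 128), rank=8)
out = layer(torch.randn(4, 128))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, "trainable params:", trainable)             # only the A/B factors are trained
```

Allocating a different rank to each layer (which is where a contribution score would come in) spends the trainable-parameter budget according to how much each layer actually needs to change for the new restoration task.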


# 219
Strong Double Blind
Neural Spectral Decomposition for Dataset Distillation

Yang Shaolei · Shen Cheng · Mingbo Hong · Haoqiang Fan · Xing Wei · Shuaicheng Liu

In this paper, we propose Neural Spectrum Decomposition, a generic decomposition framework for dataset distillation. Unlike previous methods, we consider the entire dataset as a high-dimensional observation that is low-rank across all dimensions. We aim to discover the low-rank representation of the entire dataset and perform distillation efficiently. Toward this end, we learn a set of spectrum tensors and transformation matrices, which, through simple matrix multiplication, reconstruct the data distribution. Specifically, a spectrum tensor can be mapped back to the image space by a transformation matrix, and efficient information sharing during the distillation learning process is achieved through pairwise combinations of different spectrum vectors and transformation matrices. Furthermore, we integrate a trajectory matching optimization method guided by a real distribution. Our experimental results demonstrate that our approach achieves state-of-the-art performance on benchmarks, including CIFAR10, CIFAR100 and Tiny Imagenet.
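
A toy view of the low-rank reconstruction is a pairwise product of a few learnable "spectrum" vectors with a few transformation matrices, where each pair yields one synthetic image. The shapes and the direct pixel-space mapping below are illustrative assumptions only:

```python
# Toy spectral-decomposition reconstruction: pairwise spectrum x transform products.
import torch

n_spectra, n_transforms, d, h, w = 4, 4, 64, 32, 32
spectra = torch.nn.Parameter(torch.randn(n_spectra, d))                # spectrum tensors
transforms = torch.nn.Parameter(torch.randn(n_transforms, d, h * w))   # maps to image space

# Every (spectrum, transform) pair yields one distilled image: 4 x 4 = 16 images here.
images = torch.einsum("sd,tdp->stp", spectra, transforms).view(-1, 1, h, w)
print(images.shape)                                                    # torch.Size([16, 1, 32, 32])
```

The appeal is that optimizing a handful of shared factors (here 4 + 4 objects) produces a combinatorial number of distilled images, which is what makes this kind of decomposition storage- and compute-efficient.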


# 1
Strong Double Blind
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models

Taesup Kim · Donggeun Kim

Multimodal learning typically relies on the assumption that all modalities are fully available during both the training and inference phases. However, in real-world scenarios, consistently acquiring complete multimodal data presents significant challenges due to various factors. This often leads to the issue of missing modalities, where data for certain modalities are absent, posing considerable obstacles not only for the availability of multimodal pretrained models but also for their fine-tuning and the preservation of robustness in downstream tasks. To address these challenges, we propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method. This framework enables the model to predict the embedding of a missing modality in the representation space during inference. Our method effectively predicts the missing embedding through prompt tuning, leveraging information from available modalities and enhancing inter-modal interaction. We evaluate our approach on several multimodal benchmark datasets and demonstrate its effectiveness and robustness across various scenarios of missing modalities.


# 338
Adaptive Multi-head Contrastive Learning

Lei Wang · Piotr Koniusz · Tom Gedeon · Liang Zheng

In contrastive learning, two views of an original image, generated by different augmentations, are considered a positive pair, and their similarity is required to be high. Similarly, two views of distinct images form a negative pair, whose similarity is encouraged to be low. Typically, a single similarity measure, provided by a lone projection head, evaluates positive and negative sample pairs. However, due to diverse augmentation strategies and varying intra-sample similarity, views from the same image may not always be similar. Additionally, owing to inter-sample similarity, views from different images may be more alike than those from the same image. Consequently, enforcing high similarity for positive pairs and low similarity for negative pairs may be unattainable, and in some cases, such enforcement could detrimentally impact performance. To address this challenge, we propose using multiple projection heads, each producing a distinct set of features. Our pre-training loss function emerges from a solution to the maximum likelihood estimation over head-wise posterior distributions of positive samples given observations. This loss incorporates the similarity measure over positive and negative pairs, each re-weighted by an individual adaptive temperature, regulated to prevent ill solutions. Our approach, Adaptive Multi-Head Contrastive Learning (AMCL), can be applied to and experimentally enhances several popular contrastive learning methods such as SimCLR, MoCo, and Barlow Twins. The improvement remains consistent across various backbones and linear probing epochs, and becomes more significant when employing multiple augmentation methods.
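
A minimal sketch of the multi-head idea, assuming a plain InfoNCE loss per head and a learnable per-head temperature (the paper's actual loss derives from head-wise posteriors and regulates the temperatures; this only illustrates the structure):

```python
# Multiple projection heads, each with its own adaptive temperature (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadProjector(nn.Module):
    def __init__(self, dim=128, n_heads=3):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_heads))
        self.log_temp = nn.Parameter(torch.zeros(n_heads))   # adaptive temperature per head

    def forward(self, z1, z2):
        total = 0.0
        for head, lt in zip(self.heads, self.log_temp):
            p1 = F.normalize(head(z1), dim=-1)
            p2 = F.normalize(head(z2), dim=-1)
            logits = p1 @ p2.T / lt.exp().clamp(min=1e-2)     # temperature-scaled similarities
            labels = torch.arange(z1.size(0))                 # positives lie on the diagonal
            total = total + F.cross_entropy(logits, labels)
        return total / len(self.heads)

loss = MultiHeadProjector()(torch.randn(8, 128), torch.randn(8, 128))
```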


# 24
Strong Double Blind
Unsqueeze [CLS] Bottleneck to Learn Rich Representations

Qing Su · Shihao Ji

Distillation-based self-supervised learning typically leads to more compressed representations due to its radical clustering process and the implementation of a sharper target distribution. To overcome this limitation and preserve more information from the input, we introduce UDI, conceptualized as Unsqueezed DIstillation-based self-supervised learning (SSL). UDI enriches the learned representation by encouraging multi-modal prediction distilled from a consolidated profile of local predictions that are derived via stratified sampling. Our evaluations show that UDI not only promotes semantically meaningful representations at the instance level, delivering superior or competitive results to state-of-the-art SSL methods in image classification, but also effectively preserves nuisances of the input, which yields significant improvement in dense prediction tasks, including object detection and segmentation. Additionally, UDI performs competitively in low-shot image classification, improving the scalability of joint-embedding pipelines. Various visualizations and ablation studies are presented to further elucidate the mechanisms behind UDI.


# 201
Strong Double Blind
Improving Zero-Shot Generalization for CLIP with Variational Adapter

Ziqian Lu · Fengli Shen · Mushui Liu · Yunlong Yu · Xi Li

Thanks to the excellent generalization capability of pre-trained Vision-Language Models (VLMs) such as CLIP, fine-tuning VLMs for downstream tasks (e.g., zero-shot generalization) has become a popular choice. Despite achieving promising performance on base classes, most existing fine-tuned methods suffer from feature confusion on novel classes, resulting in unsatisfactory transferability. To address this problem, we propose a divide-and-conquer approach called Prompt-based Variational Adapter (PVA) that explicitly reduces the prediction bias by separating base and novel samples. Specifically, we design two variational adapters with learnable textual tokens to align latent representations for each modality in a shared latent space. Once trained, we can separate novel samples from the entangled space using the similarity metric of latent features, i.e., converting the confusion task into two independent ones (one for base classes and the other for novel classes). Moreover, to improve the transferability for novel classes, we further refine the output features of the learned adapters with the global features via a residual connection. To the best of our knowledge, this is the first framework that combines prompt learning and adapter tuning to tackle the feature confusion issue. We conduct extensive experiments on GZSL and Cross-Dataset Transfer Learning to demonstrate the superiority of our approach and establish a new state-of-the-art on four popular benchmarks.


# 21
Strong Double Blind
Learning to Obstruct Few-Shot Image Classification over Restricted Classes

Amber Yijia Zheng · Chiao-An Yang · Raymond Yeh

Advancements in open-source pre-trained backbones make it relatively easy to fine-tune a model for new tasks. However, this lowered entry barrier poses potential risks, e.g., bad actors developing models for harmful applications. A question arises: "Is it possible to develop a pre-trained model that is difficult to fine-tune for certain downstream tasks?" To begin studying this, we focus on few-shot classification (FSC). Specifically, we investigate methods to make FSC more challenging for a set of restricted classes while maintaining the performance of other classes. We propose to meta-learn over the pre-trained backbone in a manner that renders it a "poor initialization". Our proposed Learning to Obstruct (LTO) algorithm successfully obstructs four FSC methods across three datasets, including ImageNet and CIFAR100 for image classification, as well as CelebA for attribute classification.


# 28
Strong Double Blind
Improving Hyperbolic Representations via Gromov-Wasserstein Regularization

yifei Yang · Wonjun Lee · Dongmian Zou · Gilad Lerman

Hyperbolic representations have shown remarkable efficacy in modeling inherent hierarchies and complexities within data structures. Hyperbolic neural networks have been commonly applied for learning such representations from data, but they often fall short in preserving the geometric structures of the original feature spaces. In response to this challenge, our work applies the Gromov-Wasserstein (GW) distance as a novel regularization mechanism within hyperbolic neural networks. The GW distance quantifies how well the original data structure is maintained after embedding the data in a hyperbolic space. Specifically, we explicitly treat the layers of the hyperbolic neural networks as a transport map and calculate the GW distance accordingly. We validate that the GW distance computed based on a training set well approximates the GW distance of the underlying data distribution. Our approach demonstrates consistent enhancements over current state-of-the-art methods across various tasks, including few-shot image classification, as well as semi-supervised graph link prediction and node classification.
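
A simplified sketch of such a regularizer, under the assumption that the network layers provide a known correspondence between inputs and embeddings, so the GW objective reduces to comparing pairwise distance matrices before and after embedding into the Poincaré ball (the paper computes the actual GW distance):

```python
# Structure-preservation regularizer between input space and hyperbolic embeddings (sketch).
import torch

def poincare_dist(u, v, eps=1e-5):
    """Distance in the Poincare ball model of hyperbolic space."""
    sq = torch.sum((u - v) ** 2, dim=-1)
    nu = torch.clamp(1 - u.pow(2).sum(-1), min=eps)
    nv = torch.clamp(1 - v.pow(2).sum(-1), min=eps)
    return torch.acosh(1 + 2 * sq / (nu * nv))

def gw_regularizer(x, z):
    """Penalize distortion between Euclidean input distances and hyperbolic
    embedding distances (a correspondence-known surrogate for the GW distance)."""
    d_x = torch.cdist(x, x)                                  # input-space structure
    d_z = poincare_dist(z.unsqueeze(1), z.unsqueeze(0))      # embedding-space structure
    return ((d_x - d_z) ** 2).mean()

x = torch.randn(16, 10)
z = torch.randn(16, 2) * 0.1          # points safely inside the unit ball
reg = gw_regularizer(x, z)            # add to the task loss with a weighting factor
```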


# 26
Strong Double Blind
HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions

Chiranjeev Chiranjeev · Muskan Dosi · Kartik Thakral · MAYANK VATSA · RICHA SINGH

Traditional deep learning models rely on methods such as softmax cross-entropy and ArcFace loss for tasks like classification and face recognition. These methods mainly explore angular features in a hyperspherical space, often resulting in entangled inter-class features due to dense angular data across many classes. In this paper, a new field of feature exploration is proposed known as \textit{HyperSpaceX} which enhances class discrimination by exploring both angular and radial dimensions in multi-hyperspherical spaces, faciliated by a novel \textit{DistArc} loss. The proposed DistArc loss encompasses three feature arrangement components: two angular and one radial, enforcing intra-class binding and inter-class separation in multi-radial arrangement improving feature discriminability. Evaluation of \textit{HyperSpaceX} framework for the novel representation utilizes a proposed predictive measure that accounts for both angular and radial elements, providing a more comprehensive assessment of model accuracy beyond standard metrics. Experiments across six object classification and five face recognition datasets demonstrate state-of-the-art \textit{(SoTA)} results obtained from \textit{HyperSpaceX}, achieving up to a 20\% performance improvement on large-scale object datasets.


# 15
Strong Double Blind
Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density

Peiyu Yang · Naveed Akhtar · Shah Mubarak · Ajmal Mian

Trustworthy machine learning necessitates meticulous regulation of model reliance on non-robust features. We propose a framework to delineate and regulate such features by attributing model predictions to the input. Within our approach, robust feature attributions exhibit a certain consistency, while non-robust feature attributions are susceptible to fluctuations. This behavior allows identification of a correlation between model reliance on non-robust features and the smoothness of the marginal density of the input samples. Hence, we uniquely regularize the gradients of the marginal density w.r.t. the input features for robustness. We also devise an efficient implementation of our regularization to address the potential numerical instability of the underlying optimization process. Moreover, we analytically reveal that, as opposed to our marginal density smoothing, the prevalent input gradient regularization smooths the conditional or joint density of the input, which can cause limited robustness. Our experiments validate the effectiveness of the proposed method, providing clear evidence of its capability to address the feature leakage problem and mitigate spurious correlations. Extensive results further establish that our technique enables the model to exhibit robustness against perturbations in pixel values, input gradients, and density.


# 14
SCOD: From Heuristics to Theory

Vojtech Franc · Jakub Paplham · Daniel Prusa

This paper addresses the problem of designing reliable prediction models that abstain from predictions when faced with uncertain or out-of-distribution samples - a recently proposed problem known as Selective Classification in the presence of Out-of-Distribution data (SCOD). We make three key contributions to SCOD. Firstly, we demonstrate that the optimal SCOD strategy involves a Bayes classifier for in-distribution (ID) data and a selector represented as a stochastic linear classifier in a 2D space, using i) the conditional risk of the ID classifier, and ii) the likelihood ratio of ID and out-of-distribution (OOD) data as input. This contrasts with suboptimal strategies from current OOD detection methods and the Softmax Information Retaining Combination (SIRC), specifically developed for SCOD. Secondly, we establish that in a distribution-free setting, the SCOD problem is not Probably Approximately Correct learnable when relying solely on an ID data sample. Third, we introduce POSCOD, a simple method for learning a Plugin estimate of the Optimal SCOD strategy from both an ID data sample and an unlabeled mixture of ID and OOD data. Our empirical results confirm the theoretical findings and demonstrate that our proposed method, POSCOD, outperforms existing OOD methods in effectively addressing the SCOD problem.
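
The structure of the optimal strategy can be sketched as follows: the selector is a (possibly randomized) linear rule in the 2D space spanned by the ID classifier's conditional risk and the ID/OOD likelihood ratio. The weights below are placeholders for illustration, not values learned by POSCOD.

```python
# A 2D stochastic linear selector for accept/abstain decisions (sketch).
import numpy as np

def selector(cond_risk, lik_ratio, w=(1.0, -1.0), b=0.0, rng=np.random.default_rng(0)):
    """Accept (True) or abstain (False) based on a linear score over the
    (conditional risk, likelihood ratio) pair; randomize on the boundary."""
    score = w[0] * cond_risk + w[1] * lik_ratio + b
    if np.isclose(score, 0.0):
        return bool(rng.random() < 0.5)   # randomized tie-breaking on the boundary
    return score < 0.0                    # accept when the combined score is low

accept = selector(cond_risk=0.12, lik_ratio=0.8)
```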


# 16
LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration

Siqi Wang · Bryan Plummer

Learning with noisy labels (LNL) aims to train a high-performing model using a noisy dataset. We observe that noise for a given class often comes from a limited set of categories, yet many LNL methods overlook this. For example, an image mislabeled as a cheetah is more likely a leopard than a hippopotamus due to its visual similarity. Thus, we explore Learning with Noisy Labels with noise source Knowledge integration (LNL+K), which takes advantage of knowledge about likely source(s) of label noise that is often already provided in a dataset's meta-data. We find that integrating noise source knowledge boosts performance even in settings where LNL methods typically fail. For example, LNL+K methods are effective on datasets where noise represents the majority of samples, which breaks a critical premise of most methods developed for LNL. We also find that LNL+K methods can boost performance even when the noise sources are estimated rather than provided in the meta-data. Our experiments provide several baseline LNL+K methods that integrate noise source knowledge into state-of-the-art LNL models evaluated across six diverse datasets and two types of noise, where we report gains of up to 23% compared to the unadapted methods. Critically, we show that LNL methods fail to generalize on some real-world datasets, even when adapted to integrate noise source knowledge, highlighting the importance of directly exploring LNL+K.


# 18
Strong Double Blind
SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

ZERUN WANG · Liuyu Xiang · Lang Huang · Jiafeng Mao · Ling Xiao · Toshihiko Yamasaki

Open-set semi-supervised learning (OSSL) leverages practical open-set unlabeled data for semi-supervised learning (SSL). Open-set unlabeled data comprises in-distribution (ID) samples from seen classes and out-of-distribution (OOD) samples from unseen classes. Prior OSSL methods initially learn the decision boundary of ID and OOD with labeled ID data, followed by self-training to enhance it. These methods, however, suffer from the tendency to overtrust the labeled ID data: the distribution bias between the limited labeled samples and the entire ID data misleads the decision boundary to overfit. The subsequent self-training process, based on the overfitted result, fails to rectify this problem. In this paper, we address the overtrusting issue by treating OOD as an additional class and forming a new SSL process. Specifically, we propose SCOMatch, a novel OSSL method that 1) selects reliable OOD samples as new labeled data by our OOD memory queue and corresponding update strategy, and 2) integrates the new SSL process into the original task through our Simultaneous Close-set and Open-set self-training. SCOMatch refines the decision boundary of ID and OOD classes across the entire dataset, thereby leading to a better result. Extensive experimental results show that SCOMatch significantly outperforms the state-of-the-art methods on various benchmarks. Meanwhile, the effectiveness is verified through ablation studies and visualization.


# 23
Labeled Data Selection for Category Discovery

Bingchen Zhao · Nico Lang · Serge Belongie · Oisin Mac Aodha

Category discovery methods aim to find novel categories in unlabeled visual data. At training time, a set of labeled and unlabeled images are provided, where the labels correspond to the categories present in the images. The labeled data provides guidance during training by indicating what types of visual properties and features are relevant for performing discovery in the unlabeled data. As a result, changing the categories present in the labeled set can have a large impact on what is ultimately discovered in the unlabeled set. Despite its importance, the impact of labeled data selection has not been explored in the category discovery literature to date. We show that changing the labeled data can significantly impact discovery performance. Motivated by this, we propose two new approaches for automatically selecting the most suitable labeled data based on the similarity between the labeled and unlabeled data. Our observation is that, unlike in conventional supervised transfer learning, the best labeled data is neither too similar, nor too dissimilar, to the unlabeled categories. Our resulting approaches obtain state-of-the-art discovery performance across a range of challenging fine-grained benchmark datasets.


# 12
Strong Double Blind
PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery

Fernando Julio Cendra · Bingchen Zhao · Kai Han

The primary objective of Continual Category Discovery (CCD) is to automatically discover novel categories in a continuous stream of unlabelled data without experiencing catastrophic forgetting, which remains an open problem even in conventional, fully supervised continual learning. To address this challenge, we propose PromptCCD, a simple yet effective framework that utilizes a Gaussian mixture model as a prompting method for CCD. At the core of PromptCCD is the Gaussian Mixture Prompting (GMP) module, which acts as a dynamic pool updated over time to guide embedding data representation and avoid forgetting during category discovery. Additionally, GMP enables on-the-fly estimation of category numbers, which allows PromptCCD to discover categories in the unlabelled data without prior knowledge of category numbers. We extend the standard evaluation metrics for Generalized Category Discovery to CCD and benchmark state-of-the-art methods using different datasets. PromptCCD significantly outperforms other methods, demonstrating the effectiveness of our approach. Code will be available.
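
One way to picture the Gaussian Mixture Prompting module is sketched below: a GMM fitted on backbone features supplies component means that act as prompt vectors for a sample. The interface and the top-k choice are assumptions for illustration, not the released PromptCCD code.

```python
# GMM components as a dynamic prompt pool (illustrative sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

features = np.random.randn(500, 768)           # unlabelled image features from the backbone
gmm = GaussianMixture(n_components=10, covariance_type='diag').fit(features)

def get_prompts(x):
    """Return the means of the top-scoring mixture components for sample x."""
    resp = gmm.predict_proba(x[None])[0]       # responsibilities over components
    top = np.argsort(resp)[::-1][:3]           # pick the 3 most relevant components
    return gmm.means_[top]                     # (3, 768) prompt vectors

prompts = get_prompts(features[0])             # prepend these to the token sequence
```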


# 2
Strong Double Blind
Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

Hao Dong · Eleni Chatzi · Olga Fink

The task of open-set domain generalization (OSDG) involves recognizing novel classes within unseen domains, which becomes more challenging with multiple modalities as input. Existing works have only addressed unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we introduce a novel approach to address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, utilizing self-supervision. To this end, we introduce two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of multimodal representative features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to also tackle the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, especially in scenarios where unlabeled data from the target domain is available. Extensive experiments conducted under MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code will be made available.


# 8
Strong Double Blind
Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation

Hritam Basak · Zhaozheng Yin

Semi-supervised Domain Adaptation (SSDA) encompasses the process of adapting representations acquired from the source domain to a new target domain, utilizing a limited number of labeled samples in conjunction with an abundance of unlabeled data from the target domain. Simple aggregation of domain adaptation (DA) and semi-supervised learning (SSL) falls short of optimal performance due to two primary challenges: (1) skewed training data distribution favoring the source representation learning, and (2) the persistence of superfluous domain-specific features, hindering effective domain-agnostic (i.e., task-specific) feature extraction. In pursuit of greater generalizability and robustness, we present an SSDA framework with a new episodic learning strategy: "learn, forget, then learn more". First, we train two encoder-classifier pairs, one for the source and the other for the target domain, aiming to learn domain-specific features. This involves minimizing classification loss for in-domain images and maximizing uncertainty loss for out-of-domain images. Subsequently, we transform the images into a new space, strategically unlearning (forgetting) the domain-specific representations while preserving their structural similarity to the originals. This proactive removal of domain-specific attributes is complemented by learning more domain-agnostic features using a Gaussian-guided latent alignment (GLA) strategy that uses a prior distribution to align domain-agnostic source and target representations. The proposed SSDA framework can be further extended to unsupervised domain adaptation (UDA). Evaluation across two domain adaptive image classification tasks reveals our method's superiority over state-of-the-art (SoTA) methods in both SSDA and UDA scenarios. Code will be released.


# 11
Strong Double Blind
CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning

Junghun Oh · Sungyong Baik · Kyoung Mu Lee

Aiming to incrementally learn new classes with only few samples while preserving the knowledge of base (old) classes, few-shot class-incremental learning (FSCIL) faces several challenges, such as overfitting and catastrophic forgetting. Such a challenging problem is often tackled by fixing a feature extractor trained on base classes to reduce the adverse effects of overfitting and forgetting. Under such formulation, our primary focus is representation learning on base classes to tackle the unique challenge of FSCIL: simultaneously achieving the transferability and the discriminability of the learned representation. Building upon the recent efforts for enhancing transferability, such as promoting the spread of features, we find that trying to secure the spread of features within a more confined feature space enables the learned representation to strike a better balance between transferability and discriminability. Thus, in stark contrast to prior beliefs that the inter-class distance should be maximized, we claim that the closer different classes are, the better for FSCIL. The empirical results and analysis from the perspective of information bottleneck theory justify our simple yet seemingly counter-intuitive representation learning method, raising research questions and suggesting alternative research directions.


# 7
Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling

Wonho Bae · Jing Wang · Danica J. Sutherland

Most meta-learning methods assume that the (very small) context set used to establish a new task at test time is passively provided. In some settings, however, it is feasible to actively select which points to label; the potential gain from a careful choice is substantial, but the setting requires major differences from typical active learning setups. We clarify the ways in which active meta-learning can be used to label a context set, depending on which parts of the meta-learning process use active learning. Within this framework, we propose a natural algorithm based on fitting Gaussian mixtures for selecting which points to label; though simple, the algorithm also has theoretical motivation. The proposed algorithm outperforms state-of-the-art active learning methods when used with various meta-learning algorithms across several benchmark datasets.
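
A sketch of the Gaussian-mixture selection step under stated assumptions (features are precomputed and one point is labelled per component); the paper's algorithm may differ in detail.

```python
# GMM-based selection of which context points to label (sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

def select_context_points(features, k):
    """Pick k points to label: fit a k-component GMM and take the point
    nearest to each component mean."""
    gmm = GaussianMixture(n_components=k, random_state=0).fit(features)
    chosen = []
    for mean in gmm.means_:
        dists = np.linalg.norm(features - mean, axis=1)
        chosen.append(int(np.argmin(dists)))
    return chosen

pool = np.random.randn(200, 64)                # unlabeled pool features for a new task
to_label = select_context_points(pool, k=5)    # indices to query labels for
```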


# 6
Strong Double Blind
MagMax: Leveraging Model Merging for Seamless Continual Learning

Daniel Marczak · Bartlomiej Twardowski · Tomasz Trzcinski · Sebastian Cygert

This paper introduces a continual learning approach named MagMax, which utilizes model merging to enable large pre-trained models to continuously learn from new data without forgetting previously acquired knowledge. Distinct from traditional continual learning methods that aim to reduce forgetting during task training, MagMax combines sequential fine-tuning with a maximum magnitude weight selection for effective knowledge integration across tasks. Our initial contribution is an extensive examination of model merging techniques, revealing that simple approaches like weight averaging and random weight selection surprisingly hold up well in various continual learning contexts. More importantly, we present MagMax, a novel model-merging strategy that enables continual learning of large pre-trained models for successive tasks. Our thorough evaluation demonstrates the superiority of MagMax in various scenarios, including class- and domain-incremental learning settings.
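
The maximum-magnitude selection can be sketched directly on task vectors (differences between each sequentially fine-tuned checkpoint and the pre-trained one); this compact version reflects the described rule, though the names and state-dict interface are my assumptions.

```python
# Merge task vectors by per-element maximum magnitude (sketch).
import torch

def magmax_merge(pretrained, finetuned_list):
    """For each parameter element, keep the task-vector entry with the
    largest absolute magnitude across sequentially fine-tuned models."""
    merged = {}
    for name, base in pretrained.items():
        deltas = torch.stack([ft[name] - base for ft in finetuned_list])  # task vectors
        idx = deltas.abs().argmax(dim=0, keepdim=True)                    # per-element winner
        merged[name] = base + deltas.gather(0, idx).squeeze(0)
    return merged

# Usage with toy state dicts:
base = {"w": torch.zeros(3)}
fts = [{"w": torch.tensor([0.5, -0.1, 0.0])}, {"w": torch.tensor([-0.2, 0.4, 0.3])}]
print(magmax_merge(base, fts)["w"])   # tensor([0.5000, 0.4000, 0.3000])
```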


# 3
Strong Double Blind
Pick-a-back: Selective Device-to-Device Knowledge Transfer in Federated Continual Learning

JinYi Yoon · HyungJune Lee

With the explosion of edge intelligence, leveraging federated indirect knowledge has become crucial for boosting individual learners. However, the conventional approach to knowledge reuse often leads to catastrophic forgetting issues. In this paper, we revisit the concept of continual learning in the context of edge intelligence and address the knowledge transfer problem to enhance federated continual learning. Since each learner processes private heterogeneous data, we propose Pick-a-back, a device-to-device knowledge federation framework by selectively reusing the external knowledge with similar behavioral patterns. By borrowing indirect experiences, an edge device can initiate learning from useful knowledge and thus achieve faster yet more generalized knowledge acquisition. Using continual tasks consisting of various datasets on lightweight architectures, we validated that Pick-a-back provides a significant inference improvement of up to 8.0% via selective knowledge federation.


# 9
Strong Double Blind
Learning to Unlearn for Robust Machine Unlearning

Mark HUANG · Lin Geng Foo · Jun Liu

Machine unlearning (MU) seeks to remove knowledge of specific data samples from trained models without the necessity for complete retraining, a task made challenging by the dual objectives of effective erasure of data and maintaining the overall performance of the model. Despite recent advances in this field, balancing between the dual objectives of unlearning remains challenging. From a fresh perspective of generalization, we introduce a novel Learning-to-Unlearn (LTU) framework, which adopts a meta-learning approach to optimize the unlearning process to improve forgetting and remembering in a unified manner. LTU includes a meta-optimization scheme that facilitates models to effectively preserve generalizable knowledge with only a small subset of the remaining set, while thoroughly forgetting the specific data samples. We also introduce a Gradient Harmonization strategy to align the optimization trajectories for remembering and forgetting via mitigating gradient conflicts, thus ensuring efficient and effective model updates. Our approach demonstrates improved efficiency and efficacy for MU, offering a promising solution to the challenges of data rights and model reusability.
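
As an illustration of mitigating gradient conflicts between the forgetting and remembering objectives, the sketch below uses a PCGrad-style projection; the paper's Gradient Harmonization strategy may differ in its exact form.

```python
# Resolve conflicting gradients before a combined update (sketch).
import torch

def harmonize(g_forget, g_retain):
    """If the two gradients conflict (negative dot product), project the
    forgetting gradient onto the plane orthogonal to the retaining one."""
    dot = torch.dot(g_forget, g_retain)
    if dot < 0:
        g_forget = g_forget - dot / g_retain.norm().pow(2) * g_retain
    return g_forget + g_retain            # combined update direction

g_f = torch.tensor([1.0, -2.0])           # gradient of the forgetting loss
g_r = torch.tensor([1.0, 1.0])            # gradient of the remembering loss
step = harmonize(g_f, g_r)                # conflict removed before the update
```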


# 10
Strong Double Blind
UNIC: Universal Classification Models via Multi-teacher Distillation

Yannis Kalantidis · Larlus Diane · Mert Bulent SARIYILDIZ · Philippe Weinzaepfel · Thomas Lucas

Pretrained models have become a commodity and offer strong results on a broad range of tasks. As they resort to different learning strategies, they tend to be complementary. In this work, we focus on classification and seek to learn a unique encoder able to draw on several of those pretrained models, aiming at even stronger generalization across a variety of classification tasks. We propose to learn such an encoder via multi-teacher distillation. We first thoroughly analyse standard distillation when driven by multiple strong teachers with complementary strengths. Guided by this analysis, we gradually propose improvements to the basic distillation setup. Among those, we enrich the architecture of the encoder with a ladder of expendable projectors, which increases the impact of intermediate features during distillation, and we introduce teacher dropping, a regularization mechanism that better balances the teachers' influence. Our final distillation strategy leads to student models of the same capacity as any of the teachers, while retaining or improving upon the performance of the best teacher for each task.
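
A rough sketch of teacher dropping in a multi-teacher distillation loss; the cosine-distance distillation term and the dropping probability are illustrative assumptions, not the exact UNIC recipe.

```python
# Multi-teacher distillation with random teacher dropping (sketch).
import torch
import torch.nn.functional as F

def distill_loss(student_feats, teacher_feats_list, p_drop=0.3):
    """Cosine-distance distillation to each teacher; each teacher is
    randomly dropped per batch to rebalance its influence."""
    losses = []
    for t_feats in teacher_feats_list:
        if torch.rand(1).item() < p_drop:
            continue                                  # drop this teacher for this batch
        losses.append(1 - F.cosine_similarity(student_feats, t_feats, dim=-1).mean())
    if not losses:                                    # keep at least one teacher
        t_feats = teacher_feats_list[0]
        losses.append(1 - F.cosine_similarity(student_feats, t_feats, dim=-1).mean())
    return torch.stack(losses).mean()

loss = distill_loss(torch.randn(4, 256), [torch.randn(4, 256) for _ in range(3)])
```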


# 13
Strong Double Blind
Distributed Active Client Selection With Noisy Clients Using Model Association Scores

Kwang In Kim

Active client selection (ACS) strategically identifies clients to participate in model updates during each training round of federated learning. In scenarios with limited communication resources, ACS emerges as a superior alternative to random client selection, significantly improving the convergence rate. However, current ACS methodologies face challenges in managing clients that provide erroneous model updates, such as those arising from noisy labels. To address this challenge, we introduce a new ACS algorithm tailored for scenarios with unknown erroneous clients. Our algorithm constructs a client sampling distribution based on the global association among model updates, which quantifies the ability of a client's model update to align with updates from other clients. Leveraging these model associations, we efficiently identify clients whose updates may contain substantial errors, potentially disrupting the overall model training process. This approach is simple, computationally efficient, and eliminates the need for hyperparameter tuning. Our experiments, conducted on six benchmark datasets that encompass different types of erroneous and potentially malicious clients, demonstrate that conventional ACS methods, not designed for erroneous clients, fail to outperform random selection. In contrast, our approach significantly enhances convergence speed while using the same communication resources.
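
A sketch of turning pairwise agreement between client updates into a sampling distribution; cosine similarity stands in for the paper's model association score, which is defined differently.

```python
# Client sampling distribution from pairwise update agreement (sketch).
import torch

def sampling_distribution(client_updates):
    """Clients whose updates agree with the others get higher probability;
    outlier (potentially erroneous) clients get down-weighted."""
    U = torch.stack([torch.nn.functional.normalize(u, dim=0) for u in client_updates])
    assoc = U @ U.T                                   # pairwise cosine similarities
    assoc.fill_diagonal_(0)
    score = assoc.mean(dim=1).clamp(min=0) + 1e-8     # average agreement per client
    return score / score.sum()

updates = [torch.randn(100) for _ in range(5)]        # flattened model updates
probs = sampling_distribution(updates)                # sample clients proportionally to probs
```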


# 218
Strong Double Blind
Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching

Ruonan Yu · Songhua Liu · Jingwen Ye · Xinchao Wang

Dataset distillation or condensation involves the synthesis of a large-scale dataset into a much smaller one, enabling models trained on this synthetic dataset to generalize effectively on real data. Tackling this challenge, as defined, relies on a bi-level optimization algorithm: a novel model is trained in each iteration within a nested loop, with gradients propagated through an unrolled computation graph. However, this approach incurs high memory and time complexity, posing difficulties in scaling up to large datasets such as ImageNet. Addressing these concerns, this paper introduces Teddy, a Taylor-approximated dataset distillation framework designed to handle large-scale datasets and enhance efficiency. On the one hand, backed by theoretical analysis, we propose a memory-efficient approximation derived from Taylor expansion, which transforms the original form dependent on multi-step gradients into a first-order one. On the other hand, rather than repeatedly training a novel model in each iteration, we unveil that employing a pre-cached pool of weak models, which can be generated from a single base model, enhances both time efficiency and performance concurrently, particularly when dealing with large-scale datasets. Extensive experiments demonstrate that the proposed Teddy attains state-of-the-art efficiency and performance on the Tiny-ImageNet and original-sized ImageNet-1K datasets, notably surpassing prior methods by up to 12.8%, while reducing runtime by 46.6%.


# 5
Strong Double Blind
FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning

Boyu Fan · Chenrui Wu · Xiang Su · Pan HUI

Despite extensive research into data heterogeneity in federated learning (FL), system heterogeneity remains a significant yet often overlooked challenge. Traditional FL approaches typically assume homogeneous hardware resources across FL clients, implying that clients can train a global model within a comparable time. However, in practical FL systems, clients often have heterogeneous resources, which impacts their capacity for training tasks. This discrepancy highlights the significance of exploring model-heterogeneous FL, a paradigm that allows clients to train different models based on their resource capabilities. To address this, we introduce FedTSA, a cluster-based two-stage aggregation method tailored for system heterogeneity in FL. FedTSA starts by clustering clients based on their capabilities, then conducts a two-stage aggregation, i.e., conventional weight averaging for homogeneous models as Stage 1, and deep mutual learning with a diffusion model for aggregating heterogeneous models as Stage 2. Extensive experiments not only show that FedTSA outperforms the baselines, but also explore various factors influencing model performance, thereby validating FedTSA as a promising approach for model-heterogeneous FL.


# 20
Strong Double Blind
Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge

Hyejin Park · Dongbo Min

In the realm of Adversarial Distillation (AD), strategic and precise knowledge transfer from an adversarially robust teacher model to a less robust student model is paramount. Our Dynamic Guidance Adversarial Distillation (DGAD) framework directly tackles the challenge of differential sample importance, with a keen focus on rectifying the teacher model's misclassifications. DGAD employs Misclassification-Aware Partitioning (MAP) to dynamically tailor the distillation focus, optimizing the learning process by steering towards the most reliable teacher predictions. Additionally, our Error-corrective Label Swapping (ELS) corrects the teacher's misclassifications on both clean and adversarially perturbed inputs, refining the quality of knowledge transfer. Further, Predictive Consistency Regularization (PCR) guarantees consistent performance of the student model across both clean and adversarial inputs, significantly enhancing its overall robustness. By integrating these methodologies, DGAD significantly improves accuracy on clean data and fortifies the model's defenses against sophisticated adversarial threats. Our experimental validation on CIFAR10, CIFAR100, and Tiny ImageNet datasets, employing various model architectures, demonstrates the efficacy of DGAD, establishing it as a promising approach for enhancing both the robustness and accuracy of student models in adversarial settings.


# 17
Strong Double Blind
Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting

Masoumeh Zareapoor · Pourya Shamsolmoali

Catastrophic overfitting (CO) poses a significant challenge to fast adversarial training (FastAT), particularly at large perturbation scales, leading to dramatic reductions in adversarial test accuracy. Our analysis of existing FastAT methods shows that CO is accompanied by abrupt and irregular fluctuations in loss convergence, indicating that a stable training dynamic is key to preventing CO. Therefore, we propose a training model that uses the Douglas-Rachford (DR) splitting technique to ensure a balanced and consistent training progression, effectively counteracting CO. The DR splitting technique, known for its ability to solve complex optimization problems, offers a distinct advantage over classical FastAT methods by providing a smoother loss convergence. This is achieved without resorting to complex regularization or incurring the computational costs associated with double backpropagation, presenting an efficient solution to enhance adversarial robustness. Our comprehensive evaluation, conducted across standard datasets, demonstrates that our DR splitting-based model not only improves adversarial robustness but also achieves this with remarkable efficiency compared to various FastAT methods. This efficiency is particularly observed under conditions involving long training schedules and large adversarial perturbations.
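
For reference, the generic Douglas-Rachford splitting iteration for minimizing a sum $f + g$ of two functions is shown below (standard form; how the adversarial training objective is split into $f$ and $g$ is specific to the paper and not reproduced here):

$$
\begin{aligned}
x^{k+1} &= \operatorname{prox}_{\lambda f}\left(z^{k}\right),\\
y^{k+1} &= \operatorname{prox}_{\lambda g}\left(2x^{k+1} - z^{k}\right),\\
z^{k+1} &= z^{k} + y^{k+1} - x^{k+1}.
\end{aligned}
$$

Under standard convexity assumptions, the iterates $x^{k}$ converge to a minimizer of $f + g$, which is the kind of smooth, stable progression the method relies on.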


# 19
A high-quality robust diffusion framework for corrupted dataset

Quan Dao · Binh Ta · Tung Pham · Anh Tran

Developing image-generative models that are robust to outliers in the training process has recently drawn attention from the research community. Due to the ease of integrating unbalanced optimal transport (UOT) into an adversarial framework, existing works focus mainly on developing robust frameworks for generative adversarial networks (GANs). Meanwhile, diffusion models have recently dominated GANs across various tasks and datasets. However, to the best of our knowledge, none of them are robust to corrupted datasets. Motivated by DDGAN, our work introduces the first robust-to-outlier diffusion framework. We propose replacing the GAN in DDGAN with a UOT-based generative model to learn the backward diffusion process. Additionally, we demonstrate that the Lipschitz property of the divergence in our framework contributes to more stable training convergence. Remarkably, our method not only exhibits robustness to corrupted datasets but also achieves superior performance on clean datasets.


# 167
Similarity of Neural Architectures using Adversarial Attack Transferability

Jaehui Hwang · Dongyoon Han · Byeongho Heo · Song Park · Sanghyuk Chun · Jong-Seok Lee

In recent years, many deep neural architectures have been developed for image classification. Whether they are similar or dissimilar, and what factors contribute to their (dis)similarities, remains poorly understood. To address this question, we aim to design a quantitative and scalable similarity measure between neural architectures. We propose Similarity by Attack Transferability (SAT), based on the observation that adversarial attack transferability contains information related to input gradients and decision boundaries, which are widely used to understand model behaviors. We conduct a large-scale analysis on 69 state-of-the-art ImageNet classifiers using our proposed similarity function to answer the question. In addition, we provide interesting insights into ML applications using multiple models, such as model ensembles and knowledge distillation. Our results show that using diverse neural architectures with distinct components can be beneficial in such scenarios.
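
A minimal sketch of measuring similarity through attack transferability: craft adversarial examples on one model and check how often they fool the other. FGSM and the fooling-rate score are simplifications for brevity, not the paper's exact SAT definition.

```python
# Attack-transferability score between two classifiers (sketch).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=4 / 255):
    """One-step FGSM adversarial example crafted on `model`."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

def attack_transfer_similarity(model_a, model_b, x, y):
    """Fraction of adversarial examples crafted on model_a that also fool
    model_b; higher transfer suggests more similar decision boundaries."""
    x_adv = fgsm(model_a, x, y)
    with torch.no_grad():
        fooled = (model_b(x_adv).argmax(dim=1) != y).float().mean()
    return fooled.item()

m1, m2 = torch.nn.Linear(10, 3), torch.nn.Linear(10, 3)   # toy stand-ins for classifiers
sim = attack_transfer_similarity(m1, m2, torch.randn(32, 10), torch.randint(0, 3, (32,)))
```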


# 164
Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data

Yuxuan Li · Sarthak Kumar Maharana · Yunhui Guo

With the increasing prevalence of Machine Learning as a Service (MLaaS) platforms, there is a growing focus on deep neural network (DNN) watermarking techniques. These methods are used to facilitate the verification of ownership for a target DNN model to protect intellectual property. One of the most widely employed watermarking techniques involves embedding a trigger set into the source model. Unfortunately, existing methodologies based on trigger sets are still susceptible to functionality-stealing attacks, potentially enabling adversaries to steal the functionality of the source model without a reliable means of verifying ownership. In this paper, we first analyze trigger set-based watermarking methods from a feature learning perspective. Specifically, we demonstrate that by selecting data exhibiting multiple features, also referred to as multi-view data, it becomes feasible to effectively defend against functionality-stealing attacks. Based on this perspective, we introduce a novel watermarking technique based on Multi-view dATa, called MAT, for efficiently embedding watermarks within DNNs. This approach involves constructing a trigger set with multi-view data and incorporating a simple feature-based regularization method for training the source model. We validate our method across various benchmarks and demonstrate its efficacy in defending against model extraction attacks, surpassing relevant baselines by a significant margin.


# 179
Resilience of Entropy Model in Distributed Neural Networks

Milin Zhang · Mohammad Abdi · Shahriar Rifat · Francesco Restuccia

Distributed deep neural networks (DNNs) have emerged as a key technique to reduce communication overhead without sacrificing performance in edge computing systems. Recently, entropy coding has been introduced to further reduce the communication overhead. The key idea is to train the distributed DNN jointly with an entropy model, which is used as side information during inference time to adaptively encode latent representations into bit streams with variable length. To the best of our knowledge, the resilience of entropy models is yet to be investigated. As such, in this paper we formulate and investigate the resilience of entropy models to intentional interference (e.g., adversarial attacks) and unintentional interference (e.g., weather changes and motion blur). Through an extensive experimental campaign with 3 different DNN architectures, 2 entropy models and 4 rate-distortion trade-off factors, we demonstrate that the entropy attacks can increase the communication overhead by up to 95%. By separating compression features in frequency and spatial domain, we propose a new defense mechanism that can reduce the transmission overhead of the attacked input by about 9% compared to unperturbed data, with only about 2% accuracy loss. Importantly, the proposed defense mechanism is a standalone approach which can be applied in conjunction with approaches such as adversarial training to further improve robustness. Code will be shared for reproducibility.


# 163
Strong Double Blind
WBP: Training-time Backdoor Attacks through Hardware-based Weight Bit Poisoning

Kunbei Cai · Zhenkai Zhang · Qian Lou · Fan Yao

Pre-trained models are widely used in machine learning (ML) due to the minimal demand for computational resources and training data. Recent studies show that pre-trained models are vulnerable to backdoor attacks. Additionally, prior studies on hardware security have indicated that ML systems could potentially be compromised through bit-flip attacks using Rowhammer. In this paper, we introduce WBP (weight bit poisoning), a novel attack framework that allows an attacker to implant a task-agnostic backdoor into the victim model during the fine-tuning process through limited weight bit flips. Notably, WBP aims to directly maximize the distance between the output representations of normal and triggered inputs. We evaluate WBP on state-of-the-art CNNs and Vision Transformer models with a variety of downstream tasks. Our experimental results demonstrate that, without any prior knowledge of fine-tuning datasets, WBP can compromise a wide range of downstream tasks with a 99.3% attack success rate on average by flipping as few as 11 bits among millions of parameters.


# 235
Instant 3D Human Avatar Generation using Image Diffusion Models

Nikos Kolotouros · Thiemo Alldieck · Enric Corona · Eduard Gabriel Bazavan · Cristian Sminchisescu

We present AvatarPopUp, a method for fast, high-quality 3D human avatar generation from different input modalities, such as images and text prompts, with control over the generated pose and shape. The common theme is the use of diffusion-based image generation networks that are specialized for each particular task, followed by a 3D lifting network. We purposefully decouple generation from 3D modeling, which allows us to leverage powerful image synthesis priors, trained on billions of text-image pairs. We fine-tune latent diffusion networks with additional image conditioning to solve tasks such as image generation and back-view prediction, and to support qualitatively different multiple 3D hypotheses. Our partial fine-tuning approach allows the networks to be adapted for each task without inducing catastrophic forgetting. In experiments, we demonstrate that our method produces accurate, high-quality 3D avatars with diverse appearance that respect the multimodal text, image, and body control signals. Our approach can produce a 3D mesh in as few as 2 seconds (a four-orders-of-magnitude speedup w.r.t. the vast majority of existing methods, most of which solve only a subset of our tasks and offer fewer controls), thus enabling applications that require the controlled 3D generation of human avatars at scale.