Poster Session
Poster Session 4
Exhibition Area
TimeLens-XL: Real-time Event-based Video Frame Interpolation with Large Motion
Shi Guo · Yutian Chen · Tianfan Xue · Jinwei Gu · Yongrui Ma
Video Frame Interpolation (VFI) aims to predict intermediate frames between consecutive low-frame-rate inputs. To handle the complex real-world motion between frames, event cameras, which capture high-frequency brightness changes at microsecond temporal resolution, are used to aid interpolation, a setting denoted as Event-VFI. One critical step of Event-VFI is optical flow estimation. Prior methods that adopt either a two-segment formulation or a parametric trajectory model cannot correctly recover large and complex motions between frames and suffer from accumulated errors in flow estimation. To solve this problem, we propose TimeLens-XL, a physically grounded lightweight network that decomposes the large motion between two frames into a sequence of small motions for better accuracy. It estimates the entire motion trajectory recursively and samples the bi-directional flow for VFI. Benefiting from the accurate and robust flow prediction, intermediate frames can be efficiently synthesized with simple warping and blending. As a result, the network is extremely lightweight, with only 1/5 to 1/10 of the computational cost and model size of prior works, while also achieving state-of-the-art performance on several challenging benchmarks. To our knowledge, TimeLens-XL is the first real-time (27 fps) Event-VFI algorithm at a resolution of 1280x720 on a single RTX 3090 GPU. Furthermore, we have collected a new RGB+Event dataset (HQ-EVFI) consisting of more than 100 challenging scenes with large, complex motions and accurately synchronized high-quality RGB-EVS streams. HQ-EVFI addresses several limitations of prior datasets and can serve as a new benchmark. Both the code and dataset will be released upon publication.
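A minimal sketch of the motion-decomposition idea described above (illustrative only, not the authors' implementation; the slice count, constant toy flows, and nearest-neighbour composition are assumptions):

```python
import numpy as np

def compose_flows(flow_ab, flow_bc):
    """Chain two dense flows (H, W, 2): a point moves a -> b via flow_ab, then b -> c via
    flow_bc sampled at the advected location (nearest-neighbour lookup for simplicity)."""
    H, W, _ = flow_ab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xb = np.clip(np.round(xs + flow_ab[..., 0]).astype(int), 0, W - 1)
    yb = np.clip(np.round(ys + flow_ab[..., 1]).astype(int), 0, H - 1)
    return flow_ab + flow_bc[yb, xb]

def accumulate_trajectory(small_flows):
    """Recursively accumulate per-slice flows into cumulative flows from frame 0 onward."""
    cumulative = [small_flows[0]]
    for f in small_flows[1:]:
        cumulative.append(compose_flows(cumulative[-1], f))
    return cumulative  # cumulative[-1] approximates the full inter-frame motion

# toy usage: eight event slices, each carrying a small constant shift of (1, 0) pixels
slices = [np.tile(np.array([1.0, 0.0]), (64, 64, 1)) for _ in range(8)]
full_flow = accumulate_trajectory(slices)[-1]
print(full_flow[32, 32])  # ~[8. 0.]: a large motion recovered as a chain of small steps
```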
MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition
Aggelina Chatziagapi · Grigorios Chrysos · Dimitris Samaras
We introduce MIGS (multi-identity Gaussian splatting), a novel method that learns a single neural representation for multiple identities, using only monocular videos. Recent 3D Gaussian Splatting (3DGS) approaches for human avatars require per-identity optimization. However, learning a multi-identity representation presents advantages in robustly animating humans under arbitrary poses. We propose to construct a high-order tensor that combines all the learnable parameters of our 3DGS representation for all the training identities. By factorizing the tensor, we model the complex rigid and non-rigid deformations of multiple human subjects in a unified network using a reduced number of parameters. Our proposed approach leverages information from all the training identities, enabling robust animation under challenging unseen poses, outperforming existing approaches. We also demonstrate how it can be extended to learn unseen identities.
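A minimal sketch of the multi-identity parameter-sharing idea via a CP-style factorization (illustrative only; the paper's exact tensor construction and decomposition, and all sizes below, are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N_id, N_gauss, N_attr, rank = 8, 100_000, 59, 32   # identities, Gaussians, per-Gaussian attributes

# one factor matrix per tensor mode (CP-style)
A = rng.normal(size=(N_id, rank))     # identity factors
B = rng.normal(size=(N_gauss, rank))  # per-Gaussian factors, shared across identities
C = rng.normal(size=(N_attr, rank))   # attribute factors (position, rotation, opacity, ...)

def gaussian_params(identity):
    """Reconstruct the (N_gauss, N_attr) parameter slice for one identity from the factors."""
    return np.einsum('r,gr,ar->ga', A[identity], B, C)

dense = N_id * N_gauss * N_attr
factored = A.size + B.size + C.size
print(f"dense tensor: {dense:,} values vs. factored: {factored:,} values")
print(gaussian_params(3).shape)  # (100000, 59)
```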
RaFE: Generative Radiance Fields Restoration
Zhongkai Wu · Ziyu Wan · Jing Zhang · Jing Liao · Dong Xu
NeRF (Neural Radiance Fields) has demonstrated tremendous potential in novel view synthesis and 3D reconstruction, but its performance is sensitive to input image quality, and it struggles to achieve high-fidelity rendering when provided with low-quality sparse input viewpoints. Previous methods for NeRF restoration are tailored to specific degradation types, ignoring the generality of restoration. To overcome this limitation, we propose a generic radiance fields restoration pipeline, named RaFE, which applies to various types of degradations, such as low resolution, blurriness, noise, JPEG compression artifacts, or their combinations. Our approach leverages the success of off-the-shelf 2D restoration methods to recover the multi-view images individually. Instead of reconstructing a blurred NeRF by averaging inconsistencies, we introduce a novel approach using Generative Adversarial Networks (GANs) for NeRF generation to better accommodate the geometric and appearance inconsistencies present in the multi-view images. Specifically, we adopt a two-level triplane architecture, where the coarse level remains fixed to represent the low-quality NeRF, and a fine-level residual triplane added to the coarse level is modeled as a distribution with a GAN to capture potential variations in restoration. We validate our proposed method on both synthetic and real cases for various restoration tasks, demonstrating superior performance in both quantitative and qualitative evaluations, surpassing other 3D restoration methods specific to single tasks.
Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration
Zhihao Liang · Qi Zhang · WENBO HU · Ying Feng · Lei ZHU · Kui Jia
3D Gaussian Splatting (3DGS) has recently gained popularity by combining the advantages of both primitive-based and volumetric 3D representations, resulting in improved quality and efficiency for 3D scene rendering. However, 3DGS is not alias-free, and its rendering at varying resolutions can produce severe blurring or jaggies. This is because 3DGS treats each pixel as an isolated, single point rather than as an area, causing insensitivity to changes in the footprints of pixels. Consequently, this discrete sampling scheme inevitably results in aliasing, owing to the restricted sampling bandwidth. In this paper, we derive an analytical solution to address this issue. More specifically, we use a conditioned logistic function as the analytic approximation of the cumulative distribution function (CDF) of a one-dimensional Gaussian signal and calculate the Gaussian integral by subtracting CDFs. We then introduce this approximation into two-dimensional pixel shading and present Analytic-Splatting, which analytically approximates the Gaussian integral within the 2D pixel window area to better capture the intensity response of each pixel. Moreover, we use the approximated response of the pixel window integral area in the transmittance calculation of volume rendering, making Analytic-Splatting sensitive to changes in pixel footprint at different resolutions. Experiments on various datasets validate that our approach has better anti-aliasing capability, giving more detail and better fidelity.
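A minimal sketch of the CDF-difference idea in 1D (illustrative only; the constant 1.702 is the classic logistic approximation of the standard normal CDF, not necessarily the paper's conditioned logistic function):

```python
import numpy as np

def approx_normal_cdf(x):
    """Logistic approximation of the standard normal CDF: Phi(x) ~ sigmoid(1.702 * x)."""
    return 1.0 / (1.0 + np.exp(-1.702 * x))

def pixel_response_1d(mu, sigma, pixel_center, pixel_size=1.0):
    """Integral of a unit-mass 1D Gaussian over a pixel window, as a difference of CDFs."""
    lo = (pixel_center - 0.5 * pixel_size - mu) / sigma
    hi = (pixel_center + 0.5 * pixel_size - mu) / sigma
    return approx_normal_cdf(hi) - approx_normal_cdf(lo)

# a narrow Gaussian whose mean sits off the pixel center
mu, sigma, center = 0.0, 0.2, 0.6
point_sample = np.exp(-0.5 * ((center - mu) / sigma) ** 2)   # density at the pixel center only
window_integral = pixel_response_1d(mu, sigma, center)
print(point_sample, window_integral)
# the point sample nearly misses the narrow Gaussian, while the window integral still
# accounts for the mass that actually falls inside the pixel footprint
```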
FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information
Wen Jiang · BOSHU LEI · Kostas Daniilidis
This study addresses the challenging problem of active view selection and uncertainty quantification within the domain of Radiance Fields. Neural Radiance Fields (NeRF) have greatly advanced image rendering and reconstruction, but the cost of acquiring images makes it necessary to select the most informative viewpoints efficiently. Existing approaches depend on modifying the model architecture or on a hypothetical perturbation field to indirectly approximate the model uncertainty. However, selecting views from an indirect approximation does not guarantee optimal information gain for the model. By leveraging Fisher Information, we directly quantify the observed information on the parameters of Radiance Fields and select candidate views by maximizing the Expected Information Gain (EIG). Our method achieves state-of-the-art results on multiple tasks, including view selection, active mapping, and uncertainty quantification, demonstrating its potential to advance the field of Radiance Fields. With a 3D Gaussian Splatting backend, our method can perform view selection at 70 fps.
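A minimal sketch of a Fisher-information view-selection criterion (illustrative only; the random Jacobians, the regularizer, and the trace-based score below are assumptions, not the authors' exact EIG formulation):

```python
import numpy as np

def fisher_information(J):
    """Gauss-Newton / Fisher approximation of the Hessian from a rendering Jacobian J
    (num_pixels x num_params), assuming Gaussian pixel noise."""
    return J.T @ J

def information_gain_score(H_observed, J_candidate, lam=1e-3):
    """Score a candidate view by how much its Fisher information overlaps the directions
    that are still uncertain under the currently observed views (trace criterion)."""
    P = H_observed.shape[0]
    cov = np.linalg.inv(H_observed + lam * np.eye(P))   # approximate posterior covariance
    return np.trace(fisher_information(J_candidate) @ cov)

rng = np.random.default_rng(0)
P = 50
H_observed = fisher_information(rng.normal(size=(200, P)))        # info from acquired views
candidates = [rng.normal(size=(100, P)) * s for s in (0.1, 1.0, 3.0)]
scores = [information_gain_score(H_observed, J) for J in candidates]
print(int(np.argmax(scores)))  # 2: the candidate with the most informative Jacobian wins
```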
Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
Yonggan Fu · Huaizhi Qu · Zhifan Ye · Chaojian Li · Kevin Zhao · Yingyan Lin
Recent breakthroughs in Neural Radiance Fields (NeRFs) have sparked significant demand for their integration into real-world 3D applications. However, the varied functionalities required by different 3D applications often necessitate diverse NeRF models with various pipelines, leading to tedious NeRF training for each target task and cumbersome trial-and-error experiments. Drawing inspiration from the generalization capability and adaptability of emerging foundation models, our work aims to develop one general-purpose NeRF for handling diverse 3D tasks. We achieve this by proposing a framework called Omni-Recon, which is capable of (1) generalizable 3D reconstruction and zero-shot multitask scene understanding, and (2) adaptability to diverse downstream 3D applications such as real-time rendering and scene editing. Our key insight is that an image-based rendering pipeline, with accurate geometry and appearance estimation, can lift 2D image features into their 3D counterparts, thus extending widely explored 2D tasks to the 3D world in a generalizable manner. Specifically, our Omni-Recon features a general-purpose NeRF model using image-based rendering with two decoupled branches: one complex transformer-based branch that progressively fuses geometry and appearance features for accurate geometry estimation, and one lightweight branch for predicting blending weights of source views. This design achieves state-of-the-art (SOTA) generalizable 3D surface reconstruction quality with blending weights reusable across diverse tasks for zero-shot multitask scene understanding. In addition, it can enable real-time rendering after baking the complex geometry branch into meshes, swift adaptation to achieve SOTA generalizable 3D understanding performance, and seamless integration with 2D diffusion models for text-guided 3D editing. All code will be released upon acceptance.
RPBG: Towards Robust Neural Point-based Graphics in the Wild
Qingtian Zhu · Zizhuang Wei · Zhongtian Zheng · Yifan Zhan · Zhuyu Yao · Jiawang Zhang · Kejian Wu · zheng yinqiang
Point-based representations have recently gained popularity in novel view synthesis for their unique advantages, e.g., intuitive geometric representation, simple manipulation, and faster convergence. However, based on our observation, these point-based neural re-rendering methods only perform well under ideal conditions and suffer from noisy, patchy points and unbounded scenes, which are challenging to handle but de facto common in real applications. To this end, we revisit one such influential method, Neural Point-based Graphics (NPBG), as our baseline, and propose Robust Point-based Graphics (RPBG). We analyze in depth the factors that prevent NPBG from achieving satisfactory renderings on generic datasets, and accordingly reform the pipeline to make it more robust to varying in-the-wild datasets. Inspired by practices in image restoration, we greatly enhance the neural renderer to enable attention-based correction of point visibility and inpainting of incomplete rasterization, with only acceptable overheads. We also seek a simple and lightweight alternative for environment modeling and an iterative method to alleviate the problem of poor geometry. By thorough evaluation on a wide range of datasets with different shooting conditions and camera trajectories, RPBG stably outperforms the baseline by a large margin and exhibits great robustness over state-of-the-art NeRF-based variants. Code is provided in the Supplementary Material.
MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
Yuedong Chen · Haofei Xu · Chuanxia Zheng · Bohan Zhuang · Marc Pollefeys · Andreas Geiger · Tat-Jen Cham · Jianfei Cai
We propose MVSplat, an efficient feed-forward 3D Gaussian Splatting model learned from sparse multi-view images. To accurately localize the Gaussian centers, we propose to build a cost volume representation via plane sweeping in 3D space, where the cross-view feature similarities stored in the cost volume provide valuable geometry cues for depth estimation. We learn the Gaussian primitives' opacities, covariances, and spherical harmonics coefficients jointly with the Gaussian centers while relying only on photometric supervision. We demonstrate the importance of the cost volume representation in learning feed-forward Gaussian Splatting models via extensive experimental evaluations. On the large-scale RealEstate10K and ACID benchmarks, our model achieves state-of-the-art performance with the fastest feed-forward inference speed (22 fps). Compared to the latest state-of-the-art method pixelSplat, our model uses 10 times fewer parameters and infers more than 2 times faster while providing higher appearance and geometry quality as well as better cross-dataset generalization.
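A minimal sketch of a plane-sweep cost volume built from cross-view feature correlation (illustrative only; a rectified two-view setup with integer disparities stands in for full homography warping, and all sizes are assumptions):

```python
import numpy as np

def shift_features(feat, disparity):
    """Warp (C, H, W) features horizontally by an integer disparity (rectified views assumed)."""
    return np.roll(feat, shift=disparity, axis=2)

def plane_sweep_cost_volume(feat_ref, feat_src, disparities):
    """Cross-view feature correlation for each depth (disparity) hypothesis.
    Returns a (D, H, W) volume; the best-matching plane indicates the depth."""
    C = feat_ref.shape[0]
    planes = []
    for d in disparities:
        warped = shift_features(feat_src, d)
        planes.append((feat_ref * warped).sum(axis=0) / np.sqrt(C))  # dot-product similarity
    return np.stack(planes, axis=0)

rng = np.random.default_rng(0)
feat_ref = rng.normal(size=(32, 48, 64)).astype(np.float32)
feat_src = shift_features(feat_ref, -5)     # source view: reference displaced by 5 px
cost = plane_sweep_cost_volume(feat_ref, feat_src, disparities=range(9))
print(int(cost.mean(axis=(1, 2)).argmax()))  # 5: the correct plane has the highest similarity
```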
Learning 3D-aware GANs from Unposed Images with Template Feature Field
XINYA CHEN · Hanlei Guo · Yanrui Bin · Shangzhan Zhang · Yuanbo Yang · Yujun Shen · Yue Wang · Yiyi Liao
Collecting accurate camera poses of training images has been shown to benefit the learning of 3D-aware generative adversarial networks (GANs), yet it can be quite expensive in practice. This work targets learning 3D-aware GANs from unposed images, for which we propose to perform on-the-fly pose estimation of training images with a learned template feature field (TEFF). Concretely, in addition to a generative radiance field as in previous approaches, we ask the generator to also learn a field from 2D semantic features while sharing the density from the radiance field. Such a framework allows us to acquire a canonical 3D feature template leveraging the dataset mean discovered by the generative model, and to further efficiently estimate the pose parameters on real data. Experimental results on various challenging datasets demonstrate the superiority of our approach over state-of-the-art alternatives from both the qualitative and the quantitative perspectives. Code and models will be made public.
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
Basile Van Hoorick · Rundi Wu · Ege Ozguroglu · Kyle Sargent · Ruoshi Liu · Pavel Tokmakov · Achal Dave · Changxi Zheng · Carl Vondrick
Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups and significantly restricting their utility in the wild as well as in embodied AI applications. In this paper, we propose GCD, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained only on synthetic multi-view video data, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality. Project webpage: https://gcd.cs.columbia.edu/
Watch Your Steps: Local Image and Scene Editing by Text Instructions
Ashkan Mirzaei · Tristan T Aumentado-Armstrong · Marcus A Brubaker · Jonathan Kelly · Alex Levinshtein · Konstantinos Derpanis · Igor Gilitschenski
The success of denoising diffusion models in generating and editing images has sparked interest in using diffusion models for editing 3D scenes represented via neural radiance fields (NeRFs). However, current 3D editing methods lack a way to both pinpoint the edit location and limit changes to the desired volumetric region. Consequently, these methods often over-edit, altering irrelevant parts of the scene. We introduce a new task, 3D edit localization, to automatically identify the relevant region for an editing task and restrict the edit accordingly. To achieve this goal, we initially tackle 2D edit localization and then lift it to multiple views to address the 3D localization challenge. For 2D localization, we leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. We refer to this discrepancy as the relevance map. The relevance map conveys the importance of changing each pixel to achieve an edit and guides downstream modifications, ensuring that pixels irrelevant to the edit remain unchanged. With the relevance maps of multi-view posed images, we can define the relevance field, the 3D region within which modifications should be made. This enables us to improve the quality of text-guided 3D NeRF scene editing by performing iterative updates on the training views, guided by renders from the relevance field. Our method achieves state-of-the-art performance on both NeRF and image editing tasks. We will make the code available.
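A minimal sketch of a relevance map computed as the discrepancy between two diffusion predictions (illustrative only; the arrays below are toy stand-ins for IP2P noise predictions with and without the instruction, and the normalization and threshold are assumptions):

```python
import numpy as np

def relevance_map(pred_with_instruction, pred_without, threshold=0.3):
    """Per-pixel edit relevance: channel-averaged, min-max normalized discrepancy between
    two diffusion predictions; values above the threshold mark the region to edit."""
    diff = np.abs(pred_with_instruction - pred_without).mean(axis=0)     # (H, W)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return diff, diff > threshold

# toy stand-ins: the two predictions disagree only inside a square region
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 64, 64))
edited = base.copy()
edited[:, 20:40, 20:40] += rng.normal(scale=2.0, size=(4, 20, 20))
rel, mask = relevance_map(edited, base)
print(mask[20:40, 20:40].mean(), mask[:20, :].mean())  # high inside the edited square, zero outside
```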
Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
Antoine Guedon · Vincent Lepetit
We propose Gaussian Frosting, a novel mesh-based representation for high-quality rendering and editing of complex 3D effects in real time. Our approach builds on the recent 3D Gaussian Splatting framework, which optimizes a set of 3D Gaussians to approximate a radiance field from images. We propose first extracting a base mesh from the Gaussians during optimization, then building and refining an adaptive layer of Gaussians with a variable thickness around the mesh to better capture the fine details and volumetric effects near the surface, such as hair or grass. We call this layer Gaussian Frosting, as it resembles a coating of frosting on a cake: the fuzzier the material, the thicker the frosting. We also introduce a parameterization of the Gaussians that keeps them inside the frosting layer and automatically adjusts their parameters when deforming, rescaling, editing or animating the mesh. Our representation allows for efficient rendering using Gaussian splatting, as well as editing and animation by modifying the base mesh. We demonstrate the effectiveness of our method on various synthetic and real scenes, and show that it outperforms existing surface-based approaches. We will release our code and a web-based viewer as additional contributions.
Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction
Lin Zhu · Yunlong Zheng · Yijun Zhang · Xiao Wang · Lizhi Wang · Hua Huang
Event-based video reconstruction has garnered increasing attention due to its advantages, such as high dynamic range and rapid motion capture capabilities. However, current methods often prioritize the extraction of temporal information from continuous event flow, leading to an overemphasis on low-frequency texture features in the scene, resulting in over-smoothing and blurry artifacts. Addressing this challenge necessitates the integration of conditional information, encompassing temporal features, low-frequency texture, and high-frequency events, to guide the Denoising Diffusion Probabilistic Model (DDPM) in producing accurate and natural outputs. To tackle this issue, we introduce a novel approach, the Temporal Residual Guided Diffusion Framework, which effectively leverages both temporal and frequency-based event priors. Our framework incorporates three key conditioning modules: a pre-trained low-frequency intensity estimation module, a temporal recurrent encoder module, and an attention-based high-frequency prior enhancement module. In order to capture temporal scene variations from the events at the current moment, we employ a temporal-domain residual image as the target for the diffusion model. Through the combination of these three conditioning paths and the temporal residual framework, our framework excels in reconstructing high-quality videos from event flow, mitigating issues such as artifacts and over-smoothing commonly observed in previous approaches. Extensive experiments conducted on multiple benchmark datasets validate the superior performance of our framework compared to prior event-based reconstruction methods. Our code will be released upon acceptance.
ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
Fu-Yun Wang · Zhaoyang Huang · Qiang Ma · Guanglu Song · Xudong LU · Weikang Bian · Yijin Li · Yu Liu · Hongsheng LI
Although video generation has made great progress in capacity and controllability and is gaining increasing attention, currently available video generation models have made only minimal progress in the length of video they can generate. Due to the lack of well-annotated long video data, high training/inference cost, and flaws in model designs, current video generation models can only generate videos of 2 to 4 seconds, greatly limiting their applications and the creativity of users. We present ZoLA, a zero-shot method for creative long animation generation with short video diffusion models and even with short video consistency models (a new family of generative models known for fast generation with top-performing quality). In addition to extending generation to long animations (dozens of seconds), ZoLA, as a zero-shot method, can be easily combined with existing community adapters (developed only for image or short video models) for more innovative generation results, including control-guided animation generation/editing, motion customization/alteration, and multi-prompt conditioned animation generation. Importantly, all of this can be done with a commonly affordable GPU (12 GB for 32-second animations) and modest inference time (90 seconds for denoising 32-second animations with consistency models). Experiments validate the effectiveness of ZoLA, showing great potential for creative long animation generation.
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Jinbo Xing · Menghan Xia · Yong Zhang · Haoxin Chen · Wangbo Yu · Hanyuan Liu · Gongye Liu · Xintao Wang · Ying Shan · Tien-Tsin Wong
Animating a still image offers an engaging visual experience. Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g., clouds and fluid) or domain-specific motions (e.g., human hair or body motions), which limits their applicability to more general visual content. To overcome this limitation, we explore the synthesis of dynamic content for open-domain images, converting them into animated videos. The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance. Given an image, we first project it into a text-aligned rich context representation space using a query transformer, which helps the video model digest the image content in a compatible fashion. However, some visual details are still hard to preserve in the resulting videos. To supply more precise image information, we further feed the full image to the diffusion model by concatenating it with the initial noise. Experimental results show that our proposed method can produce visually convincing and more logical and natural motions, as well as higher conformity to the input image. Comparative evaluation demonstrates the notable superiority of our approach over existing competitors. The source code will be released upon publication.
Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
Zhihang Zhong · Gurunandan Krishnan · Xiao Sun · Yu Qiao · Sizhuo Ma · Jian Wang
Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep t ("time indexing"), which struggles to predict precise object movements. Given two images of a baseball, there are infinitely many possible trajectories: accelerating or decelerating, straight or curved. This often results in blurry frames as the method averages out these possibilities. Instead of forcing the network to learn this complicated time-to-location mapping implicitly together with predicting the frames, we provide the network with an explicit hint on how far the object has traveled between the start and end frames, a novel approach termed "distance indexing". This method offers a clearer learning goal for models, reducing the uncertainty tied to object speeds. We further observed that, even with this extra guidance, objects can still be blurry, especially when they are equally far from both input frames (i.e., halfway in between), due to the directional ambiguity in long-range motion. To solve this, we propose an iterative reference-based estimation strategy that breaks down a long-range prediction into several short-range steps. When integrating our plug-and-play strategies into state-of-the-art learning-based models, they exhibit markedly sharper outputs and superior perceptual quality in arbitrary-time interpolation, using a uniform distance indexing map in the same format as time indexing. Additionally, distance indexing can be specified pixel-wise, which enables temporal manipulation of each object independently, offering a novel tool for video editing tasks like re-timing.
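A minimal numerical illustration of why time indexing is ambiguous while distance indexing is not (toy 1D example, not the authors' code; the two trajectories and the map shape are assumptions):

```python
import numpy as np

x0, x1 = 0.0, 8.0   # object position in the two input frames
t = 0.5             # "time indexing": ask for the frame halfway in time

# two equally plausible trajectories sharing the same endpoints
pos_constant_velocity = x0 + (x1 - x0) * t          # uniform motion
pos_accelerating = x0 + (x1 - x0) * t ** 2          # starts from rest, accelerates
print(pos_constant_velocity, pos_accelerating)      # 4.0 vs 2.0: same t, different positions

# "distance indexing": ask for the point 50% of the way along the path instead,
# which is the same location regardless of the (unknown) velocity profile
d = 0.5
print(x0 + (x1 - x0) * d)                           # 4.0 for both trajectories

# a uniform distance-index map has the same per-pixel format as a time-index map
H, W = 4, 6
distance_map = np.full((H, W), d, dtype=np.float32)
print(distance_map.shape)                           # (4, 6)
```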
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
Jiazhi Guan · Zhiliang Xu · Hang Zhou · Kaisiyuan Wang · Shengyi He · Zhanwang Zhang · Borong Liang · Haocheng Feng · Errui Ding · Jingtuo Liu · Jingdong Wang · Youjian Zhao · Ziwei Liu
Lip-syncing videos with given audio is the foundation for various applications, including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework, ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio in real time, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
Video Editing via Factorized Diffusion Distillation
Uriel Singer · Amit Zohar · Yuval Kirstain · Shelly Sheynin · Adam Polyak · Devi Parikh · Yaniv Taigman
We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.
Efficient Neural Video Representation with Temporally Coherent Modulation
Seungjun Shin · Suji Kim · Dokwan Oh
Implicit neural representations (INRs) have found successful applications across diverse domains. To employ INRs in real-life applications, it is important to speed up training. In the field of INRs for video, the state-of-the-art approach [26] employs grid-type trainable parameters and achieves a faster encoding speed than its predecessors [5]. Despite its time efficiency, using grid-type parameters without considering the dynamic nature of videos results in performance limitations. To learn video representations rapidly and effectively, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that captures dynamic characteristics by decomposing the spatio-temporal 3D video data into a set of 2D grids. Through this mapping, our framework can process temporally corresponding pixels at once, resulting in a more than 3× faster video encoding speed at a reasonable video quality. It also achieves an average improvement of 1.54 dB/0.019 in PSNR/LPIPS on the UVG dataset (even with 10% fewer parameters) and an average improvement of 1.84 dB/0.013 in PSNR/LPIPS on the MCL-JCV dataset, compared to previous work. Extending this to compression tasks, we demonstrate performance comparable to video compression standards (H.264, HEVC) and recent INR approaches for video compression. Additionally, we perform extensive experiments demonstrating the superior performance of our algorithm across diverse tasks, encompassing super-resolution, frame interpolation and video inpainting.
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
Vikram Voleti · Chun-Han Yao · Mark Boss · Adam Letts · David Pankratz · Dmitrii Tochilkin · Christian Laforte · Robin Rombach · Varun Jampani
We present Stable Video 3D (SV3D) --- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation proposes techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affecting the performance of 3D object generation. In this work, we propose SV3D, which adapts an image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of video models, while further adding explicit camera control for NVS. We also propose improved 3D optimization techniques that use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental results on multiple datasets with 2D and 3D metrics as well as a user study demonstrate SV3D's state-of-the-art performance on NVS as well as 3D reconstruction compared to prior works.
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
Bolin Lai · Xiaoliang Dai · Lawrence Chen · Guan Pang · James Rehg · Miao Liu
Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize the action frame conditioned on a user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric action datasets lack detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models are sub-optimal at controlling the state transition of an action in egocentric image pixel space because of the domain gap. To this end, we propose to Learn EGOcentric (LEGO) action frame generation via visual instruction tuning. First, we introduce a prompt enhancement scheme to generate enriched action descriptions from a visual large language model (VLLM) by visual instruction tuning. Then we propose a novel method to leverage image and text embeddings from the VLLM as additional conditioning to improve the performance of a diffusion model. We validate our model on two egocentric datasets -- Ego4D and Epic-Kitchens. Our experiments show prominent improvement over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights into our method.
NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction
Dong Wei · Huaijiang Sun · Xiaoning Sun · Shengxiang Hu
Predicting accurate future human poses from historically observed motions remains a challenging task due to the spatial-temporal complexity and continuity of motions. Previous historical-value methods typically interpret motion as discrete consecutive frames, which neglects the continuous temporal dynamics and impedes the capability of handling incomplete observations (with missing values). In this paper, we propose an implicit Neural Representation method for human Motion prediction, dubbed NeRMo, which represents the motion as a continuous function parameterized by a neural network. The core idea is to design a new coordinate system where NeRMo takes joint-time index as input and outputs the corresponding 3D skeleton position. This separate and flexible treatment of space and time allows NeRMo to combine the following advantages. It extrapolates at arbitrary body joints and temporal locations; it can learn from both complete and incomplete observed past motions; it provides a unified framework for repairing missing values and forecasting future poses using a single trained model. In addition, we show that NeRMo exhibits compatibility with meta-learning methods, enabling it to effectively generalize to unseen time steps. Extensive experiments conducted on classical benchmarks have confirmed the superior prediction performance of our joint-time index method compared to existing historical-value baselines.
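A minimal sketch of an implicit motion representation mapping a (joint index, continuous time) query to a 3D position (illustrative only; layer sizes, the joint embedding, and the absence of positional encoding are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class MotionINR(nn.Module):
    """Implicit motion representation: a (joint index, continuous time) query -> 3D position."""
    def __init__(self, num_joints=22, joint_dim=16, hidden=128):
        super().__init__()
        self.joint_embed = nn.Embedding(num_joints, joint_dim)
        self.mlp = nn.Sequential(
            nn.Linear(joint_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, joint_idx, t):
        # joint_idx: (N,) long indices; t: (N, 1) continuous times (observed, missing, or future)
        return self.mlp(torch.cat([self.joint_embed(joint_idx), t], dim=-1))

model = MotionINR()
joints = torch.arange(22).repeat(5)                                       # every joint ...
times = torch.linspace(0.0, 1.2, 5).repeat_interleave(22).unsqueeze(1)    # ... at arbitrary times
print(model(joints, times).shape)                                         # torch.Size([110, 3])
```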
UGG: Unified Generative Grasping
Jiaxin Lu · Hao Kang · Haoxiang Li · Bo Liu · Yiding Yang · Qixing Huang · Gang Hua
Dexterous grasping aims to produce diverse grasping postures with a high grasping success rate. Regression-based methods that directly predict grasping parameters given the object may achieve a high success rate but often lack diversity. Generation-based methods that generate grasping postures conditioned on the object can often produce diverse grasps, but they are insufficient for high grasping success due to a lack of discriminative information. To mitigate this, we introduce a unified diffusion-based dexterous grasp generation model, dubbed UGG, which operates within the object point cloud and hand parameter spaces. Our all-transformer architecture unifies information from the object, the hand, and the contacts, introducing a novel representation of contact points for improved contact modeling. The flexibility and quality of our model enable the integration of a lightweight discriminator, benefiting from simulated discriminative data, which pushes for a high success rate while preserving high diversity. Beyond grasp generation, our model can also generate objects based on hand information, offering valuable insights into object design and into how the generative model perceives objects. Our model achieves state-of-the-art dexterous grasping on the large-scale DexGraspNet dataset while facilitating human-centric object design, marking a significant advancement in dexterous grasping research.
LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment
Yiming Ren · Xiao Han · Yichen Yao · Xiaoxiao Long · Yujing Sun · Yuexin Ma
LiDAR-based human motion capture has garnered significant interest in recent years for its practicability in large-scale and unconstrained environments. However, most methods rely on cleanly segmented human point clouds as input; when faced with noisy data, the accuracy and smoothness of their motion results are compromised, rendering them unsuitable for practical applications. To address these limitations and enhance the robustness and precision of motion capture under noise interference, we introduce LiveHPS++, an innovative and effective solution based on a single LiDAR system. Benefiting from three meticulously designed modules, our method can learn dynamic and kinematic features from human movements and further enables the precise capture of coherent human motions in open settings, making it highly applicable to real-world scenarios. Through extensive experiments, LiveHPS++ has proven to significantly surpass existing state-of-the-art methods across various datasets, establishing a new benchmark in the field.
Controllable Human-Object Interaction Synthesis
Jiaman Li · Alexander Clegg · Roozbeh Mottaghi · Jiajun Wu · Xavier Puig · Karen Liu
Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints. Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints; it also cannot ensure the realism of interactions that require precise hand-object and human-floor contact. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the matching between generated object motion and input object waypoints; we also design guidance terms to enforce contact constraints during the sampling process of the trained diffusion model. We demonstrate that our learned interaction module can synthesize realistic human-object interactions, adhering to provided textual descriptions and sparse waypoint conditions. Additionally, our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
Hyeonwoo Kim · Sookwan Han · Patrick Kwon · Hanbyul Joo
Understanding the inherent human knowledge in interacting with a given environment (e.g., affordance) is essential for improving AI to better assist humans. While existing approaches primarily focus on human-object contacts during interactions, such affordance representation cannot fully address other important aspects of human-object interactions (HOIs), i.e., patterns of relative positions and orientations. In this paper, we introduce a novel affordance representation, named Comprehensive Affordance (ComA). Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes, capturing plausible patterns of contact, relative orientations, and spatial relationships. To construct the distribution, we present a novel pipeline that synthesizes diverse and realistic 3D HOI samples given any 3D target object mesh. The pipeline leverages a pre-trained 2D inpainting diffusion model to generate HOI images from object renderings and lifts them into 3D. To avoid the generation of false affordances, we propose a new inpainting framework, Adaptive Mask Inpainting. Since ComA is built on synthetic samples, it can extend to any object in an unbounded manner. Through extensive experiments, we demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance. Importantly, we also showcase the potential of ComA to reconstruct human-object interactions in 3D through an optimization framework, highlighting its advantage in incorporating both contact and non-contact properties.
Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation
Duo Peng · Zhengbo Zhang · Ping Hu · Qiuhong Ke · David Yau · Jun Liu
Category-Agnostic Pose Estimation (CAPE) aims to detect keypoints of an arbitrary unseen category in images, based on several provided examples of that category. This is a challenging task, as the limited data of unseen categories makes it difficult for models to generalize effectively. To address this challenge, previous methods typically train models on a set of predefined base categories with extensive annotations. In this work, we propose to harness rich knowledge in the off-the-shelf text-to-image diffusion model to effectively address CAPE, without training on carefully prepared base categories. To this end, we propose a Prompt Pose Matching (PPM) framework, which learns pseudo prompts corresponding to the keypoints in the provided few-shot examples via the text-to-image diffusion model. These learned pseudo prompts capture semantic information of keypoints, which can then be used to locate the same type of keypoints from images. We also design a Category-shared Prompt Training (CPT) scheme, to further boost our PPM's performance. Extensive experiments demonstrate the efficacy of our approach.
POET: Prompt Offset Tuning for Continual Human Action Adaptation
Prachi Garg · Joseph K J · Vineeth N Balasubramanian · Necati Cihan Camgoz · Chengde Wan · Kenrick Kin · Weiguang Si · Shugao Ma · Fernando de la Torre
As extended reality (XR) is redefining how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes. The goal of our research is to provide users and developers with the capability to personalize their experience by continually adding new action classes to their device models. Importantly, a user should be able to add new classes in a low-shot and efficient manner, and this process should not require storing or replaying any of the user's sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition. Towards this end, we propose POET: Prompt Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities, they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly lightweight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt selection approach, and we are the first to apply this prompting technique to Graph Neural Networks. To evaluate our method, we introduce two new benchmarks: (i) the NTU RGB+D dataset for activity recognition and (ii) the SHREC-2017 dataset for hand gesture recognition. The code will be released upon acceptance.
NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
Zhongqun Zhang · Hengfei Wang · Ziwei Yu · Yihua Cheng · Angela Yao · Hyung Jin Chang
Modeling the physical contacts between the hand and object is standard practice for refining inaccurate hand poses and generating novel human grasps in 3D hand-object reconstruction. However, existing methods rely on geometric constraints that cannot be specified or controlled. This paper introduces a novel task of controllable 3D hand-object contact modeling with natural language descriptions. Challenges include i) the complexity of cross-modal modeling from language to contact, and ii) a lack of descriptive text for contact patterns. To address these issues, we propose NL2Contact, a model that generates controllable contacts by leveraging staged diffusion models. Provided with a language description of the hand and contact, NL2Contact generates realistic and faithful 3D hand-object contacts. To train the model, we build ContactDescribe, the first dataset with hand-centered contact descriptions. It contains multi-level and diverse descriptions generated by large language models, based on carefully designed prompts (e.g., grasp action, grasp type, contact location, free finger status). We show applications of our model to grasp pose optimization and novel human grasp generation, both based on a textual contact description.
AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
Junho Park · Kyeongbo Kong · Suk-Ju Kang
Recently, a significant amount of research has been conducted on 3D hand reconstruction for use in various forms of human-computer interaction. However, 3D hand reconstruction in the wild is challenging due to the extreme lack of in-the-wild 3D hand datasets. In particular, when hands are in complex poses such as interacting hands, problems like appearance similarity, self-occlusion and depth ambiguity make it even more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate numerous and varied in-the-wild hand images well-aligned with 3D hand labels, we can acquire a new 3D hand dataset and relieve the domain gap between indoor and outdoor scenes. Our method needs four easy-to-use modalities (i.e., an RGB image, a hand mesh image from a 3D label, a bounding box, and a text prompt). These modalities are embedded into the latent space by the encoding phase. Then, through the text attention stage, hand-related tokens from the given text prompt are attended to highlight hand-related regions of the latent embedding. After the highlighted embedding is fed to the visual attention stage, hand-related regions in the embedding are attended by conditioning on global and local hand mesh images with the diffusion-based pipeline. In the decoding phase, the final feature is decoded into new hand images, which are well-aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieves state-of-the-art performance among text-to-hand image generation models, and the performance of 3D hand mesh reconstruction is improved by additionally training with hand images generated by AttentionHand.
Sapiens: Foundation for Human Vision Models
Rawal Khirodkar · Timur Bagautdinov · Julieta Martinez · Zhaoen Su · Austin T James · Peter Selednik · Stuart Anderson · Shunsuke Saito
We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning foundational models pretrained on over 300 million in-the-wild human images. Our key insight is that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. We demonstrate that resulting foundational models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability -- model performance across tasks significantly improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing complex baselines across various human-centric benchmarks. Specifically, we achieve significant improvements over the prior state-of-the-art on COCO-Wholebody (pose) by 7.9 mAP, CIHP (part-seg) by 1.3 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.
KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding
Zhihao Xu · Shengjie Gong · Jiapeng Tang · Lingyu Liang · Yining Huang · Haojie Li · Shuangping Huang
We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings. Despite recent advancements in data-driven techniques, accurately mapping between audio signals and 3D facial meshes remains challenging. Direct regression of the entire sequence often leads to over-smoothed results due to the ill-posed nature of the problem. To this end, we propose a progressive learning mechanism that generates 3D facial animations by introducing key motion capture to decrease cross-modal mapping uncertainty and learning complexity. Concretely, our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion. The former identifies key motions and learns the associated 3D facial expressions, ensuring accurate lip-speech synchronization. The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency. Extensive experimental comparisons against existing state-of-the-art methods demonstrate the superiority of our approach in generating more vivid and consistent talking face animations. Consistent enhancements in results through the integration of our proposed scheme with existing methods underscore the efficacy of our approach.
Modeling and Driving Human Body Soundfields through Acoustic Primitives
Chao Huang · Dejan Markovic · Chenliang Xu · Alexander Richard
While rendering and animation of photorealistic 3D human body models have matured and reached an impressive quality over the past years, modeling the spatial audio associated with such full body models has been largely ignored so far. In this work, we present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body, including speech, footsteps, hand-body interactions, and others. Given a basic audio-visual representation of the body in the form of 3D body pose and audio from a head-mounted microphone, we demonstrate that we can render the full acoustic scene at any point in 3D space efficiently and accurately. To enable near-field and real-time rendering of sound, we borrow the idea of volumetric primitives from graphical neural rendering and transfer them into the acoustic domain. Our acoustic primitives result in an order of magnitude smaller soundfield representations and overcome deficiencies in near-field rendering compared to previous approaches.
Let the Avatar Talk using Texts without Paired Training Data
Xiuzhe Wu · Yang-Tian Sun · Handi Chen · Hang Zhou · Jingdong Wang · Zhengzhe Liu · Qi Xiaojuan
This paper introduces text-driven talking avatar generation, a new task that uses text to instruct both the generation and animation of an avatar. One significant obstacle in this task is the absence of paired text and talking avatar data for model training, limiting data-driven methodologies. To this end, we present a zero-shot approach that adapts an existing 3D-aware image generation model, trained on a large-scale image dataset for high-quality avatar creation, to align with textual instructions and be animated to produce talking avatars, eliminating the need for paired text and talking avatar data. The core of our approach lies in the seamless integration of a 3D-aware image generation model (i.e., EG3D), the explicit 3DMM model, and a newly developed self-supervised inpainting technique, to create and animate the avatar and generate a temporally consistent talking video. Thorough evaluations demonstrate the effectiveness of our proposed approach in generating realistic avatars based on textual descriptions and empowering avatars to express user-specified text. Notably, our approach is highly controllable and can generate rich expressions and head poses.
CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images
Jisu Shin · Junmyeong Lee · Seongmin Lee · Min-Gyu Park · Jumi Kang · Ju Hong Yoon · HAE-GON JEON
We present a novel framework for reconstructing animatable human avatars from multiple images, termed CanonicalFusion. Our central concept involves integrating individual reconstruction results into the canonical space. To be specific, we first predict Linear Blend Skinning (LBS) weight maps and depth maps using a shared-encoder-dual-decoder network, enabling direct canonicalization of the 3D mesh from the predicted depth maps. Here, instead of predicting high-dimensional skinning weights, we infer compressed skinning weights, i.e., a 3-dimensional vector, with the aid of pre-trained MLP networks. We also introduce a forward skinning-based differentiable rendering scheme to merge the reconstruction results from multiple images. This scheme refines the initial mesh by reposing the canonical mesh via forward skinning and by minimizing photometric and geometric errors between the rendered and the predicted results. Our optimization scheme considers the position and color of vertices as well as the joint angles for each image, thereby mitigating the negative effects of pose errors. We conduct extensive experiments to demonstrate the effectiveness of our method and compare CanonicalFusion with state-of-the-art methods. Lastly, we provide easy-to-understand insight into the implementation of our work by submitting our source code as supplementary material.
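A minimal sketch of the forward linear-blend-skinning step used to repose a canonical mesh (illustrative only; the compressed skinning codes and the shared-encoder network are not shown, and the SMPL-like sizes are assumptions):

```python
import numpy as np

def forward_lbs(canonical_verts, skin_weights, joint_transforms):
    """Repose canonical vertices with linear blend skinning.
    canonical_verts: (V, 3), skin_weights: (V, J), joint_transforms: (J, 4, 4)."""
    V = canonical_verts.shape[0]
    homo = np.concatenate([canonical_verts, np.ones((V, 1))], axis=1)   # homogeneous coords
    blended = np.einsum('vj,jab->vab', skin_weights, joint_transforms)  # per-vertex 4x4
    posed = np.einsum('vab,vb->va', blended, homo)
    return posed[:, :3]

rng = np.random.default_rng(0)
V, J = 6890, 24                                  # SMPL-like sizes, for illustration
verts = rng.normal(size=(V, 3))
weights = rng.random(size=(V, J))
weights /= weights.sum(axis=1, keepdims=True)    # skinning weights sum to one per vertex
identity_pose = np.tile(np.eye(4), (J, 1, 1))    # identity transforms leave the mesh unchanged
print(np.allclose(forward_lbs(verts, weights, identity_pose), verts))   # True
```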
Relightable Neural Actor with Intrinsic Decomposition and Pose Control
Diogo Carbonera Luvizon · Vladislav Golyanik · Adam Kortylewski · Marc Habermann · Christian Theobalt
Creating a controllable and relightable digital avatar from multi-view video with fixed illumination is a very challenging problem since humans are highly articulated, creating pose-dependent appearance effects, and skin as well as clothing require space-varying BRDF modeling. Existing works on creating animatable avatars either do not focus on relighting at all, require controlled illumination setups, or try to recover a relightable avatar from very low-cost setups, i.e., a single RGB video, at the cost of severely limited result quality, e.g., shadows not even being modeled. To address this, we propose Relightable Neural Actor, a new video-based method for learning a pose-driven neural human model that can be relighted, allows appearance editing, and models pose-dependent effects such as wrinkles and self-shadows. Importantly, for training, our method solely requires a multi-view recording of the human under a known, but static lighting condition. To tackle this challenging problem, we leverage an implicit geometry representation of the actor with a drivable density field that models pose-dependent deformations and derive a dynamic mapping between 3D and UV spaces, where normal, visibility, and materials are effectively encoded. To evaluate our approach in real-world scenarios, we collect a new dataset with four identities recorded under different light conditions, indoors and outdoors, providing the first benchmark of its kind for human relighting, and demonstrating state-of-the-art relighting results for novel human poses.
3R-INN: How to be climate friendly while consuming/delivering videos?
ZOUBIDA AMEUR · Claire-Helene Demarty · Olivier LE MEUR · Daniel Menard
The consumption of a video requires a considerable amount of energy during the various stages of its life cycle. With a billion hours of video consumed daily, this contributes significantly to greenhouse gas emissions. Therefore, reducing the end-to-end carbon footprint of the video chain, while preserving the quality of experience at the user side, is of high importance. To contribute in an impactful manner, we propose 3R-INN, a single lightweight invertible network that does three tasks at once: given a high-resolution grainy image, it Rescales it to a lower resolution, Removes film grain and Reduces its power consumption when displayed. Providing such minimum viable quality content contributes to reducing energy consumption during encoding, transmission, decoding and display. 3R-INN also offers the possibility to restore either the high-resolution grainy original image or a grain-free version, thanks to its invertibility and the disentanglement of high frequencies, without transmitting auxiliary data. Experiments show that, while enabling significant energy savings for encoding (78%), decoding (77%) and rendering (5% to 20%), 3R-INN outperforms state-of-the-art film grain synthesis and energy-aware methods and achieves state-of-the-art performance on the rescaling task on different test sets.
Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement
Kun Zhou · Xinyu Lin · Wenbo Li · Xiaogang Xu · Yuanhao Cai · Zhonghang Liu · XIAOGUANG HAN · Jiangbo Lu
Previous low-light image enhancement (LLIE) approaches, while employing frequency decomposition techniques to address the intertwined challenges of low frequencies (e.g., illumination recovery) and high frequencies (e.g., noise reduction), primarily focused on developing dedicated and complex networks to achieve improved performance. In contrast, we reveal that an advanced disentanglement paradigm is sufficient to consistently enhance state-of-the-art methods with minimal computational overhead. Leveraging the image Laplace decomposition scheme, we propose a novel low-frequency consistency method, facilitating improved frequency disentanglement optimization. Our method seamlessly integrates with various models, such as CNNs, Transformers, and flow-based and diffusion models, demonstrating remarkable adaptability. Noteworthy improvements are showcased across five popular benchmarks, with PSNR gains of up to 7.68 dB achieved for six state-of-the-art models. Impressively, our approach maintains efficiency with only 88K extra parameters, setting a new standard in the challenging realm of low-light image enhancement.
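A minimal sketch of a two-band low/high frequency split with a consistency loss on the low-frequency band only (illustrative only; a box-filter pyramid stands in for the paper's Laplace decomposition, and the L1 loss form is an assumption):

```python
import numpy as np

def low_high_split(img, factor=4):
    """Two-band split: a blurred/downsampled low-frequency layer (box filter) plus the
    high-frequency residual. img is (H, W) with H and W divisible by factor."""
    H, W = img.shape
    low_small = img.reshape(H // factor, factor, W // factor, factor).mean(axis=(1, 3))
    low = np.kron(low_small, np.ones((factor, factor)))    # nearest-neighbour upsample
    return low, img - low

def low_frequency_consistency(enhanced, reference, factor=4):
    """L1 consistency on the low-frequency (illumination-like) bands only, leaving the
    high-frequency band free for denoising and detail restoration."""
    low_e, _ = low_high_split(enhanced, factor)
    low_r, _ = low_high_split(reference, factor)
    return np.abs(low_e - low_r).mean()

rng = np.random.default_rng(0)
reference = rng.random((64, 64))
enhanced = reference + 0.05 * rng.normal(size=(64, 64))   # noisy but illumination-consistent
print(round(float(low_frequency_consistency(enhanced, reference)), 4))
```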
Intrinsic Single-Image HDR Reconstruction
Sebastian Dille · Chris Careaga · Yagiz Aksoy
Recovering the high dynamic range of a natural scene from a single low dynamic range image is an important task with many applications in photography, image editing, and photo-realistic rendering. The wide range of naturally occurring illumination conditions makes their reconstruction from clipped and compressed RGB values challenging. We present a novel approach for single-image HDR reconstruction in the intrinsic domain. By decomposing the image into diffuse reflectance and illumination layers, we can divide the reconstruction problem into two simpler subtasks that we address individually. Our approach generates faithful illumination levels by learning to reconstruct HDR shading as well as recovering clipped color information by separately reconstructing the albedo.
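A minimal sketch of how such an intrinsic-domain split could be recombined into an HDR image, assuming a log-space shading branch and an albedo branch. The placeholder networks and the exponential recombination are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class IntrinsicHDRSketch(nn.Module):
    """Illustrative two-branch pipeline: LDR -> (albedo, HDR shading) -> HDR image."""
    def __init__(self, shading_net: nn.Module, albedo_net: nn.Module):
        super().__init__()
        self.shading_net = shading_net   # predicts log-HDR shading from the LDR input (assumption)
        self.albedo_net = albedo_net     # recovers clipped colors in the reflectance layer

    def forward(self, ldr: torch.Tensor) -> torch.Tensor:
        log_shading = self.shading_net(ldr)      # (B, 1, H, W)
        albedo = self.albedo_net(ldr)            # (B, 3, H, W)
        return albedo * torch.exp(log_shading)   # intrinsic recombination: HDR = reflectance * shading

# Trivial stand-in networks just to exercise the wiring.
hdr = IntrinsicHDRSketch(nn.Conv2d(3, 1, 1), nn.Conv2d(3, 3, 1))(torch.rand(1, 3, 32, 32))
```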
Domain Reduction Strategy for Non-Line-of-Sight Imaging
Hyunbo Shim · In Cho · Daekyu Kwon · Seon Joo Kim
This paper presents a novel optimization-based method for non-line-of-sight (NLOS) imaging that aims to reconstruct hidden scenes under general setups with significantly reduced reconstruction time. In NLOS imaging, the visible surfaces of the target objects are notably sparse. To mitigate unnecessary computations arising from empty regions, we design our method to render the transients through partial propagations from a continuously sampled set of points from the hidden space. Our method is capable of accurately and efficiently modeling the view-dependent reflectance using surface normals, which enables us to obtain surface geometry as well as albedo. In this pipeline, we propose a novel domain reduction strategy to eliminate superfluous computations in empty regions. During the optimization process, our domain reduction procedure periodically prunes the empty regions from our sampling domain in a coarse-to-fine manner, leading to substantial improvement in efficiency. We demonstrate the effectiveness of our method in various NLOS scenarios with sparse scanning patterns. Experiments conducted on both synthetic and real-world data clearly support the efficacy and efficiency of the proposed method in general NLOS scenarios.
Synthesizing Time-varying BRDFs via Latent Space
Takuto Narumoto · Hiroaki Santo · Fumio Okura
This paper introduces a method for synthesizing time-varying bidirectional reflectance distribution functions (BRDFs) by applying learned temporal changes to static BRDFs. Achieving realistic and natural changes in material appearance over time is crucial in computer graphics and virtual reality. Existing methods employ a parametric BRDF model, and the temporal changes in BRDFs are modeled by polynomial functions that represent the transitions of the BRDF parameters. However, the limited representational capabilities of both the parametric BRDF model and the polynomial temporal model restrict the fidelity of the appearance reproduction. In this paper, to overcome this limitation, we introduce a neural embedding for BRDFs and propose a neural temporal model that represents the temporal changes of BRDFs in the latent space, which allows flexible representations of BRDFs and temporal changes. The experiments using synthetic and real-world datasets demonstrate that the flexibility of the proposed approach achieves a faithful synthesis of temporal changes in material appearance.
Parameterization-driven Neural Surface Reconstruction for Object-oriented Editing in Neural Rendering
Baixin Xu · Jiangbei Hu · Fei Hou · Kwan-Yee Lin · Wayne Wu · Chen Qian · Ying He
The growing capabilities of neural rendering have increased the demand for new techniques that enable intuitive editing of 3D objects, particularly when they are represented as neural implicit surfaces. In this paper, we present a novel neural algorithm for parameterizing neural implicit surfaces to simple parametric domains, such as spheres and polycubes, thereby facilitating visualization and various editing tasks. Specifically, for polycubes, our method allows the user to specify the desired number of cubes for the domain and then learns a cube configuration that closely resembles the geometry of the target 3D object. It then computes a bi-directional deformation between the object and the domain, utilizing a forward mapping from points on the object's zero level set to the parametric domain, followed by an inverse deformation for backward mapping. To ensure the map is nearly bijective, we employ a cycle loss while optimizing the smoothness of both deformations. The quality of the computed parameterization, as assessed by angle and area distortions, is guaranteed through the use of a Laplacian regularizer and an optimized learned parametric domain. Designed for compatibility, our framework integrates seamlessly with existing neural rendering pipelines, taking multi-view images of a single object or multiple objects of similar geometries as input to reconstruct 3D geometry and compute the corresponding texture map. Our method is fully automatic and end-to-end, eliminating the need for any prior information. We also introduce a simple yet effective technique for intrinsic radiance decomposition, facilitating both view-independent material editing and view-dependent shading editing. Our method allows for the immediate rendering of edited textures through volume rendering, without the need for network re-training. We demonstrate the effectiveness of our method on images of human heads and man-made objects. We will make the source code publicly available.
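The near-bijectivity constraint mentioned above can be pictured as a simple round-trip (cycle) loss between hypothetical forward and inverse deformation MLPs; the network sizes and the random point sampling below are placeholders, not the authors' design.

```python
import torch
import torch.nn as nn

# Hypothetical forward/inverse deformation MLPs: object surface <-> parametric domain.
forward_map = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 3))
inverse_map = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 3))

def cycle_loss(surface_pts: torch.Tensor) -> torch.Tensor:
    """Round-trip consistency: points on the zero level set should map back to themselves."""
    uv = forward_map(surface_pts)     # object -> parametric domain
    back = inverse_map(uv)            # parametric domain -> object
    return (back - surface_pts).pow(2).mean()

pts = torch.rand(1024, 3)             # stand-in samples from the zero level set
loss = cycle_loss(pts)                # would be combined with smoothness / distortion terms
```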
Instant Uncertainty Calibration of NeRFs Using a Meta-Calibrator
Niki Amini-Naieni · Tomas Jakab · Andrea Vedaldi · Ronald Clark
Although Neural Radiance Fields (NeRFs) have markedly improved novel view synthesis, accurate uncertainty quantification in their image predictions remains an open problem. The prevailing methods for estimating uncertainty, including the state-of-the-art Density-aware NeRF Ensembles (DANE) [29], quantify uncertainty without calibration. This frequently leads to over- or under-confidence in image predictions, which can undermine their real-world applications. In this paper, we propose a method which, for the first time, achieves calibrated uncertainties for NeRFs. To accomplish this, we overcome a significant challenge in adapting existing calibration techniques to NeRFs: a need to hold out ground truth images from the target scene, reducing the number of images left to train the NeRF. This issue is particularly problematic in sparse-view settings, where we can operate with as few as three images. To address this, we introduce the concept of a meta-calibrator that performs uncertainty calibration for NeRFs with a single forward pass without the need for holding out any images from the target scene. Our meta-calibrator is a neural network that takes as input the NeRF images and uncalibrated uncertainty maps and outputs a scene-specific calibration curve that corrects the NeRF’s uncalibrated uncertainties. We show that the meta-calibrator can generalize on unseen scenes and achieves well-calibrated and state-of-the-art uncertainty for NeRFs, significantly beating DANE and other approaches. This opens opportunities to improve applications that rely on accurate NeRF uncertainty estimates such as next-best view planning and potentially more trustworthy image reconstruction for medical diagnosis.
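A toy sketch of the meta-calibrator idea: a small network ingests a rendered image together with its uncalibrated uncertainty map and emits a monotone, scene-specific calibration curve in a single forward pass. The layer sizes, the four-channel input, and the cumulative-softmax curve parameterization are assumptions made for illustration, not the paper's network.

```python
import torch
import torch.nn as nn

class MetaCalibratorSketch(nn.Module):
    """Maps (rendered RGB, uncalibrated uncertainty) to a monotone calibration curve."""
    def __init__(self, num_knots: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_knots)

    def forward(self, rgb: torch.Tensor, uncertainty: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, uncertainty], dim=1)        # (B, 4, H, W)
        feats = self.features(x).flatten(1)             # (B, 64)
        # Cumulative softmax guarantees a monotonically increasing curve in [0, 1].
        return torch.cumsum(torch.softmax(self.head(feats), dim=-1), dim=-1)

calib = MetaCalibratorSketch()
curve = calib(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))   # (1, 16) calibration curve
```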
GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views
Vinayak Gupta · Rongali Simhachala Venkata Girish · Mukund Varma T · Ayush Tewari · Kaushik Mitra
Neural rendering methods can achieve near-photorealistic image synthesis of scenes from posed input images. However, when the images are imperfect, e.g., captured in very low-light conditions, state-of-the-art methods fail to reconstruct high-quality 3D scenes. Recent approaches have tried to address this limitation by modeling various degradation processes in the image formation model; however, this limits them to specific image degradations. In this paper, we propose a generalizable neural rendering method that can perform high-fidelity novel view synthesis under several degradations. Our method, GAURA, is learning-based and does not require any test-time scene-specific optimization. It is trained on a synthetic dataset that includes several degradation types. GAURA outperforms state-of-the-art methods on several benchmarks for low-light enhancement, dehazing, and deraining, and performs on par with them for motion deblurring. Further, our model can be efficiently fine-tuned to any new incoming degradation using minimal data. We thus demonstrate adaptation results on two new degradations, desnowing and removing defocus blur. Code will be made available upon acceptance.
Content-Aware Radiance Fields: Aligning Model Complexity with Scene Intricacy Through Learned Bitwidth Quantization
Weihang Liu · Xue Xian Zheng · Jingyi Yu · Xin Lou
The recent popular radiance field models, exemplified by Neural Radiance Fields (NeRF), Instant-NGP and 3D Gaussian Splatting, are designed to represent 3D content by training a separate model for each individual scene. This unique characteristic of scene representation and per-scene training distinguishes radiance field models from other neural models, because complex scenes necessitate models with higher representational capacity and vice versa. In this paper, we propose content-aware radiance fields, aligning the model complexity with the scene intricacies through Adversarial Content-Aware Quantization (A-CAQ). Specifically, we make the bitwidth of parameters differentiable and trainable, tailored to the unique characteristics of specific scenes and requirements. The proposed framework has been assessed on Instant-NGP, a well-known NeRF variant, and evaluated using various datasets. Experimental results demonstrate a notable reduction in computational complexity, while preserving the requisite reconstruction and rendering quality, making it beneficial for practical deployment of radiance field models. Codes will be released soon.
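One common way to make a bitwidth trainable is to wrap the rounding steps in a straight-through estimator; the sketch below shows only that general mechanism and is not the paper's A-CAQ formulation or its adversarial objective.

```python
import torch

def quantize_learned_bitwidth(w: torch.Tensor, bitwidth: torch.Tensor) -> torch.Tensor:
    """Uniformly quantize `w` with a continuous, learnable bitwidth (generic sketch)."""
    b = bitwidth + (torch.round(bitwidth) - bitwidth).detach()   # straight-through round of the bitwidth
    levels = 2.0 ** b - 1.0
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / levels
    q = (w - w_min) / scale
    q = q + (torch.round(q) - q).detach()                        # straight-through round of the values
    return q * scale + w_min

weights = torch.randn(256, requires_grad=True)
bitwidth = torch.tensor(6.0, requires_grad=True)                 # optimized jointly with the weights
loss = quantize_learned_bitwidth(weights, bitwidth).pow(2).mean()
loss.backward()                                                  # gradients flow into `bitwidth`
```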
Collaborative Control for Geometry-Conditioned PBR Image Generation
Shimon Vainer · Mark Boss · Mathias Parger · Konstantin Kutsy · Dante De Nigris · Ciara Rowles · Nicolas Perony · Simon Donné
Current 3D content generation approaches build on diffusion models that output RGB images. Modern graphics pipelines, however, require physically-based rendering (PBR) material properties. We propose to model the PBR image distribution directly, avoiding photometric inaccuracies in RGB generation and the inherent ambiguity in extracting PBR from RGB. Existing paradigms for cross-modal fine-tuning are not suited for PBR generation due to both a lack of data and the high dimensionality of the output modalities: we overcome both challenges by retaining a frozen RGB model and tightly linking a newly trained PBR model using a novel cross-network communication paradigm. As the base RGB model is fully frozen, the proposed method does not risk catastrophic forgetting during fine-tuning and remains compatible with techniques such as IPAdapter pretrained for the base RGB model. We validate our design choices, robustness to data sparsity, and compare against existing paradigms with an extensive experimental section.
KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter
Yifan Zhan · Zhuoxiao Li · Muyao Niu · Zhihang Zhong · Shohei Nobuhara · Ko Nishino · zheng yinqiang
We introduce KFD-NeRF, a novel dynamic neural radiance field integrated with an efficient and high-quality motion reconstruction framework based on Kalman filtering. Our key idea is to model the dynamic radiance field as a dynamic system whose temporally varying states are estimated based on two sources of knowledge: observations and predictions. We introduce a novel plug-in Kalman filter guided deformation field that enables accurate deformation estimation from scene observations and predictions. We use a shallow Multi-Layer Perceptron (MLP) for observations and model the motion as locally linear to calculate predictions with motion equations. To further enhance the performance of the observation MLP, we introduce regularization in the canonical space to facilitate the network's ability to learn warping for different frames. Additionally, we employ an efficient tri-plane representation for encoding the canonical space, which has been experimentally demonstrated to converge quickly with high quality. This enables us to use a shallower observation MLP, consisting of just two layers in our implementation. We conduct experiments on synthetic and real data and compare with past dynamic NeRF methods. Our KFD-NeRF demonstrates similar or even superior rendering performance within comparable computational time and achieves state-of-the-art view synthesis performance with thorough training.
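For readers unfamiliar with the underlying filter, here is a minimal constant-velocity Kalman predict/update loop for a single scalar deformation. The motion model, noise levels, and observation source are illustrative stand-ins, not KFD-NeRF's actual components.

```python
import numpy as np

# State per point is [offset, velocity]; "observations" would come from the observation MLP.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # locally linear motion model (dt = 1)
H = np.array([[1.0, 0.0]])               # we observe only the offset
Q = 1e-4 * np.eye(2)                     # process noise (assumed)
R = np.array([[1e-2]])                   # observation noise (assumed)

x = np.zeros(2)                          # state estimate
P = np.eye(2)                            # state covariance

def kalman_step(x, P, z):
    # Predict from the motion equations.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the observation z.
    y = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + (K @ y).ravel()
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [0.1, 0.22, 0.29, 0.41]:        # toy per-frame deformation observations
    x, P = kalman_step(x, P, np.array([z]))
```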
Weight Conditioning for Smooth Optimization of Neural Networks
Hemanth Saratchandran · Thomas X Wang · Simon Lucey
In this article, we introduce a novel normalization technique for neural network weight matrices, which we term weight conditioning. This approach aims to narrow the gap between the smallest and largest singular values of the weight matrices, resulting in better-conditioned matrices. The inspiration for this technique partially derives from numerical linear algebra, where well-conditioned matrices are known to facilitate stronger convergence results for iterative solvers. We provide a theoretical foundation demonstrating that our normalization technique smooths the loss landscape, thereby enhancing convergence of stochastic gradient descent algorithms. Empirically, we validate our normalization across various neural network architectures, including Convolutional Neural Networks (CNNs), Vision Transformers (ViT), Neural Radiance Fields (NeRF), and 3D shape modeling. Our findings indicate that our normalization method is not only competitive but also outperforms existing weight normalization techniques from the literature.
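To make the goal concrete, the sketch below shows one classical way of narrowing the singular-value spread of a weight matrix, by clipping its smallest singular values via an SVD. This only illustrates the conditioning target (a smaller condition number) and is not the normalization proposed in the paper.

```python
import torch

def condition_by_singular_clipping(W: torch.Tensor, max_cond: float = 10.0) -> torch.Tensor:
    """Shrink the singular-value spread so that sigma_max / sigma_min <= max_cond.

    Illustration only: the paper's weight conditioning is a different (cheaper)
    normalization; this merely demonstrates the desired property."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    floor = S.max() / max_cond
    S_clipped = torch.clamp(S, min=floor)
    return U @ torch.diag(S_clipped) @ Vh

W = torch.randn(128, 64)
W_cond = condition_by_singular_clipping(W)
print(torch.linalg.cond(W).item(), torch.linalg.cond(W_cond).item())  # condition number drops
```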
URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields
Bo Xu · Liu Ziao · Mengqi GUO · jiancheng Li · Gim Hee Lee
We propose a novel rolling shutter bundle adjustment method for neural radiance fields (NeRF), which utilizes the unordered rolling shutter (RS) images to obtain the implicit 3D representation. Existing NeRF methods suffer from low-quality images and inaccurate initial camera poses due to the RS effect in the image, whereas the previous method that incorporates the RS into NeRF requires strict sequential data input, limiting its widespread applicability. In contrast, our method recovers the physical formation of RS images by estimating camera poses and velocities, thereby removing the input constraints on sequential data. Moreover, we adopt a coarse-to-fine training strategy, in which the RS epipolar constraints of the pairwise frames in the scene graph are used to detect the camera poses that fall into local minima. The poses detected as outliers are corrected by interpolation with neighboring poses. The experimental results validate the effectiveness of our method over state-of-the-art works and demonstrate that the reconstruction of 3D representations is not constrained by the requirement of video sequence input.
MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo
Ashish Tiwari · Satoshi Ikehata · Shanmuganathan Raman
Photometric stereo typically demands intricate data acquisition setups involving multiple light sources to recover surface normals accurately. In this paper, we propose an attention-based hourglass network that integrates single image-based inverse rendering and relighting within a single unified framework. We evaluate the performance of photometric stereo methods using these relit images and demonstrate how they can circumvent the underlying challenge of complex data acquisition. Our physically-based model is trained on a large synthetic dataset containing complex shapes with spatially varying BRDF and is designed to handle indirect illumination effects to improve material reconstruction and relighting. Through extensive qualitative and quantitative evaluation, we demonstrate that the proposed framework generalizes well to real-world images, achieving high-quality shape, material estimation, and relighting. We assess these synthetically relit images over photometric stereo benchmark methods for their physical correctness and resulting normal estimation accuracy, paving the way towards single-shot photometric stereo through physically-based relighting. This work allows us to address the single image-based inverse rendering problem holistically, applying well to both synthetic and real data and taking a step towards mitigating the challenge of data acquisition in photometric stereo. The code will be made available for research purposes.
TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks
Jinjie Mai · Wenxuan Zhu · Sara Rojas Martinez · Jesus Zarzar · Abdullah Hamdi · Guocheng Qian · Bing Li · Silvio Giancola · Bernard Ghanem
Neural radiance field (NeRF) generally requires many images with accurate poses, which do not reflect realistic setups where views can be sparse and poses are imprecise. Previous solutions for sparse and noisy NeRF only consider local geometry consistency with a pair of views. Closely following bundle adjustment in Structure-from-Motion (SfM), we introduce TrackNeRF for more globally consistent geometry reconstruction and more accurate pose optimization. TrackNeRF introduces feature tracks, i.e. connected pixel trajectories across all visible views that correspond to the same 3D points. By enforcing reprojection consistency among feature tracks, TrackNeRF encourages holistic 3D consistency explicitly. Through extensive experiments, TrackNeRF sets a new benchmark in noisy and sparse view reconstruction. In particular, TrackNeRF shows significant improvements over the state-of-the-art BARF and SPARF by 8 and 1 in terms of PSNR on DTU under various sparse and noisy view setups.
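The reprojection-consistency term that feature tracks enable is essentially the standard bundle-adjustment residual. The sketch below evaluates it for a single track over a toy two-camera setup; the intrinsics, poses, and pixel observations are chosen arbitrarily for illustration.

```python
import numpy as np

def reprojection_loss(point_3d, cams, track_pixels):
    """Average reprojection error of one 3D track point over all views that observe it.

    `cams` is a list of (K, R, t) tuples and `track_pixels` the matching 2D observations."""
    errors = []
    for (K, R, t), uv in zip(cams, track_pixels):
        cam_pt = R @ point_3d + t                  # world -> camera
        proj = K @ cam_pt
        proj = proj[:2] / proj[2]                  # perspective division
        errors.append(np.linalg.norm(proj - uv))
    return float(np.mean(errors))

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
cams = [(K, np.eye(3), np.array([0.0, 0, 0])),
        (K, np.eye(3), np.array([-0.2, 0, 0]))]    # toy stereo pair
pixels = [np.array([370.0, 240.0]), np.array([360.0, 240.0])]
print(reprojection_loss(np.array([0.5, 0.0, 5.0]), cams, pixels))
```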
FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting
Zehao Zhu · Zhiwen Fan · Yifan Jiang · Zhangyang Wang
Novel view synthesis from limited observations remains an important and persistent task. However, high efficiency in existing NeRF-based few-shot view synthesis is often compromised to obtain an accurate 3D representation. To address this challenge, we propose a Few-Shot view synthesis framework based on 3D Gaussian Splatting that enables real-time and photo-realistic view synthesis with as few as three training views. The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. Our method iteratively distributes new Gaussians around the most representative locations, subsequently infilling local details in vacant areas. We also integrate a large-scale pre-trained monocular depth estimator within the Gaussians optimization process, leveraging online augmented views to guide the geometric optimization towards an optimal solution. Starting from sparse points observed from limited input viewpoints, our FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, Shiny, and Blender. Code will be made available.
Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion
Otto Seiskari · Jerry Ylilammi · Valtteri Kaatrasalo · Pekka Rantalankila · Matias Turkulainen · Juho Kannala · Esa Rahtu · Arno Solin
High-quality scene reconstruction and novel view synthesis based on Gaussian Splatting (3DGS) typically require steady, high-quality photographs, often impractical to capture with handheld cameras. We present a method that adapts to camera motion and allows high-quality scene reconstruction with handheld video data suffering from motion blur and rolling shutter distortion. Our approach is based on detailed modelling of the physical image formation process and utilizes velocities estimated using visual-inertial odometry (VIO). Camera poses are considered non-static during the exposure time of a single image frame and camera poses are further optimized in the reconstruction process. We formulate a differentiable rendering pipeline that leverages screen space approximation to efficiently incorporate rolling-shutter and motion blur effects into the 3DGS framework. Our results with both synthetic and real data demonstrate superior performance in mitigating camera motion over existing methods, thereby advancing 3DGS in naturalistic settings.
DoubleTake: Geometry Guided Depth Estimation
Mohamed Sayed · Filippo Aleotti · Jamie Watson · Zawar Qureshi · Guillermo Garcia-Hernando · Gabriel Brostow · Sara Vicente · Michael Firman
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning etc. Prior work typically makes use of previous frames in a multi view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by giving the latest 3D geometry data as an extra input to our network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes and it is more regularized when compared with individual predicted depth maps for previous frames. We introduce a Hint MLP which combines cost volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry. We demonstrate that our method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
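A minimal sketch of a per-pixel "Hint MLP" that concatenates cost-volume features with a rendered prior depth and its confidence; the feature dimensions and layer widths are assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

class HintMLPSketch(nn.Module):
    """Per-pixel MLP fusing cost-volume features with a geometry hint (prior depth + confidence)."""
    def __init__(self, cost_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cost_dim + 2, hidden), nn.ReLU(),   # +2 channels: hint depth, hint confidence
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, cost_feat, hint_depth, hint_conf):
        # cost_feat: (B, H, W, cost_dim); hint_depth / hint_conf: (B, H, W, 1)
        x = torch.cat([cost_feat, hint_depth, hint_conf], dim=-1)
        return self.mlp(x)                                # (B, H, W, 1) fused per-pixel output

fused = HintMLPSketch()(torch.rand(1, 48, 64, 64), torch.rand(1, 48, 64, 1), torch.rand(1, 48, 64, 1))
```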
Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal
YUXIN WANG · Qianyi Wu · Guofeng Zhang · Dan Xu
This paper tackles the intricate challenge of object removal to update the radiance field using the 3D Gaussian Splatting. The main challenges of this task lie in the preservation of geometric consistency and the maintenance of texture coherence in the presence of the substantial discrete nature of Gaussian primitives. We introduce a robust framework specifically designed to overcome these obstacles. The key insight of our approach is the enhancement of information exchange among visible and invisible areas, facilitating content restoration in terms of both geometry and texture. Our methodology begins with optimizing the positioning of Gaussian primitives to improve geometric consistency across both removed and visible areas, guided by an online registration process informed by monocular depth estimation. Following this, we employ a novel feature propagation mechanism to bolster texture coherence, leveraging a cross-attention design that bridges sampling Gaussians from both uncertain and certain areas. This innovative approach significantly refines the texture coherence within the final radiance field. Extensive experiments validate that our method not only elevates the quality of novel view synthesis for scenes undergoing object removal but also showcases notable efficiency gains in training and rendering speeds.
SAGS: Structure-Aware 3D Gaussian Splatting
Evangelos Ververas · Rolandos Alexandros Potamias · Song Jifei · Jiankang Deng · Stefanos Zafeiriou
Following the advent of NeRFs, 3D Gaussian Splatting (3D-GS) has paved the way to real-time neural rendering overcoming the computational burden of volumetric methods. Following the pioneering work of 3D-GS, several methods have attempted to achieve compressible and high-fidelity performance. However, by employing a geometry-agnostic optimization scheme, these methods neglect the inherent 3D structure of the scene, thereby restricting the expressivity and the quality of the representation, resulting in floaters and various artifacts. In this work, we propose a structure-aware Gaussian Splatting method (SAGS) that implicitly encodes the geometry of the scene, which translates to state-of-the-art rendering performance and reduced storage requirements on benchmark novel-view synthesis datasets. SAGS is founded on a local-global graph representation that facilitates the learning of complex scenes and enforces meaningful point displacements that preserve the scene's geometry. Additionally, we introduce a lightweight version of SAGS, using a simple yet effective mid-point interpolation scheme, which showcases a compact representation of the scene with up to 20$\times$ size reduction without the reliance on any compression strategies. Extensive experiments across multiple benchmark datasets demonstrate the superiority of SAGS compared to state-of-the-art 3D-GS methods under both rendering quality and model size. Besides, we demonstrate that our structure-aware method can effectively mitigate floating artifacts and irregular distortions of previous methods while obtaining precise depth maps. Code and models will be publicly available.
Compact 3D Scene Representation via Self-Organizing Gaussian Grids
Wieland Morgenstern · Florian Barthel · Anna Hilsmann · Peter Eisert
3D Gaussian Splatting has recently emerged as a highly promising technique for modeling of static 3D scenes. In contrast to Neural Radiance Fields, it utilizes efficient rasterization allowing for very fast rendering at high quality. However, the storage size is significantly higher, which hinders practical deployment, e.g., on resource-constrained devices. In this paper, we introduce a compact scene representation organizing the parameters of 3D Gaussian Splatting (3DGS) into a 2D grid with local homogeneity, ensuring a drastic reduction in storage requirements without compromising visual quality during rendering. Central to our idea is the explicit exploitation of perceptual redundancies present in natural scenes. In essence, the inherent nature of a scene allows for numerous permutations of Gaussian parameters to equivalently represent it. To this end, we propose a novel highly parallel algorithm that regularly arranges the high-dimensional Gaussian parameters into a 2D grid while preserving their neighborhood structure. During training, we further enforce local smoothness between the sorted parameters in the grid. The uncompressed Gaussians use the same structure as 3DGS, ensuring a seamless integration with established renderers. Our method achieves a reduction factor of 17x to 42x in size for complex scenes with no increase in training time, marking a substantial leap forward in the domain of 3D scene distribution and consumption.
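The local-smoothness term used during training can be pictured as a simple total-variation-style penalty over the sorted 2D grid of Gaussian parameters; the channel count and equal weighting below are assumptions for illustration, not the paper's exact loss.

```python
import torch

def grid_smoothness_loss(grid: torch.Tensor) -> torch.Tensor:
    """Encourage neighboring grid cells to hold similar Gaussian parameters.

    `grid` is (H, W, C) with C the concatenated per-Gaussian attributes after sorting."""
    dh = (grid[1:, :, :] - grid[:-1, :, :]).abs().mean()   # vertical neighbors
    dw = (grid[:, 1:, :] - grid[:, :-1, :]).abs().mean()   # horizontal neighbors
    return dh + dw

params_2d = torch.rand(256, 256, 59)   # e.g. position, rotation, scale, opacity, SH coefficients
loss = grid_smoothness_loss(params_2d)
```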
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression
Yihang Chen · Qianyi Wu · Weiyao Lin · Mehrtash Harandi · Jianfei Cai
3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial number of Gaussians and their associated attributes necessitates effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To address this, we make use of the relations between the unorganized anchors and the structured hash grid, leveraging their mutual information for context modeling, and propose a Hash-grid Assisted Context (HAC) framework for highly compact 3DGS representation. Our approach introduces a binary hash grid to establish continuous spatial consistencies, allowing us to unveil the inherent spatial relations of anchors through a carefully designed context model. To facilitate entropy coding, we utilize Gaussian distributions to accurately estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Additionally, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Importantly, our work is the first to explore context-based compression for 3DGS representation, resulting in a remarkable size reduction of over 75X compared to vanilla 3DGS, while simultaneously improving fidelity, and achieving over 11X size reduction over the SOTA 3DGS compression approach Scaffold-GS. Our code will be publicly available.
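The entropy-coding cost under a Gaussian probability model can be sketched as follows: each quantized attribute's probability is the Gaussian mass over its quantization bin, and the code length is its negative log. In HAC the mean and scale would come from the hash-grid context model; here they are placeholder tensors.

```python
import torch
from torch.distributions import Normal

def estimated_bits(q_attr: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Differentiable estimate of total bits for integer-quantized attributes under per-element Gaussians."""
    dist = Normal(mu, sigma.clamp(min=1e-6))
    prob = dist.cdf(q_attr + 0.5) - dist.cdf(q_attr - 0.5)   # mass of the quantization bin
    return -torch.log2(prob.clamp(min=1e-9)).sum()

q = torch.round(torch.randn(1000) * 3)                        # stand-in quantized attributes
bits = estimated_bits(q, mu=torch.zeros(1000), sigma=3 * torch.ones(1000))
```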
GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction
Yuxuan Mu · Xinxin Zuo · Chuan Guo · Yilin Wang · Juwei Lu · Xiaofei Wu · Xu Songcen · Peng Dai · Youliang Yan · Li Cheng
We present GSD, a diffusion model approach based on Gaussian Splatting (GS) representation for 3D object reconstruction from a single view. Prior works suffer from inconsistent 3D geometry or mediocre rendering quality due to improper representations. We take a step towards resolving these shortcomings by utilizing the recent state-of-the-art 3D explicit representation, Gaussian Splatting, and an unconditional diffusion model. This model learns to generate 3D objects represented by sets of GS ellipsoids. With these strong generative 3D priors, though learning unconditionally, the diffusion model is ready for view-guided reconstruction without further model fine-tuning. This is achieved by propagating fine-grained 2D features through the efficient yet flexible splatting function and the guided denoising sampling process. In addition, a 2D diffusion model is further employed to enhance rendering fidelity, and improve reconstructed GS quality by polishing and re-using the rendered images. The final reconstructed objects explicitly come with high-quality 3D structure and texture, and can be efficiently rendered in arbitrary views. Experiments on the challenging real-world CO3D dataset demonstrate the superiority of our approach.
Concise Plane Arrangements for Low-Poly Surface and Volume Modelling
Raphael Sulzer · Florent Lafarge
Plane arrangements are a useful tool for surface and volume modelling. However, their main drawback is poor scalability. We introduce two key novelties that enable the construction of plane arrangements for complex objects and entire scenes: an ordering scheme for the plane insertion and the direct use of input points during arrangement construction. Both ingredients reduce the number of unwanted splits, resulting in improved scalability of the construction mechanism by up to two orders of magnitude compared to existing algorithms. We further introduce a remeshing and simplification technique that allows us to extract low-polygon surface meshes and lightweight convex decompositions of volumes from the arrangement. We show that our approach leads to state-of-the-art results for the aforementioned tasks by comparing it to learning-based and traditional approaches on various different datasets.
Gaussian Grouping: Segment and Edit Anything in 3D Scenes
Mingqiao Ye · Martin Danelljan · Fisher Yu · Lei Ke
The recent Gaussian Splatting achieves high-quality and real-time novel-view synthesis of the 3D scenes. However, it is solely concentrated on the appearance and geometry modeling, while lacking in fine-grained object-level scene understanding. To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes. We augment each Gaussian with a compact Identity Encoding, allowing the Gaussians to be grouped according to their object instance or stuff membership in the 3D scene. Instead of resorting to expensive 3D labels, we supervise the Identity Encodings during the differentiable rendering by leveraging the 2D mask predictions by Segment Anything Model (SAM), along with introduced 3D spatial consistency regularization. Compared to the implicit NeRF representation, we show that the discrete and grouped 3D Gaussians can reconstruct, segment and edit anything in 3D with high visual quality, fine granularity and efficiency. Based on Gaussian Grouping, we further propose a local Gaussian Editing scheme, which shows efficacy in versatile scene editing applications, including 3D object removal, inpainting, colorization, style transfer and scene recomposition. Our code and models will be released.
SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis
Huan-ang Gao · Mingju Gao · Jiaju Li · Wenyi Li · Rong Zhi · Hao Tang · HAO ZHAO
Semantic image synthesis (SIS) shows promising potential for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its image-level control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the root of these problems as a mismatch between the training-noised data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has yielded exceptional results, achieving an FID of 10.53 on Cityscapes and 12.66 on ADE20K. The code and models will be made publicly available.
STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians
Yifei Zeng · Yanqin Jiang · Siyu Zhu · Yuanxun Lu · Youtian Lin · Hao Zhu · Weiming Hu · Xun Cao · Yao Yao
Recent progress in pre-trained diffusion models and 3D generation have spurred interest in 4D content creation. However, achieving high-fidelity 4D generation with spatial-temporal consistency remains a challenge. In this work, we propose STAG4D, a novel framework that combines pre-trained diffusion models with dynamic 3D Gaussian splatting for high-fidelity 4D generation. Drawing inspiration from 3D generation techniques, we utilize a multi-view diffusion model to initialize multi-view images for each frame by anchoring on the corresponding input frame, where the video can be either real-world captured or generated by a video diffusion model. To ensure the temporal consistency of the multi-view sequence initialization, we introduce a simple yet effective fusion strategy to leverage the first frame as a temporal anchor in the cross-frame attention computation. With the almost consistent multi-view sequences, we then apply the score distillation sampling to optimize the 4D Gaussian point cloud. The 4D Gaussian splatting is specially crafted for the generation task, where an adaptive densification strategy is proposed to mitigate the unstable Gaussian gradient for robust optimization. Notably, the proposed pipeline does not require any pre-training or fine-tuning of diffusion networks, offering a more accessible and practical solution for the 4D generation task. Extensive experiments demonstrate that our method outperforms prior 4D generation works in rendering quality, spatial-temporal consistency, and generation robustness, setting a new state-of-the-art for 4D generation from diverse inputs, including text, image, and video.
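The temporal-anchor fusion can be pictured as cross-frame attention whose keys and values are the concatenation of the current frame's tokens and the first (anchor) frame's tokens. The sketch below omits projections and multi-head splitting and is only a schematic of that idea, not the model's attention layer.

```python
import torch

def anchored_cross_frame_attention(q_t, kv_t, kv_anchor):
    """Queries of the current frame attend jointly to current-frame and anchor-frame tokens."""
    k = torch.cat([kv_t, kv_anchor], dim=1)      # (B, N_t + N_anchor, D)
    v = k                                        # sharing k/v keeps the sketch minimal
    attn = torch.softmax(q_t @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v                              # (B, N_q, D)

B, N, D = 1, 64, 32
out = anchored_cross_frame_attention(torch.rand(B, N, D), torch.rand(B, N, D), torch.rand(B, N, D))
```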
Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable Repainting
Junwu Zhang · Zhenyu Tang · Yatian Pang · Xinhua Cheng · Peng Jin · Yida Wei · xing zhou · munan ning · Li Yuan
Recent image-to-3D methods achieve impressive results with plausible 3D geometry due to the development of diffusion models and optimization techniques. However, existing image-to-3D methods suffer from texture deficiencies in novel views, including multi-view inconsistency and quality degradation. To alleviate multi-view bias and enhance image quality in novel-view textures, we present Repaint123, a fast image-to-3D approach for creating high-quality 3D content with detailed textures. Repaint123 employs a progressive repainting strategy to simultaneously enhance the consistency and the quality of textures across different views, generating invisible regions according to visible textures, with the visibility map calculated by the depth alignment across views. Furthermore, multiple control techniques including reference-driven information injection and coarse-based depth guidance are introduced to alleviate the texture bias accumulated during the repainting process for improved consistency and quality. Extensive experiments demonstrate the superior ability of our method in creating 3D content with consistent and detailed textures in 2 minutes.
GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning
Animesh Karnewar · Roman Shapovalov · Tom Monnier · Andrea Vedaldi · Niloy Mitra · David Novotny
Encoding information from 2D views of an object into a 3D representation is crucial for generalized 3D feature extraction and learning. Such features then enable various 3D applications, including reconstruction and generation. We propose GOEmbed: Gradient Origin Embeddings that encodes input 2D images into any 3D representation, without requiring a pre-trained image feature extractor. This contrasts with typical prior approaches, in which input images are either encoded using 2D features extracted from large pre-trained models or with customized features designed to handle different 3D representations; worse, encoders may not yet be available for specialized 3D neural representations such as MLPs and Hash-grids. We extensively evaluate our proposed general-purpose GOEmbed under different experimental settings on the OmniObject3D benchmark. First, we evaluate how well the mechanism compares against prior encoding mechanisms on multiple 3D representations using an illustrative experiment called Plenoptic-Encoding. Second, the efficacy of the GOEmbed mechanism is further demonstrated by achieving a new SOTA FID of 22.12 on the OmniObject3D generation task using a combination of GOEmbed and DFM (Diffusion with Forward Models), which we call GOEmbedFusion. Finally, we evaluate how the GOEmbed mechanism bolsters sparse-view 3D reconstruction pipelines.
Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation
Yujin Chen · Yinyu Nie · Benjamin Ummenhofer · Reiner Birkl · Michael Paulitsch · Matthias Müller · Matthias Niessner
We present Mesh2NeRF, an approach to derive ground-truth radiance fields from textured meshes for 3D generation tasks. Many 3D generative approaches represent 3D scenes as radiance fields for training. Their ground-truth radiance fields are usually fitted from multi-view renderings from a large-scale synthetic 3D dataset, which often results in artifacts due to occlusions or under-fitting issues. In Mesh2NeRF, we propose an analytic solution to directly obtain ground-truth radiance fields from 3D meshes, characterizing the density field with an occupancy function featuring a defined surface thickness, and determining view-dependent color through a reflection function considering both the mesh and environment lighting. Mesh2NeRF extracts accurate radiance fields, which provide direct supervision for training generative NeRFs and single scene representation. We validate the effectiveness of Mesh2NeRF across various tasks, achieving a noteworthy 3.12dB improvement in PSNR for view synthesis in single scene representation on the ABO dataset, a 0.69 PSNR enhancement in the single-view conditional generation of ShapeNet Cars, and notably improved mesh extraction from NeRF in the unconditional generation of Objaverse Mugs.
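A crude stand-in for the mesh-derived density field: samples whose distance to the mesh surface falls within the defined surface thickness receive a large constant density, and everything else is empty. The constant used here is an assumption; the paper derives the density analytically so that the surface shell becomes fully opaque.

```python
import numpy as np

def analytic_density(signed_dist: np.ndarray, thickness: float = 0.01, opaque_sigma: float = 1e3):
    """Toy mesh-derived density: a constant-density shell of the given thickness around the surface."""
    inside_shell = np.abs(signed_dist) <= 0.5 * thickness
    return np.where(inside_shell, opaque_sigma, 0.0)

d = np.linspace(-0.05, 0.05, 11)           # distances of ray samples to the mesh surface
print(analytic_density(d))                 # non-zero only for samples inside the surface shell
```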
FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis
Vishnu Mani Hema · Shubhra Aich · Christian Haene · Jean-Charles Bazin · Fernando de la Torre
The advancement in deep implicit modeling and articulated models has significantly enhanced the process of digitizing human figures in 3D from just a single image. While state-of-the-art methods have greatly improved geometric precision, the challenge of accurately inferring texture remains, particularly in obscured areas such as the back of a person in frontal-view images. This limitation in texture prediction largely stems from the scarcity of large-scale and diverse 3D datasets, whereas their 2D counterparts are abundant and easily accessible. To address this issue, our paper proposes leveraging extensive 2D fashion datasets to enhance both texture and shape prediction in 3D human digitization. We incorporate 2D priors from the fashion dataset to learn the occluded back view, refined with our proposed domain alignment strategy. We then fuse this information with the input image to obtain a fully textured mesh of the given person. Through extensive experimentation on standard 3D human benchmarks, we demonstrate the superior performance of our approach in terms of both texture and geometry. Code and dataset are available at https://github.com/humansensinglab/FAMOUS.
Retargeting Visual Data with Deformation Fields
Tim Elsner · Julia Berger · Tong Wu · Victor Czech · LIN GAO · Leif Kobbelt
Seam carving is an image editing method that enables content-aware resizing, including operations like removing objects. However, the seam-finding strategy based on dynamic programming or graph-cut limits its applications to broader visual data formats and degrees of freedom for editing. Our observation is that describing the editing and retargeting of images more generally by a deformation field yields a generalisation of content-aware deformations. We propose to learn a deformation with a neural network that keeps the output plausible while trying to deform it only in places with low information content. This technique applies to different kinds of visual data, including images, 3D scenes given as neural radiance fields, or even polygon meshes. Experiments conducted on different visual data show that our method achieves better content-aware retargeting compared to previous methods.
LatentEditor: Text Driven Local Editing of 3D Scenes
Umar Khalid · Hasan Iqbal · Muhammad Tayyab · Md Nazmul Karim · Jing Hua · Chen Chen
While neural fields have made significant strides in view synthesis and scene reconstruction, editing them poses a formidable challenge due to their implicit encoding of geometry and texture information from multi-view inputs. In this paper, we introduce LatentEditor, an innovative framework designed to empower users with the ability to perform precise and locally controlled editing of neural fields using text prompts. Leveraging denoising diffusion models, we successfully embed real-world scenes into the latent space, resulting in a faster and more adaptable NeRF backbone for editing compared to traditional methods. To enhance editing precision, we introduce a delta score to calculate the 2D mask in the latent space that serves as a guide for local modifications while preserving irrelevant regions. Our novel pixel-level scoring approach harnesses the power of InstructPix2Pix (IP2P) to discern the disparity between IP2P conditional and unconditional noise predictions in the latent space. The edited latents conditioned on the 2D masks are then iteratively updated in the training set to achieve 3D local editing. Our approach achieves faster editing speeds and superior output quality compared to existing 3D editing models, bridging the gap between textual instructions and high-quality 3D scene editing in latent space. We show the superiority of our approach on four benchmark 3D datasets, LLFF, IN2N, NeRFStudio and NeRF-Art.
StyleCity: Large-Scale 3D Urban Scenes Stylization
Yingshu Chen · Huajian Huang · Tuan-Anh Vu · Ka Chun Shum · Sai Kit Yeung
Creating large-scale virtual urban scenes with variant styles is inherently challenging. To facilitate prototypes of virtual production and bypass the need for complex materials and lighting setups, we introduce the first vision-and-text-driven texture stylization system for large-scale urban scenes, StyleCity. Taking an image and text as references, StyleCity stylizes a 3D textured mesh of a large-scale urban scene in a semantics-aware fashion and generates a harmonic omnidirectional sky background. To achieve that, we propose to stylize a neural texture field by transferring 2D vision-and-text priors to 3D globally and locally. During 3D stylization, we progressively scale the planned training views of the input 3D scene at different levels in order to preserve high-quality scene content. We then optimize the scene style globally by adapting the scale of the style image with the scale of the training views. Moreover, we enhance local semantics consistency by the semantics-aware style loss which is crucial for photo-realistic stylization. Besides texture stylization, we further adopt a generative diffusion model to synthesize a style-consistent omnidirectional sky image, which offers a more immersive atmosphere and assists the semantic stylization process. The stylized neural texture field can be baked into an arbitrary-resolution texture, enabling seamless integration into conventional rendering pipelines and significantly easing the virtual production prototyping process. Extensive experiments demonstrate our stylized scenes' superiority in qualitative and quantitative performance and user preferences.
Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering
Ruofan Liang · Zan Gojcic · Merlin Nimier-David · David Acuna · Nandita Vijaykumar · Sanja Fidler · Zian Wang
The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently ``understand'' the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.
Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation
Zongrui Li · Minghui Hu · Qian Zheng · Xudong Jiang
Recent advancements in text-to-3D generation significantly improved outcome quality despite issues like oversaturation and missing detail. Upon thoroughly examining score distillation, we have identified similarities between score and consistency distillation methods. In this work, we first elucidate the equivalent formulation of score distillation via the consistency function format, which mitigates the divide between text-to-image distillation and text-to-3D distillation and benefits the research in both fields. Our investigation further reveals that current methods under the consistency distillation framework for 3D contexts still have limitations, such as distillation errors and trajectory inconsistencies. To address this issue, we propose Guided Consistency Sampling (GCS), which is seamlessly integrated with advanced rendering techniques and Gaussian Splatting (GS). GCS consists of three parts: a compact consistency loss leads to a better generator for the origin, a conditional guidance score enriches the detail, and a constraint on the pixel domain enhances color and lighting. Furthermore, we have observed persistent 3D rendering issues inherent in the optimization process of GS-based methods, such as oversaturation. We attribute this to the accumulation of highlights and introduce the Brightness-equalized Generation (BEG) method to effectively alleviate the problem. Experimental results demonstrate that our approach yields higher-quality outcomes with more intricate details than the current state-of-the-art methods.
DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors
Zizheng Yan · Jiapeng Zhou · Fanpeng Meng · Yushuang Wu · Lingteng Qiu · Zisheng Ye · Shuguang Cui · Guanying Chen · XIAOGUANG HAN
Text-to-3D generation has recently seen significant progress. To enhance its practicality in real-world applications, it is crucial to generate multiple independent objects with interactions, similar to layer-compositing in 2D image editing. However, existing text-to-3D methods struggle with this task, as they are designed to generate either non-independent objects or independent objects lacking spatially plausible interactions. Addressing this, we propose DreamDissector, a text-to-3D method capable of generating multiple independent objects with interactions. DreamDissector accepts a multi-object text-to-3D NeRF as input and produces independent textured meshes. To achieve this, we introduce the Neural Category Field (NeCF) for disentangling the input NeRF. Additionally, we present the Category Score Distillation Sampling (CSDS), facilitated by a Deep Concept Mining (DCM) module, to tackle the concept gap issue in diffusion models. By leveraging NeCF and CSDS, we can effectively derive sub-NeRFs from the original scene. Further refinement enhances geometry and texture. Our experimental results validate the effectiveness of DreamDissector, providing users with novel means to control 3D synthesis at the object level and potentially opening avenues for various creative applications in the future.
InterFusion: Text-Driven Generation of 3D Human-Object Interaction
Sisi Dai · Wenhao Li · Haowen Sun · Haibin Huang · Chongyang Ma · Hui Huang · Kai Xu · Ruizhen Hu
In this study, we tackle the complex task of generating 3D human-object interactions (HOI) from textual descriptions in a zero-shot text-to-3D manner. We identify and address two key challenges: the unsatisfactory outcomes of direct text-to-3D methods in HOI, largely due to the lack of paired text-interaction data, and the inherent difficulties in simultaneously generating multiple concepts with complex spatial relationships. To effectively address these issues, we present InterFusion, a two-stage framework specifically designed for HOI generation. InterFusion involves human pose estimations derived from text as geometric priors, which simplifies the text-to-3D conversion process and introduces additional constraints for accurate object generation. At the first stage, InterFusion extracts 3D human poses from a synthesized image dataset depicting a wide range of interactions, subsequently mapping these poses to interaction descriptions. The second stage of InterFusion capitalizes on the latest developments in text-to-3D generation, enabling the production of realistic and high-quality 3D HOI scenes. This is achieved through a local-global optimization process, where the generation of human body and object is optimized separately, and jointly refined with a global optimization of the entire scene, ensuring a seamless and contextually coherent integration. Our experimental results affirm that InterFusion significantly outperforms existing state-of-the-art methods in 3D HOI generation.
Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models
Zhengming Yu · Zhiyang Dou · Xiaoxiao Long · Cheng Lin · Zekun Li · Yuan Liu · Norman Müller · Taku Komura · Marc Habermann · Christian Theobalt · Xin Li · Wenping Wang
We present Surf-D, a novel method for generating high-quality 3D shapes as Surfaces with arbitrary topologies using Diffusion models. Previous methods explored shape generation with different representations and they suffer from limited topologies and poor geometry details. To generate high-quality surfaces of arbitrary topologies, we use the Unsigned Distance Field (UDF) as our surface representation to accommodate arbitrary topologies. Furthermore, we propose a new pipeline that employs a point-based AutoEncoder to learn a compact and continuous latent space for accurately encoding UDF and support high-resolution mesh extraction. We further show that our new pipeline significantly outperforms the prior approaches to learning the distance fields, such as the grid-based AutoEncoder, which is not scalable and incapable of learning accurate UDF. In addition, we adopt a curriculum learning strategy to efficiently embed various surfaces. With the pretrained shape latent space, we employ a latent diffusion model to acquire the distribution of various shapes. Extensive experiments are presented on using Surf-D for unconditional generation, category conditional generation, image conditional generation, and text-to-shape tasks. The experiments demonstrate the superior performance of Surf-D in shape generation across multiple modalities as conditions. The code will be publicly available upon paper publication.
There exist many classical parametric 3D shape models but creating novel shapes with such models requires expert knowledge of their parameters. For example, imagine creating a specific type of tree using procedural graphics or a new kind of animal from a statistical shape model. Our key idea is to leverage language to control such existing models to produce novel shapes. This involves learning a mapping between the latent space of a vision-language model and the parameter space of the 3D model, which we do using a small set of shape and text pairs. Our hypothesis is this mapping from language to parameters allows us to generate parameters for objects that were never seen during training. If the mapping between language and parameters is sufficiently smooth, then interpolation or generalization in language should translate appropriately into novel 3D shapes. We test our approach with two very different types of parametric shape models (quadrupeds and arboreal trees). We use a learned statistical shape model of quadrupeds and show that we can use text to generate new animals not present during training. In particular, we demonstrate state-of-the-art shape estimation of 3D dogs. This work also constitutes the first language-driven method for generating 3D trees. Finally, embedding images in the CLIP latent space enables us to generate animals and trees directly from images.
Improving Virtual Try-On with Garment-focused Diffusion Models
Siqi Wan · Yehao Li · Jingwen Chen · Yingwei Pan · Ting Yao · Yang Cao · Tao Mei
Diffusion models have revolutionized generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models for synthesizing an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty originates from the aspect that the diffusion process should not only produce a holistically high-fidelity photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial high-frequency details. Extensive experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches.
GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns
Maria Korosteleva · Timur Levent Kesdogan · Fabian Kemper · Stephan Wenninger · Jasmin Koller · Yuhan Zhang · Mario Botsch · Olga Sorkine-Hornung
Recent research interest in learning-based processing of garments, from virtual fitting to generation and reconstruction, stumbles on a scarcity of high-quality public data in the domain. We contribute to resolving this need by presenting the first large-scale synthetic dataset of 3D made-to-measure garments with sewing patterns, as well as its generation pipeline. GarmentCodeData contains 115,000 data points that cover a variety of designs in many common garment categories: tops, shirts, dresses, jumpsuits, skirts, pants, etc., fitted to a variety of body shapes sampled from a custom statistical body model based on CAESAR, as well as a standard reference body shape, applying three different textile materials. To enable the creation of datasets of such complexity, we introduce a set of algorithms for automatically taking tailor's measures on sampled body shapes, sampling strategies for sewing pattern design, and propose an automatic, open-source 3D garment draping pipeline based on a fast XPBD simulator, while contributing several solutions for collision resolution and drape correctness to enable scalability.
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance
Shenhao Zhu · Junming Chen · Zuozhuo Dai · Zilong Dong · Yinghui Xu · Xun Cao · Yao Yao · Hao Zhu · Siyu Zhu
In the present study, we introduce a methodology for human image animation that exploits a 3D human parametric model within a latent diffusion framework to improve shape alignment and motion guidance in contemporary human generative techniques. Our method employs the SMPL model as the 3D human parametric model to provide a unified representation of body shape and pose, facilitating the capture of intricate human geometry and motion characteristics from source videos and representing an obvious advancement in the generation of dynamic human videos. Specifically, we incorporate the rendered depth images, normal maps, and semantic maps derived from the SMPL sequences, along with skeleton-based motion guidance, to enrich the input to the latent diffusion model with comprehensive 3D shape information and detailed pose attributes. By weighting the shape and motion latent representations through self-attention mechanisms in the spatial domain, we utilize a multi-layer semantic fusion of these latent representations as a conditioning in the latent diffusion model for human image animation. The effectiveness and versatility of our methodology have been verified through extensive experiments conducted on various datasets, demonstrating its ability to generate high-quality human animations that accurately capture both pose and shape variations.
DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects
Dominik Bauer · Zhenjia Xu · Shuran Song
Manipulation of elastoplastic objects like dough often involves topological changes such as splitting and merging. The ability to accurately predict these topological changes that a specific action might incur is critical for planning interactions with elastoplastic objects. We present DoughNet, a Transformer-based architecture for handling these challenges, consisting of two components. First, a denoising autoencoder represents deformable objects of varying topology as sets of latent codes. Second, a visual predictive model performs autoregressive set prediction to determine long-horizon geometrical deformation and topological changes purely in latent space. Given a partial initial state and desired manipulation trajectories, it infers all resulting object geometries and topologies at each step. Our experiments in simulated and real environments show that DoughNet is able to significantly outperform related approaches that consider deformation only as geometrical change.
Generating 3D House Wireframes with Semantics
Xueqi Ma · Yilin Liu · Wenjun Zhou · Ruowei Wang · Hui Huang
We present a new approach for generating 3D house wireframes with semantic enrichment using an autoregressive model. Unlike conventional generative models that independently process vertices, edges, and faces, our approach employs a unified wire-based representation for improved coherence in learning 3D wireframe structures. By re-ordering wire sequences based on semantic meanings, we facilitate seamless semantic integration during sequence generation. Our two-phase technique merges a graph-based autoencoder with a transformer-based decoder to learn latent geometric tokens and generate semantic-aware wireframes. Through iterative prediction and decoding during inference, our model produces detailed wireframes that can be easily segmented into distinct components, such as walls, roofs, and rooms, reflecting the semantic essence of the shape. Empirical results on a comprehensive house dataset validate the superior accuracy, novelty, and semantic fidelity of our model compared to existing generative models.
LayoutFlow: Flow Matching for Layout Generation
Julian Jorge Andrade Guerreiro · Naoto Inoue · Kento Masui · Mayu Otani · Hideki Nakayama
Finding a suitable layout represents a crucial task for diverse applications in graphic design. Motivated by simpler and smoother sampling trajectories, we explore the use of Flow Matching as an alternative to current diffusion-based layout generation models. Specifically, we propose LayoutFlow, an efficient flow-based model capable of generating high-quality layouts. Instead of progressively denoising the elements of a noisy layout, our method learns to gradually move, or flow, the elements of an initial sample until it reaches its final prediction. In addition, we employ a conditioning scheme that allows us to handle various generation tasks with varying degrees of conditioning with a single model. Empirically, LayoutFlow performs on par with state-of-the-art models while being significantly faster.
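For readers unfamiliar with Flow Matching, the sketch below shows a generic conditional flow-matching training step on flattened layout boxes: sample a point on the straight path between noise and data and regress the constant velocity of that path. The tiny MLP and the layout encoding are placeholders, not LayoutFlow's architecture.

```python
# Generic flow-matching training step for layouts (sketch, not LayoutFlow itself).
# A layout is encoded as N boxes with 4 normalized coordinates each.
import torch
import torch.nn as nn

N_ELEMENTS, DIM = 8, 4
net = nn.Sequential(nn.Linear(N_ELEMENTS * DIM + 1, 256), nn.ReLU(),
                    nn.Linear(256, N_ELEMENTS * DIM))

def flow_matching_loss(x1):
    """x1: (B, N*4) ground-truth layouts; x0 is a noise sample."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)
    t = torch.rand(b, 1)
    xt = (1 - t) * x0 + t * x1          # point on the straight path
    target_velocity = x1 - x0           # constant velocity of that path
    pred = net(torch.cat([xt, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

x1 = torch.rand(16, N_ELEMENTS * DIM)   # stand-in for normalized layout boxes
loss = flow_matching_loss(x1)
loss.backward()
```

At inference, one integrates dx/dt = v(x, t) from t = 0 to 1 (e.g., with a simple Euler scheme) to move an initial noise sample to a layout, which is what gives the smoother sampling trajectories mentioned above.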
Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching
Dongliang Cao · Zorah Laehner · Florian Bernard
Most recent unsupervised non-rigid 3D shape matching methods are based on the functional map framework due to its efficiency and superior performance. Nevertheless, respective methods struggle to obtain spatially smooth pointwise correspondences due to the lack of proper regularisation. In this work, inspired by the success of message passing on graphs, we propose a synchronous diffusion process, which we use as regularisation to achieve smoothness in non-rigid 3D shape matching problems. The intuition of synchronous diffusion is that diffusing the same input function on two different shapes results in consistent outputs. Using different challenging datasets, we demonstrate that our novel regularisation can substantially improve the state-of-the-art in shape matching, especially in the presence of topological noise.
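The consistency intuition can be illustrated with a toy regularizer: diffusing a probe function on one shape and then mapping it to the other should agree with mapping first and diffusing on the second shape. The sketch below uses a single implicit heat-diffusion step and dense stand-in Laplacians; it illustrates the principle only, not the paper's formulation.

```python
# Sketch of a synchronous-diffusion consistency term. Assumptions: dense mesh
# Laplacians L1, L2 and a soft correspondence matrix Pi mapping shape 1 -> 2.
import torch

def heat_diffuse(L, f, t=0.1):
    # One implicit heat-diffusion step: solve (I + t L) u = f.
    n = L.shape[0]
    return torch.linalg.solve(torch.eye(n) + t * L, f)

def sync_diffusion_loss(L1, L2, Pi, f1):
    """Diffuse-then-map should match map-then-diffuse."""
    diffused_then_mapped = Pi @ heat_diffuse(L1, f1)
    mapped_then_diffused = heat_diffuse(L2, Pi @ f1)
    return ((diffused_then_mapped - mapped_then_diffused) ** 2).mean()

n1, n2, k = 100, 120, 16
L1 = torch.eye(n1)                         # stand-ins for cotangent Laplacians
L2 = torch.eye(n2)
Pi = torch.rand(n2, n1).softmax(dim=-1)    # soft pointwise map, rows sum to 1
f1 = torch.randn(n1, k)                    # random probe functions on shape 1
print(sync_diffusion_loss(L1, L2, Pi, f1))
```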
Scalar Function Topology Divergence: Comparing Topology of 3D Objects
Ilya Trofimov · Daria Voronkova · Eduard Tulchinskii · Evgeny Burnaev · Serguei Barannikov
We propose a new topological tool for computer vision - Scalar Function Topology Divergence (SFTD), which measures the dissimilarity of multi-scale topology between sublevel sets of two functions having a common domain. Functions can be defined on an undirected graph or Euclidean space of any dimensionality. Most of the existing methods for comparing topology are based on Wasserstein distance between persistence barcodes and they don't take into account the localization of topological features. On the other hand, the minimization of SFTD ensures that the corresponding topological features of scalar functions are located in the same places. The proposed tool provides useful visualizations depicting areas where functions have topological dissimilarities. We provide applications of the proposed method to 3D computer vision. In particular, experiments demonstrate that SFTD improves the reconstruction of cellular 3D shapes from 2D fluorescence microscopy images, and helps to identify topological errors in 3D segmentation.
DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction
Yuxin Yao · Siyu Ren · Junhui Hou · Zhi Deng · Juyong Zhang · Wenping Wang
This paper explores the problem of reconstructing temporally consistent surfaces from a 3D point cloud sequence without correspondence. To address this challenging task, we propose DynoSurf, an unsupervised learning framework integrating a template surface representation with a learnable deformation field. Specifically, we design a coarse-to-fine strategy for learning the template surface based on the deformable tetrahedron representation. Furthermore, we propose a learnable deformation representation based on the learnable control points and blending weights, which can deform the template surface non-rigidly while maintaining the consistency of the local shape. Experimental results demonstrate the significant superiority of DynoSurf over current state-of-the-art approaches, showcasing its potential as a powerful tool for dynamic mesh reconstruction. The code will be publicly available.
Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement
Hao Xu · Xi Zhang · Xiaolin Wu
Compressing a set of unordered points is far more challenging than compressing images/videos of regular sample grids, because of the difficulties in characterizing neighboring relations in an irregular layout of points. Many researchers resort to voxelization to introduce regularity, but this approach suffers from quantization loss. In this research, we use the KNN method to determine the neighborhoods of raw surface points. This gives us a means to determine the spatial context in which the latent features of 3D points are compressed by arithmetic coding. As such, the conditional probability model is adaptive to local geometry, leading to significant rate reduction. Additionally, we propose a dual-layer architecture where a non-learning base layer reconstructs the main structures of the point cloud at low complexity, while a learned refinement layer focuses on preserving fine details. This design leads to reductions in model complexity and coding latency by two orders of magnitude compared to SOTA methods. Moreover, we incorporate an implicit neural representation (INR) into the refinement layer, allowing the decoder to sample points on the underlying surface at arbitrary densities. This work is the first to effectively exploit content-aware local contexts for compressing irregular raw point clouds, achieving high rate-distortion performance, low complexity, and the ability to function as an arbitrary-scale upsampling network simultaneously.
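As a rough illustration of using KNN neighborhoods as spatial context, the snippet below gathers the features of each point's nearest neighbors, which a conditional probability model could then condition on for arithmetic coding; the brute-force distance computation and feature sizes are simplifications.

```python
# Sketch: KNN neighborhoods on raw points as spatial context for an entropy
# model (illustrative only; the actual coder design is more involved).
import torch

def knn_context(points, features, k=16):
    """points: (N, 3), features: (N, C). Returns per-point neighbor features
    (N, k, C) that a conditional probability model can condition on."""
    dists = torch.cdist(points, points)                          # (N, N)
    knn_idx = dists.topk(k + 1, largest=False).indices[:, 1:]    # drop self
    return features[knn_idx]                                     # (N, k, C)

pts = torch.rand(2048, 3)
feats = torch.randn(2048, 8)
ctx = knn_context(pts, feats)
print(ctx.shape)   # torch.Size([2048, 16, 8])
```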
FLAT: Flux-aware Imperceptible Adversarial Attacks on 3D Point Clouds
Keke Tang · Lujie Huang · Weilong Peng · Daizong Liu · Xiaofei Wang · Yang Ma · Ligang Liu · Zhihong Tian
Adversarial attacks on point clouds play a vital role in assessing and enhancing the adversarial robustness of 3D deep learning models. While employing a variety of geometric constraints, existing adversarial attack solutions often display unsatisfactory imperceptibility due to inadequate consideration of uniformity changes. In this paper, we propose FLAT, a novel framework designed to generate imperceptible adversarial point clouds by addressing the issue from a flux perspective. Specifically, during adversarial attacks, we assess the extent of uniformity alterations by calculating the flux of the local perturbation vector field. Upon identifying a high flux, which signals potential disruption in uniformity, the directions of the perturbation vectors are adjusted to minimize these alterations, thereby improving imperceptibility. Extensive experiments validate the effectiveness of FLAT in generating imperceptible adversarial point clouds, and its superiority to the state-of-the-art methods. Codes and pretrained models will be made public upon paper acceptance.
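A simple way to picture the flux idea is to measure, per point, how much the perturbation vectors of its neighbors point outward from the local neighborhood. The snippet below is an illustrative approximation of such a flux score, not the paper's exact definition.

```python
# Sketch of a flux-style uniformity measure on a perturbed point cloud
# (an illustrative approximation, not FLAT's exact formulation).
import torch

def local_flux(points, perturb, k=16):
    """points: (N, 3) clean points, perturb: (N, 3) perturbation vectors.
    Approximates, per point, the outward flux of the perturbation field
    through its k-NN neighborhood."""
    dists = torch.cdist(points, points)
    idx = dists.topk(k + 1, largest=False).indices[:, 1:]          # (N, k)
    nbr_dirs = points[idx] - points.unsqueeze(1)                   # (N, k, 3)
    nbr_dirs = nbr_dirs / (nbr_dirs.norm(dim=-1, keepdim=True) + 1e-8)
    # Dot product of each neighbor's perturbation with the outward direction.
    flux = (perturb[idx] * nbr_dirs).sum(-1).mean(-1)              # (N,)
    return flux

pts = torch.rand(1024, 3)
delta = 0.01 * torch.randn(1024, 3)
print(local_flux(pts, delta).abs().mean())  # large values flag uniformity changes
```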
Frugal 3D Point Cloud Model Training via Progressive Near Point Filtering and Fused Aggregation
Donghyun Lee · Yejin Lee · Jae W. Lee · Hongil Yoon
The increasing demand for higher accuracy and the rapid growth of 3D point cloud datasets have led to significantly higher training costs for 3D point cloud models in terms of both computation and memory bandwidth. Despite this, research on reducing this cost is relatively sparse. This paper identifies inefficiencies in unique operations of the 3D point cloud training pipeline: farthest point sampling (FPS) and the forward and backward aggregation passes. To address these inefficiencies, we propose novel training optimizations that reduce the redundant computation and memory accesses resulting from these operations. Firstly, we introduce Lightweight FPS (L-FPS), which employs progressive near point filtering to eliminate the redundant distance calculations inherent in the original farthest point sampling. Secondly, we introduce the fused aggregation technique, which utilizes kernel fusion to reduce redundant memory accesses during the forward and backward aggregation passes. We apply these techniques to state-of-the-art PointNet-based models and evaluate their performance on an NVIDIA RTX 3090 GPU. Our experimental results demonstrate a 2.25x training time reduction on average with no accuracy drop.
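The sketch below illustrates the progressive near-point filtering idea on top of vanilla farthest point sampling: points that are already very close to the selected set cannot become the next farthest point, so they are dropped from subsequent distance updates. The threshold and filtering schedule here are arbitrary choices for illustration, not the paper's implementation.

```python
# Sketch of farthest point sampling with a progressive near-point filter
# (illustrative approximation of the idea).
import torch

def fps_with_filtering(points, m, filter_ratio=0.05):
    n = points.shape[0]
    min_dist = torch.full((n,), float("inf"))
    active = torch.ones(n, dtype=torch.bool)       # points still worth updating
    selected = [0]
    for _ in range(m - 1):
        d = (points - points[selected[-1]]).pow(2).sum(-1)
        min_dist = torch.minimum(min_dist, d)
        dist_for_pick = min_dist.clone()
        dist_for_pick[~active] = -1.0              # filtered points never win
        selected.append(int(dist_for_pick.argmax()))
        # Progressively drop points already very close to the selected set,
        # keeping enough candidates for the remaining picks.
        if int(active.sum()) > 4 * (m - len(selected)):
            active &= min_dist > filter_ratio * min_dist[active].max()
    return torch.tensor(selected)

pts = torch.rand(4096, 3)
idx = fps_with_filtering(pts, 512)
print(idx.shape)   # torch.Size([512])
```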
SemReg: Semantics Constrained Point Cloud Registration
Sheldon Fung · Xuequan Lu · Dasith de Silva Edirimuni · Wei Pan · Xiao Liu · HONGDONG LI
Despite the recent success of Transformers in point cloud registration, the cross-attention mechanism, while enabling point-wise feature exchange between point clouds, suffers from redundant feature interactions among semantically unrelated regions. Additionally, recent methods rely only on 3D information to extract robust feature representations, while overlooking the rich semantic information in 2D images. In this paper, we propose SemReg, a novel 2D-3D cross-modal framework that exploits semantic information in 2D images to enhance the learning of rich and robust feature representations for point cloud registration. In particular, we design a Gaussian Mixture Semantic Prior that fuses 2D semantic features across RGB frames to reveal semantic correlations between regions across the point cloud pair. Subsequently, we propose the Semantics Guided Feature Interaction module that uses this prior to emphasize the feature interactions between the semantically similar regions while suppressing superfluous interactions during the cross-attention stage. In addition, we design a Semantics Aware Focal Loss that facilitates the learning of robust features, and a Semantics Constrained Matching module that performs matching only between the regions sharing similar semantics. We evaluate our proposed SemReg on the public indoor (3DMatch) and outdoor (KITTI) datasets, and experimental results show that it produces superior registration performance to state-of-the-art techniques.
GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding
Changshuo Wang · Meiqing Wu · Siew-Kei Lam · Xin Ning · Shangshu Yu · Ruiping Wang · Weijun Li · Thambipillai Srikanthan
Despite the significant advancements in pre-training methods for point cloud understanding, directly capturing intricate shape information from irregular point clouds without reliance on external data remains a formidable challenge. To address this problem, we propose GPSFormer, an innovative Global Perception and Local Structure Fitting-based transformer, which learns detailed shape information from point clouds with remarkable precision. The core of GPSFormer is the Global Perception Module (GPM) and the Local Structure Fitting Convolution (LSFConv). Specifically, GPM utilizes Adaptive Deformable Graph Convolution (ADGConv) to identify short-range dependencies among similar features in the feature space and employs Multi-Head Attention (MHA) to learn long-range dependencies across all positions within the feature space, ultimately enabling flexible learning of contextual representations. Inspired by Taylor series, we design LSFConv, which learns both low-order fundamental and high-order refinement information from explicitly encoded local geometric structures. Integrating the GPM and LSFConv as fundamental components, we construct GPSFormer, a cutting-edge Transformer that effectively captures global and local structures of point clouds. Extensive experiments validate GPSFormer's effectiveness in three point cloud tasks: shape classification, part segmentation, and few-shot learning.
Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning
Yuehui Han · Can Xu · Rui Xu · Jianjun Qian · Jin Xie
Self-supervised representation learning on point cloud sequences is a challenging task due to the complex spatio-temporal structure. Most recent attempts aim to train the point cloud sequence representation model by reconstructing point coordinates or designing frame-level contrastive learning. However, these methods do not effectively explore the temporal dimension and global semantics, which are very important components of point cloud sequences. To this end, in this paper, we propose a novel masked motion prediction and semantic contrast (M2PSC) based self-supervised representation learning framework for point cloud sequences. Specifically, it aims to learn a representation model by integrating three pretext tasks into the same masked autoencoder framework. First, motion trajectory prediction, which can enhance the model's ability to understand dynamic information in point cloud sequences. Second, semantic contrast, which can guide the model to better explore the global semantics of point cloud sequences. Third, appearance reconstruction, which can help capture the appearance information of point cloud sequences. In this way, our method can force the model to simultaneously encode spatial and temporal structure in the point cloud sequences. Experimental results on four benchmark datasets demonstrate the effectiveness of our method.
Autonomous driving demands high-quality LiDAR data, yet the cost of physical LiDAR sensors presents a significant scaling-up challenge. While recent efforts have explored deep generative models to address this issue, they often consume substantial computational resources with slow generation speeds while suffering from a lack of realism. To address these limitations, we introduce RangeLDM, a novel approach for rapidly generating high-quality range-view LiDAR point clouds via latent diffusion models. We achieve this by correcting range-view data distribution for accurate projection from point clouds to range images via Hough voting, which has a critical impact on generative learning. We then compress the range images into a latent space with a variational autoencoder, and leverage a diffusion model to enhance expressivity. Additionally, we instruct the model to preserve 3D structural fidelity by devising a range-guided discriminator. Experimental results on KITTI-360 and nuScenes datasets demonstrate both the robust expressiveness and fast speed of our LiDAR point cloud generation.
Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data
Tuo FENG · Wenguan Wang · Ruijie Quan · Yi Yang
Current self-supervised learning methods for 3D scenes face a data desert issue, resulting from the time-consuming and expensive collection process of 3D scene data. Conversely, 3D shape datasets are easier to collect. Despite this, existing pre-training strategies on shape data offer limited potential for 3D scene understanding due to significant disparities in point quantities. To tackle these challenges, we propose Shape2Scene (S2S), a novel method that learns representations of large-scale 3D scenes from 3D shape data. We first design multi-scale and high-resolution backbones for shape- and scene-level 3D tasks, i.e., MH-P (point-based) and MH-V (voxel-based). MH-P/V establishes direct paths to high-resolution features that capture deep semantic information across multiple scales. This pivotal nature makes them suitable for a wide range of 3D downstream tasks that tightly rely on high-resolution features. We then employ a Shape-to-Scene strategy (S2SS) to amalgamate points from various shapes, creating a random pseudo scene (comprising multiple objects) for training data, mitigating disparities between shapes and scenes. Finally, a point-point contrastive loss (PPC) is applied for the pre-training of MH-P/V. In PPC, the inherent correspondence (i.e., point pairs) is naturally obtained in S2SS. Extensive experiments have demonstrated the transferability of 3D representations learned by MH-P/V across shape-level and scene-level 3D tasks. MH-P achieves notable performance on well-known point cloud datasets (93.8% OA on ScanObjectNN and 87.6% instance mIoU on ShapeNetPart). MH-V also achieves promising performance in 3D semantic segmentation and 3D object detection. Our code will be released.
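The Shape-to-Scene idea of assembling pseudo scenes from object-level point clouds can be sketched as follows: each shape is placed with a random rigid transform and its per-point shape ID is recorded (in the actual method, such correspondences feed the point-point contrastive loss). Transform ranges and sizes below are arbitrary.

```python
# Sketch: compose a pseudo scene from several object-level point clouds
# (illustrative of the shape-to-scene strategy, not the paper's exact recipe).
import math
import random
import torch

def compose_pseudo_scene(shapes):
    """shapes: list of (N_i, 3) object point clouds.
    Returns scene points and per-point shape IDs."""
    pts, ids = [], []
    for sid, p in enumerate(shapes):
        theta = random.uniform(0, 2 * math.pi)
        R = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                          [math.sin(theta),  math.cos(theta), 0.0],
                          [0.0, 0.0, 1.0]])
        t = torch.tensor([random.uniform(-3, 3), random.uniform(-3, 3), 0.0])
        pts.append(p @ R.t() + t)                      # place the shape
        ids.append(torch.full((p.shape[0],), sid))     # remember its origin
    return torch.cat(pts), torch.cat(ids)

shapes = [torch.rand(1024, 3) for _ in range(4)]
scene, ids = compose_pseudo_scene(shapes)
print(scene.shape, ids.unique())
```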
SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
Yang Miao · Francis Engelmann · Olga Vysotska · Federico Tombari · Marc Pollefeys · Daniel Barath
This paper introduces a novel problem, i.e., the localization of an input image within a multi-modal reference map represented by a database of 3D scene graphs. These graphs comprise multiple modalities, including object-level point clouds, images, attributes, and relationships between objects, offering a lightweight and efficient alternative to conventional methods that rely on extensive image databases. Given the available modalities, the proposed method SceneGraphLoc learns a fixed-sized embedding for each node (i.e., representing an object instance) in the scene graph, enabling effective matching with the objects visible in the input query image. This strategy significantly outperforms other cross-modal methods, even without incorporating images into the map embeddings. When images are leveraged, SceneGraphLoc achieves performance close to that of state-of-the-art techniques depending on large image databases, while requiring three orders-of-magnitude less storage and operating orders-of-magnitude faster. The code will be made public.
Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization
Ming-Yang Ho · Che-Ming Wu · Min-Sheng Wu · Yufeng Jane Tseng
Recent advancements in ultra-high-resolution unpaired image-to-image translation have aimed to mitigate the constraints imposed by limited GPU memory through patch-wise inference. Nonetheless, existing methods often compromise between the reduction of noticeable tiling artifacts and the preservation of color and hue contrast, attributed to the reliance on global image- or patch-level statistics in the instance normalization layers. In this study, we introduce a Dense Normalization (DN) layer designed to estimate pixel-level statistical moments. This approach effectively diminishes tiling artifacts while concurrently preserving local color and hue contrasts. To address the computational demands of pixel-level estimation, we further propose an efficient interpolation algorithm. Moreover, we invent a parallelism strategy that enables the DN layer to operate in a single pass. Through extensive experiments, we demonstrate that our method surpasses all existing approaches in performance. Notably, our DN layer is hyperparameter-free and can be seamlessly integrated into most unpaired image-to-image translation frameworks without necessitating retraining. Overall, our work paves the way for future exploration in handling images of arbitrary resolutions within the realm of unpaired image-to-image translation. Code is available at: \url{https://github.com/Kaminyou/Dense-Normalization}.
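A stripped-down version of estimating pixel-level moments is to compute mean and variance on a coarse tiling and bilinearly interpolate them back to full resolution before normalizing, as sketched below; the paper's interpolation algorithm and parallelism strategy are more elaborate than this.

```python
# Sketch of per-pixel moment estimation by interpolating patch statistics
# (an approximation of the dense-normalization idea, not the paper's layer).
import torch
import torch.nn.functional as F

def dense_normalize(x, grid=8, eps=1e-5):
    """x: (B, C, H, W). Compute mean/std on a grid x grid tiling, then
    bilinearly upsample them to per-pixel moments and normalize."""
    b, c, h, w = x.shape
    tiles = F.adaptive_avg_pool2d(x, grid)                       # per-tile mean
    sq = F.adaptive_avg_pool2d(x * x, grid)
    var = (sq - tiles * tiles).clamp_min(0.0)
    mu = F.interpolate(tiles, size=(h, w), mode="bilinear", align_corners=False)
    sigma = F.interpolate(var, size=(h, w), mode="bilinear",
                          align_corners=False).sqrt()
    return (x - mu) / (sigma + eps)

x = torch.randn(1, 3, 256, 256)
print(dense_normalize(x).shape)
```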
Adaptive Annealing for Robust Averaging
Sidhartha Chitturi · Venu Madhav Govindu
Graduated Non-Convexity (GNC) or Annealing is a popular technique in robust cost minimization for its ability to converge to good local minima irrespective of initialization. However, the conventional use of a fixed annealing scheme in GNC often leads to a poor efficiency vs accuracy tradeoff. To address it, previous approaches introduced adaptive annealing but lacked scalability for large optimization problems. \textit{Averaging} of pairwise relative observations is one such class of problems, defined on a graph, wherein a large number of variables (nodes) are estimated given the pairwise observations (edges). In this paper, we present a novel adaptive GNC framework tailored for averaging problems in computer vision, operating on vector spaces. Leveraging insights from graph Laplacian matrices inherent in such problems, our approach imparts scalability to the principled GNC framework. Our method demonstrates superior accuracy in vector averaging and translation averaging, while maintaining efficiency comparable to baselines.
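The sketch below shows the general shape of a GNC loop for vector averaging on a graph: an iteratively re-weighted least-squares solve with a Geman-McClure-style kernel whose scale is annealed with a fixed schedule (whereas the paper adapts the schedule using properties of the graph Laplacian). All constants are illustrative.

```python
# Sketch of graduated non-convexity (GNC) for vector averaging on a graph.
import numpy as np

def gnc_vector_averaging(n, edges, z, mu0=10.0, decay=0.7, iters=20):
    """edges: list of (i, j); z[k] is the observed relative measurement x_j - x_i."""
    d = z.shape[1]
    x = np.zeros((n, d))
    mu = mu0
    for _ in range(iters):
        r = np.array([x[j] - x[i] for i, j in edges]) - z        # residuals
        w = (mu / (mu + (r ** 2).sum(-1))) ** 2                  # robust weights
        # Weighted least squares via the weighted graph Laplacian: L x = b.
        L = np.zeros((n, n)); b = np.zeros((n, d))
        for k, (i, j) in enumerate(edges):
            L[i, i] += w[k]; L[j, j] += w[k]
            L[i, j] -= w[k]; L[j, i] -= w[k]
            b[i] -= w[k] * z[k]; b[j] += w[k] * z[k]
        L[0, 0] += 1.0                   # gauge fix: softly anchor node 0 at 0
        x = np.linalg.solve(L, b)
        mu = max(1.0, mu * decay)        # anneal toward the non-convex kernel
    return x

edges = [(0, 1), (1, 2), (0, 2)]
z = np.array([[1.0], [1.0], [5.0]])      # the last edge is an outlier
print(gnc_vector_averaging(3, edges, z))
```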
Resolving Scale Ambiguity in Multi-view 3D Reconstruction using Dual-Pixel Sensors
Kohei Ashida · Hiroaki Santo · Fumio Okura · Yasuyuki Matsushita
Multi-view 3D reconstruction, namely structure-from-motion and multi-view stereo, is an essential component in 3D computer vision. In general, multi-view 3D reconstruction suffers from unknown scale ambiguity unless a reference object of known size is recorded together with the scene, or the camera poses are pre-calibrated. In this paper, we show that multi-view images recorded by a dual-pixel (DP) sensor allow us to automatically resolve the scale ambiguity without requiring a reference object or pre-calibration. Specifically, the observed defocus blurs in DP images provide sufficient information for determining the scale when paired together with the depth maps (up to scale) recovered from the multi-view 3D reconstruction. Based on this observation, we develop a simple yet effective linear solution method to determine the absolute scale in multi-view 3D reconstruction. Experiments demonstrate the effectiveness of the proposed method with diverse scenes recorded with different cameras/lenses. Our code and data will be released upon acceptance.
We address the problem of reconstructing 3D line segments along with line tracks from multiple views with known camera poses. The basic pipeline is first generating 3D line segment proposals for each 2D line segment, then selecting the best proposals, merging them to produce 3D line segments and line tracks, and finally performing non-linear optimization. Our key contributions are focused on exploring and alleviating the inconsistency problems in classical approaches. In the best proposal selection, we analyze the inherent inconsistency problem of support relationships from 2D to 3D determined during proposal evaluation using multiple views and propose an iterative algorithm to handle it. In line track building, we impose 2D collinearity constraints to enhance the consistency of the elements in each line track. In optimization, we introduce coplanarity constraints and jointly optimize points, lines, planes, and vanishing points, enhancing the consistency of the structure of the line map. Experimental results demonstrate that our emphasis on consistency enables our line maps to achieve state-of-the-art completeness and accuracy, while also generating longer and more robust line tracks. The full implementation of our work will be released.
Robust Incremental Structure-from-Motion with Hybrid Features
Shaohui Liu · Yidan Gao · Tianyi Zhang · Rémi Pautrat · Johannes L Schönberger · Viktor Larsson · Marc Pollefeys
Structure-from-Motion (SfM) has become a ubiquitous tool for camera calibration and scene reconstruction with many downstream applications in computer vision and beyond. While the state-of-the-art SfM pipelines have reached a high level of maturity in well-textured and well-configured scenes over the last decades, they still fall short of robustly solving the SfM problem in challenging scenarios. In particular, weakly textured scenes and poorly constrained configurations oftentimes cause catastrophic failures or large errors for the primarily keypoint-based pipelines. In these scenarios, line segments are often abundant and can offer complementary geometric constraints. Their large spatial extent and typically structured configurations lead to stronger geometric constraints as compared to traditional keypoint-based methods. In this work, we introduce an incremental SfM system that, in addition to points, leverages lines and their structured geometric relations. Our technical contributions span the entire pipeline (mapping, triangulation, registration) and we integrate these into a comprehensive end-to-end SfM system that we share as an open-source software with the community. We also present the first analytical method to propagate uncertainties for 3D optimized lines via sensitivity analysis. Experiments show that our system is consistently more robust and accurate compared to the widely used point-based state of the art in SfM -- achieving richer maps and more precise camera registrations, especially under challenging conditions. In addition, our uncertainty-aware localization module alone is able to consistently improve over the state of the art under both point-alone and hybrid setups.
Gravity-aligned Rotation Averaging with Circular Regression
Linfei Pan · Marc Pollefeys · Daniel Barath
Reconstructing a 3D scene from unordered images is pivotal in computer vision and robotics, with applications spanning crowd-sourced mapping and beyond. While global Structure-from-Motion (SfM) techniques are scalable and fast, they often compromise on accuracy. To address this, we introduce a principled approach that integrates gravity direction into the rotation averaging phase of global pipelines, enhancing camera orientation accuracy and reducing the degrees of freedom. This additional information is commonly available in recent consumer devices, such as smartphones, mixed-reality devices and drones, making the proposed method readily accessible. Rooted in circular regression, our algorithm has similar convergence guarantees as linear regression. It also supports scenarios where only a subset of cameras have known gravity. Additionally, we propose a mechanism to refine error-prone gravity. We achieve state-of-the-art accuracy on four large-scale datasets. Particularly, the proposed method improves upon the SfM baseline by 13 AUC@$1^\circ$ points, on average, while running eight times faster. It also outperforms the standard planar pose graph optimization technique by 23 AUC@$1^\circ$ points. The code will be made publicly available.
GeoCalib: Learning Single-image Calibration with Geometric Optimization
Alexander Veicht · Paul-Edouard Sarlin · Philipp Lindenberger · Marc Pollefeys
From a single image, visual cues can help deduce intrinsic and extrinsic camera parameters like the focal length and the gravity direction. This single-image calibration can benefit various downstream applications like image editing and 3D mapping. Current approaches to this problem are based on either classical geometry with lines and vanishing points or on deep neural networks trained end-to-end. The learned approaches are more robust but struggle to generalize to new environments and are less accurate than their classical counterparts. We hypothesize that they lack the constraints that 3D geometry provides. In this work, we introduce GeoCalib, a deep neural network that leverages universal rules of 3D geometry through an optimization process. GeoCalib is trained end-to-end to estimate camera parameters and learns to find useful visual cues from the data. Experiments on various benchmarks show that GeoCalib is more robust and more accurate than existing classical and learned approaches. Its internal optimization estimates uncertainties, which help flag failure cases and benefit downstream applications like visual localization.
Real-time Holistic Robot Pose Estimation with Unknown States
Shikun Ban · Juling Fan · Xiaoxuan Ma · Wentao Zhu · YU QIAO · Yizhou Wang
Estimating robot pose from RGB images is a crucial problem in computer vision and robotics. While previous methods have achieved promising performance, most of them presume full knowledge of robot internal states, e.g., ground-truth robot joint angles. However, this assumption is not always valid in practical situations. In real-world applications such as multi-robot collaboration or human-robot interaction, the robot joint states might not be shared or could be unreliable. On the other hand, existing approaches that estimate robot pose without joint state priors suffer from heavy computation burdens and thus cannot support real-time applications. This work introduces an efficient framework for real-time robot pose estimation from RGB images without requiring known robot states. Our method estimates camera-to-robot rotation, robot state parameters, keypoint locations, and root depth, employing a neural network module for each task to facilitate learning and sim-to-real transfer. Notably, it achieves inference in a single feed-forward pass without iterative optimization. Our approach offers a 12$\times$ speed increase with state-of-the-art accuracy, enabling real-time holistic robot pose estimation for the first time.
Learning Neural Volumetric Pose Features for Camera Localization
Jingyu Lin · Jiaqi Gu · Bojian Wu · Lubin Fan · Renjie Chen · Ligang Liu · Jieping Ye
We introduce a novel neural volumetric pose feature, termed PoseMap, designed to enhance camera localization by encapsulating the information between images and the associated camera poses. Our framework leverages an Absolute Pose Regression (APR) architecture, together with an augmented NeRF module. This integration not only facilitates the generation of novel views to enrich the training dataset but also enables the learning of effective pose features. Additionally, we extend our architecture for self-supervised online alignment, allowing our method to be used and fine-tuned for unlabelled images within a unified framework. Experiments demonstrate that our method achieves 14.28% and 20.51% performance gain on average in indoor and outdoor benchmark scenes, outperforming existing APR methods with state-of-the-art accuracy.
LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation
Ruida Zhang · Ziqin Huang · Gu Wang · Chenyangguang Zhang · Yan Di · Xingxing Zuo · Jiwen Tang · Xiangyang Ji
While RGBD-based methods for category-level object pose estimation hold promise, their reliance on depth data limits their applicability in diverse scenarios. In response, recent efforts have turned to RGB-based methods; however, they face significant challenges stemming from the absence of depth information. On one hand, the lack of depth exacerbates the difficulty in handling intra-class shape variation, resulting in increased uncertainty in shape predictions. On the other hand, RGB-only inputs introduce inherent scale ambiguity, rendering the estimation of object size and translation an ill-posed problem. To tackle these challenges, we propose LaPose, a novel framework that models the object shape as the Laplacian mixture model for Pose estimation. By representing each point as a probabilistic distribution, we explicitly quantify the shape uncertainty. LaPose leverages both a generalized 3D information stream and a specialized feature stream to independently predict the Laplacian distribution for each point, capturing different aspects of object geometry. These two distributions are then integrated as a Laplacian mixture model to establish the 2D-3D correspondences, which are utilized to solve the pose via the PnP module. In order to mitigate scale ambiguity, we introduce a scale-agnostic representation for object size and translation, enhancing training efficiency and overall robustness. Extensive experiments on the NOCS datasets validate the effectiveness of LaPose, yielding state-of-the-art performance in RGB-based category-level object pose estimation. Codes and trained models will be released.
SCAPE: A Simple and Strong Category-Agnostic Pose Estimator
Yujia Liang · Zixuan Ye · Wenze Liu · Hao Lu
Category-Agnostic Pose Estimation (CAPE) aims to localize keypoints on an object of any category given a few exemplars in an in-context manner. Prior arts involve sophisticated designs, e.g., sundry modules for similarity calculation and a two-stage framework, or take in extra heatmap generation and supervision. We notice that CAPE is essentially a feature-matching task, which can be solved within the attention process. Therefore, we first streamline the architecture into a simple baseline consisting of several pure self-attention layers and an MLP regression head---this simplification means that one only needs to consider the attention quality to boost the performance of CAPE. Towards an effective attention process for CAPE, we further introduce two key modules: i) a global keypoint feature perceptor to inject global semantic information into support keypoints, and ii) a keypoint attention refiner to enhance inter-node correlation between keypoints. They jointly form a Simple and strong Category-Agnostic Pose Estimator (SCAPE). Experimental results show that SCAPE outperforms prior arts by 2.2 and 1.3 PCK under 1-shot and 5-shot settings with faster inference speed and lighter model capacity, excelling in both accuracy and efficiency. Code will be open-sourced.
Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation
Yuchen Yang · Yu Qiao · Xiao Sun
Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision. Supervised approaches heavily rely on laborious annotations and exhibit hampered generalization due to the limited diversity of 3D pose datasets. To address these challenges, we propose a unified framework that leverages mask as supervision for unsupervised 3D pose estimation. With general unsupervised segmentation algorithms, the proposed model employs skeleton and physique representations that exploit accurate pose information from coarse to fine. Compared with previous unsupervised approaches, we organize the human skeleton in a fully unsupervised way, which enables the processing of annotation-free data and provides ready-to-use estimation results. Comprehensive experiments demonstrate our state-of-the-art pose estimation performance on the Human3.6M and MPI-INF-3DHP datasets. Further experiments on in-the-wild datasets also illustrate the capability of leveraging additional data to boost our model. The code will be publicly available upon publication.
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
Vandad Davoodnia · Saeed Ghorbani · Marc-André Carbonneau · Alexandre Messier · Ali Etemad
We introduce UPose3D, a novel approach for multi-view 3D human pose estimation, addressing challenges in accuracy and scalability. Our method advances existing pose estimation frameworks by improving robustness and flexibility without requiring direct 3D annotations. At the core of our method, a pose compiler module refines predictions from a 2D keypoints estimator that operates on a single image by leveraging temporal and cross-view information. Our novel cross-view fusion strategy is scalable to any number of cameras, while our synthetic data generation strategy ensures generalization across diverse actors, scenes, and viewpoints. Finally, UPose3D leverages the prediction uncertainty of both the 2D keypoint estimator and the pose compiler module. This provides robustness to outliers and noisy data, resulting in state-of-the-art performance in out-of-distribution settings. In addition, for in-distribution settings, UPose3D yields a performance rivaling methods that rely on 3D annotated data, while being the state-of-the-art among methods relying only on 2D supervision.
Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses
Yongwei Nie · Changzhen Liu · Chengjiang Long · Qing Zhang · Guiqing Li · Hongmin Cai
Besides a 3D mesh, Human Mesh Recovery (HMR) methods usually need to estimate a camera for computing the 2D reprojection loss. Previous approaches may encounter the following problem: neither the mesh nor the camera is correct, but their combination can still yield a low reprojection loss. To alleviate this problem, we define multiple RoIs (regions of interest) containing the same human and propose a multiple-RoI-based HMR method. Our key idea is that with multiple RoIs as input, we can estimate multiple local cameras and have the opportunity to design and apply additional constraints between cameras to improve the accuracy of the cameras and, in turn, the accuracy of the corresponding 3D mesh. To implement this idea, we propose a RoI-aware feature fusion network by which we estimate a 3D mesh shared by all RoIs as well as local cameras corresponding to the RoIs. We observe that local cameras can be converted to the camera of the full image, through which we construct a local camera consistency loss as the additional constraint imposed on local cameras. Another benefit of introducing multiple RoIs is that we can encapsulate our network into a contrastive learning framework and apply a contrastive loss to regularize the training of our network. Experiments demonstrate the effectiveness of our multi-RoI HMR method and its superiority to recent prior arts.
MLPHand: Real Time Multi-View 3D Hand Reconstruction via MLP Modeling
Jian Yang · Jiakun Li · Guoming Li · Huaiyu Wu · Zhen Shen · Zhaoxin Fan
Multi-view hand reconstruction is a critical task for applications in virtual reality and human-computer interaction, but it remains a formidable challenge. Although existing multi-view hand reconstruction methods achieve remarkable accuracy, they typically come with an intensive computational burden that hinders real-time inference. To this end, we propose MLPHand, a novel method designed for real-time multi-view single hand reconstruction. MLPHand consists of two primary modules: (1) a lightweight MLP-based Skeleton2Mesh model that efficiently recovers hand meshes from hand skeletons, and (2) a multi-view geometry feature fusion prediction module that enhances the Skeleton2Mesh model with detailed geometric information from multiple views. Experiments on three widely used datasets demonstrate that MLPHand can reduce computational complexity by 90% while achieving comparable reconstruction accuracy to existing state-of-the-art baselines.
WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation
Tianjian Jiang · Johsan Billingham · Sebastian Müksch · Juan J Zarate · Nicolas Evans · Martin R. Oswald · Marc Pollefeys · Otmar Hilliges · Manuel Kaufmann · Jie Song
We present WorldPose, a novel dataset for advancing research in multi-person, global pose estimation in the wild, featuring footage from the 2022 FIFA World Cup. While previous datasets have primarily focused on local poses, often limited to a single person or in constrained, indoor settings, the infrastructure deployed for this sporting event allows access to multiple fixed and mobile cameras in different stadiums. We exploit the static multi-view setup of HD cameras to recover the 3D player poses and motions with unprecedented accuracy given capture areas of more than 1.75 acres (7k sqm). We then leverage the captured players' motions and field markings to calibrate a moving broadcasting camera. The resulting dataset comprises 88 sequences with more than 2.5 million 3D poses and a total traveling distance of over 120 km. Subsequently, we conduct an in-depth analysis of the SOTA methods for global pose estimation. Our experiments demonstrate that WorldPose challenges existing multi-person techniques, supporting the potential for new research in this area and others, such as sports analysis. All pose annotations (in SMPL format), broadcasting camera parameters and footage will be released for academic research purposes.
RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency
Ziming Sun · Yuan Liang · Zejun Ma · Tianle Zhang · Linchao Bao · Guiqing Li · Shengfeng He
We introduce RePOSE, a simple yet effective approach for addressing occlusion challenges in the learning of 3D human pose estimation (HPE) from videos. Conventional approaches typically employ absolute depth signals as supervision, which are adept at discernible keypoints but become less reliable when keypoints are occluded, resulting in vague and inconsistent learning trajectories for the neural network. RePOSE overcomes this limitation by introducing spatio-temporal relational depth consistency into the supervision signals. The core rationale of our method lies in prioritizing the precise sequencing of occluded keypoints. This is achieved by using a relative depth consistency loss that operates in both spatial and temporal domains. By doing so, RePOSE shifts the focus from learning absolute depth values, which can be misleading in occluded scenarios, to relative positioning, which provides a more robust and reliable cue for accurate pose estimation. This subtle yet crucial shift facilitates more consistent and accurate 3D HPE under occlusion conditions. The elegance of our core idea lies in its simplicity and ease of implementation, requiring only a few lines of code. Extensive experiments validate that RePOSE not only outperforms existing state-of-the-art methods but also significantly enhances the robustness and precision of 3D HPE in challenging occluded environments.
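A minimal form of a relative depth consistency term is to supervise signed pairwise depth differences between joints rather than absolute depths, as sketched below; the paper applies such consistency in both the spatial and temporal domains.

```python
# Sketch of a relative (pairwise) depth-consistency loss over keypoints
# (illustrative; not RePOSE's exact loss).
import torch

def relative_depth_loss(pred_z, gt_z):
    """pred_z, gt_z: (B, J) per-joint depths. Penalize disagreement in the
    signed pairwise depth differences instead of absolute values."""
    pred_diff = pred_z.unsqueeze(2) - pred_z.unsqueeze(1)   # (B, J, J)
    gt_diff = gt_z.unsqueeze(2) - gt_z.unsqueeze(1)
    return torch.nn.functional.smooth_l1_loss(pred_diff, gt_diff)

pred = torch.randn(4, 17, requires_grad=True)   # 17-joint skeleton, batch of 4
gt = torch.randn(4, 17)
relative_depth_loss(pred, gt).backward()
```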
An Economic Framework for 6-DoF Grasp Detection
Xiao-Ming Wu · Jia-Feng Cai · Jian-Jian Jiang · Dian Zheng · Yi-Lin Wei · WEISHI ZHENG
Robotic grasping in clutter is a fundamental task in robotic manipulation. In this work, we propose an economic framework for 6-DoF grasp detection, aiming to economize the resource cost in training while maintaining effective grasp performance. To begin with, we discover that dense supervision is the bottleneck that severely inflates the training overhead while also making the training difficult to converge. To solve this problem, we first propose an economic supervision paradigm for efficient and effective grasping. This paradigm includes a well-designed supervision selection strategy, selecting key labels basically without ambiguity, and an economic pipeline to enable training after selection. Furthermore, benefiting from the economic supervision, we can focus on a specific grasp, and thus we devise a focal representation module, which comprises an interactive grasp head and a composite score estimation to generate the specific grasp more accurately. Combining all of these together, the EconomicGrasp framework is proposed. Our extensive experiments show that EconomicGrasp surpasses the SOTA grasp method by about 3 AP on average, at extremely low resource cost: about 1/4 of the training time, 1/8 of the memory, and 1/30 of the storage. Our code will be published.
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
Kailin Li · Jingbo Wang · Lixin Yang · Cewu Lu · Bo Dai
Generating natural human grasps necessitates consideration of not just object geometry but also semantic information. Solely depending on object shape for grasp generation confines the applications of prior methods in downstream tasks. This paper presents a novel semantic-based grasp generation method, termed SemGrasp, which generates a static human grasp pose by incorporating semantic information into the grasp representation. We introduce a discrete representation that aligns the grasp space with semantic space, enabling the generation of grasp postures in accordance with language instructions. A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space. To facilitate the training of SemGrasp, we compile a large-scale, grasp-text-aligned dataset named CapGrasp, featuring over 300k detailed captions and 50k diverse grasps. Experimental findings demonstrate that SemGrasp efficiently generates natural human grasps in alignment with linguistic intentions. Our code, models, and dataset are available publicly at: https://kailinli.github.io/SemGrasp.
FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation
Jingyi Tang · Gu Wang · Zeyu Chen · Shengquan Li · Xiu Li · Xiangyang Ji
Although methods for estimating the pose of objects in indoor scenes have achieved great success, the pose estimation of underwater objects remains challenging due to difficulties brought by the complex underwater environment, such as degraded illumination, blurring, and the substantial cost of obtaining real annotations. To this end, we introduce FAFA, a Frequency-Aware Flow-Aided self-supervised framework for 6D pose estimation of unmanned underwater vehicles (UUVs). Essentially, we first train a frequency-aware flow-based pose estimator on synthetic data, where an FFT-based augmentation approach is proposed to facilitate the network in capturing domain-invariant features and target domain styles from a frequency perspective. Further, we perform self-supervised training by enforcing flow-aided multi-level consistencies to adapt it to the real-world underwater environment. Our framework relies solely on the 3D model and RGB images, alleviating the need for any real pose annotations or other-modality data like depths. We evaluate the effectiveness of FAFA on common underwater object pose benchmarks and showcase significant performance improvements compared to state-of-the-art methods. Our code will be made publicly available.
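FFT-based style augmentation of the kind referenced above is commonly implemented by swapping the low-frequency amplitude spectrum of a synthetic image with that of a target-domain image while keeping the phase; the snippet below is a generic variant of this idea, not the paper's exact augmentation.

```python
# Sketch of FFT-based style augmentation: swap the low-frequency amplitude of a
# synthetic image with a real image's, keep the phase (generic variant).
import torch

def fft_amplitude_swap(src, tgt, beta=0.05):
    """src, tgt: (C, H, W) float tensors. Returns src restyled with tgt's
    low-frequency amplitude spectrum."""
    fft_src = torch.fft.fft2(src)
    fft_tgt = torch.fft.fft2(tgt)
    amp_src, pha_src = fft_src.abs(), fft_src.angle()
    amp_tgt = fft_tgt.abs()

    _, h, w = src.shape
    bh, bw = max(1, int(beta * h)), max(1, int(beta * w))
    # Replace the (centered) low-frequency band of the source amplitude.
    amp_src = torch.fft.fftshift(amp_src, dim=(-2, -1))
    amp_tgt = torch.fft.fftshift(amp_tgt, dim=(-2, -1))
    cy, cx = h // 2, w // 2
    amp_src[:, cy - bh:cy + bh, cx - bw:cx + bw] = \
        amp_tgt[:, cy - bh:cy + bh, cx - bw:cx + bw]
    amp_src = torch.fft.ifftshift(amp_src, dim=(-2, -1))

    out = torch.fft.ifft2(amp_src * torch.exp(1j * pha_src))
    return out.real

syn = torch.rand(3, 128, 128)     # stand-in for a rendered synthetic image
real = torch.rand(3, 128, 128)    # stand-in for a real underwater image
print(fft_amplitude_swap(syn, real).shape)
```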
OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations
Yiming Zuo · Jia Deng
Depth completion is the task of generating a dense depth map given an image and a sparse depth map as inputs. It has important applications in various downstream tasks. In this paper, we present OGNI-DC, a novel framework for depth completion. The key to our method is "Optimization-Guided Neural Iterations" (OGNI). It consists of a recurrent unit that refines a depth gradient field and a differentiable depth integrator that integrates the depth gradients into a depth map. OGNI-DC exhibits strong generalization, outperforming baselines by a large margin on unseen datasets and across various sparsity levels. Moreover, OGNI-DC has high accuracy, achieving state-of-the-art performance on the NYUv2 and the KITTI benchmarks. Code is attached for reviewing and will be released if the paper is accepted.
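As a rough picture of a depth integrator, the sketch below recovers a dense depth map whose finite-difference gradients match a given gradient field while agreeing with sparse depth measurements, using a simple least-squares objective minimized with a few Adam steps; OGNI-DC's recurrent refinement and differentiable integrator are considerably more sophisticated.

```python
# Sketch: integrate a depth gradient field into a depth map with sparse anchors
# (a simple least-squares stand-in for the idea of depth integration).
import torch

def integrate_depth(grad_x, grad_y, sparse_depth, mask, iters=200, lam=10.0):
    """grad_x, grad_y: (H, W) target depth gradients; sparse_depth, mask: (H, W)."""
    depth = torch.zeros_like(sparse_depth, requires_grad=True)
    opt = torch.optim.Adam([depth], lr=0.1)
    for _ in range(iters):
        dx = depth[:, 1:] - depth[:, :-1]
        dy = depth[1:, :] - depth[:-1, :]
        loss = ((dx - grad_x[:, :-1]) ** 2).mean() + \
               ((dy - grad_y[:-1, :]) ** 2).mean() + \
               lam * (mask * (depth - sparse_depth) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return depth.detach()

h, w = 64, 64
gx, gy = torch.zeros(h, w), torch.zeros(h, w)   # flat gradient field
sparse = torch.full((h, w), 2.0)                # all anchors at 2 m
mask = (torch.rand(h, w) < 0.05).float()        # ~5% valid sparse measurements
print(integrate_depth(gx, gy, sparse, mask).mean())
```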
ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion
Sungmin Woo · Wonjoon Lee · Woo Jin Kim · Dogyoon Lee · Sangyoun Lee
Self-supervised multi-frame monocular depth estimation relies on the geometric consistency between successive frames under the assumption of a static scene. However, the presence of moving objects in dynamic scenes introduces inevitable inconsistencies, causing misaligned multi-frame feature matching and misleading self-supervision during training. In this paper, we propose a novel framework called ProDepth, which effectively addresses the mismatch problem caused by dynamic objects using a probabilistic approach. We initially deduce the uncertainty associated with static scene assumption by adopting an auxiliary decoder. This decoder analyzes inconsistencies embedded in the cost volume, inferring the probability of areas being dynamic. We then directly rectify the erroneous cost volume for dynamic areas through a Probabilistic Cost Volume Modulation (PCVM) module. Specifically, we derive probability distributions of depth candidates from both single-frame and multi-frame cues, modulating the cost volume by adaptively fusing those distributions based on the inferred uncertainty. Additionally, we present a self-supervision loss reweighting strategy that not only masks out incorrect supervision with high uncertainty but also mitigates the risks in remaining possible dynamic areas in accordance with the probability. Our proposed method excels over state-of-the-art approaches in all metrics on both Cityscapes and KITTI datasets, and demonstrates superior generalization ability on the Waymo Open dataset. Code and pretrained models will be made publicly available.
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion
Bohan Li · Jiajun Deng · Wenyao Zhang · Zhujin Liang · Dalong Du · Xin Jin · Wenjun Zeng
Camera-based 3D semantic scene completion (SSC) is pivotal for predicting complicated 3D layouts with limited 2D image observations. Existing mainstream solutions generally leverage temporal information by roughly stacking history frames to supplement the current frame; such straightforward temporal modeling inevitably diminishes valid clues and increases learning difficulty. To address this problem, we present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion. The primary innovation of this work is to decompose temporal context learning into two hierarchical steps: (a) cross-frame affinity measurement and (b) affinity-based dynamic refinement. Firstly, to separate critical relevant context from redundant information, we introduce a pattern affinity with scale-aware isolation and multiple independent learners for fine-grained contextual correspondence modeling. Subsequently, to dynamically compensate for incomplete observations, we adaptively refine the feature sampling locations based on initially identified locations with high affinity and their neighboring relevant regions. Our method ranks $1^{st}$ on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU on the OpenOccupancy benchmark. Our code is available in the supplementary material.
SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning
Runmin Zhang · Jun Ma · Lun Luo · Beinan Yu · Shu-Jie Chen · Junwei Li · Hui-Liang Shen · Siyuan Cao
We propose a novel unsupervised cross-modal homography estimation framework based on intra-modal self-supervised learning, correlation, and consistent feature map projection, namely SCPNet. The concept of intra-modal self-supervised learning is first presented to facilitate unsupervised cross-modal homography estimation. The correlation-based homography estimation network and the consistent feature map projection are combined to form the learnable architecture of SCPNet, boosting the unsupervised learning framework. SCPNet is the first to achieve effective unsupervised homography estimation on the satellite-map image pair cross-modal dataset, GoogleMap, under a [-32,+32] offset on 128×128 images, outperforming the supervised approach MHN by 14.0% in mean average corner error (MACE). We further conduct extensive experiments on several cross-modal/spectral and manually-made inconsistent datasets, on which SCPNet achieves state-of-the-art (SOTA) performance among unsupervised approaches and achieves 49.0%, 25.2%, 36.4%, and 10.7% lower MACEs than the supervised approach MHN. Source code will be available upon acceptance.
Reinforcement Learning Meets Visual Odometry
Nico Messikommer · Giovanni Cioffi · Mathias Gehrig · Davide Scaramuzza
Visual Odometry (VO) is essential to downstream mobile robotics and augmented/virtual reality tasks. Despite recent advances, existing VO methods still rely on heuristic design choices that require several weeks of hyperparameter tuning by human experts, hindering generalizability and robustness. We address these challenges by reframing VO as a sequential decision-making task and applying Reinforcement Learning (RL) to adapt the VO process dynamically. Our approach introduces a neural network, operating as an agent within the VO pipeline, to make decisions such as keyframe and grid-size selection based on real-time conditions. Our method minimizes reliance on heuristic choices using a reward function based on pose error, runtime, and other metrics to guide the system. Our RL framework treats the VO system and the image sequence as an environment, with the agent receiving observations from keypoints, map statistics, and prior poses. Experimental results using classical VO methods and public benchmarks demonstrate improvements in accuracy and robustness, validating the generalizability of our RL-enhanced VO approach to different scenarios. We believe this paradigm shift advances VO technology by eliminating the need for time-intensive parameter tuning of heuristics. We will release our code upon acceptance.
Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization
Qi Zhang · Kaiyi Zhang · Antoni Chan · Hui Huang
Multi-view crowd localization predicts the ground locations of all people in the scene. Typical methods usually estimate the crowd density maps on the ground plane first, and then obtain the crowd locations. However, existing methods' performance is limited by the ambiguity of the density maps in crowded areas, where local peaks can be smoothed away. To mitigate the weakness of density map supervision, optimal transport-based point supervision methods have been proposed in the single-image crowd localization tasks, but have not been explored for multi-view crowd localization yet. Thus, in this paper, we propose a novel Mahalanobis distance-based multi-view optimal transport (M-MVOT) loss specifically designed for multi-view crowd localization. First, we replace the Euclidean-based transport cost with the Mahalanobis distance, which defines elliptical iso-contours in the cost function whose long-axis and short-axis directions are guided by the view ray direction. Second, the object-to-camera distance in each view is used to adjust the optimal transport cost of each location further, where the wrong predictions far away from the camera are more heavily penalized. Finally, we propose a strategy to consider all the input camera views in the model loss (M-MVOT) by computing the optimal transport cost for each ground-truth point based on its closest camera. Experiments demonstrate the advantage of the proposed method over density map-based or common Euclidean distance-based optimal transport loss on several multi-view crowd localization datasets.
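The Mahalanobis cost can be pictured as below: for each ground-truth point, an inverse covariance is built from the view-ray direction (long axis along the ray, short axis across it) and the resulting cost is further scaled by the object-to-camera distance. All weights and scales in this sketch are arbitrary illustrations, not the paper's settings.

```python
# Sketch of a view-ray-guided Mahalanobis transport cost on the ground plane.
import torch

def mahalanobis_cost(preds, gts, cam_xy, sigma_long=4.0, sigma_short=1.0):
    """preds: (N, 2) predicted ground points, gts: (M, 2) GT points,
    cam_xy: (2,) camera position projected onto the ground plane."""
    rays = gts - cam_xy                                   # view ray per GT point
    dist = rays.norm(dim=-1, keepdim=True) + 1e-8
    u = rays / dist                                       # long-axis direction
    v = torch.stack([-u[:, 1], u[:, 0]], dim=-1)          # short-axis direction
    # Per-GT inverse covariance: lenient along the ray, strict across it.
    Minv = (u.unsqueeze(-1) @ u.unsqueeze(-2)) / sigma_long ** 2 + \
           (v.unsqueeze(-1) @ v.unsqueeze(-2)) / sigma_short ** 2
    scale = 1.0 + dist / dist.mean()                      # distance-based weight
    diff = preds.unsqueeze(1) - gts.unsqueeze(0)          # (N, M, 2)
    cost = torch.einsum('nmi,mij,nmj->nm', diff, Minv, diff)
    return cost * scale.squeeze(-1)                       # (N, M) cost matrix

preds = torch.rand(50, 2) * 10
gts = torch.rand(40, 2) * 10
cam = torch.tensor([0.0, 0.0])
print(mahalanobis_cost(preds, gts, cam).shape)
```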
Camera-LiDAR Cross-modality Gait Recognition
Wenxuan Guo · Yingping Liang · Zhiyu Pan · Ziheng Xi · Jianjiang Feng · Jie Zhou
Gait recognition is a crucial biometric identification technique. Camera-based gait recognition has been widely applied in both research and industrial fields. Owing to the 3D structural information it provides, LiDAR-based gait recognition has also begun to evolve recently. However, in certain applications cameras fail to recognize persons, such as in low-light environments and long-distance recognition scenarios, where LiDARs work well. On the other hand, the deployment cost and complexity of LiDAR systems limit their wider application. Therefore, it is essential to consider cross-modality gait recognition between cameras and LiDARs for a broader range of applications. In this work, we propose the first cross-modality gait recognition framework between camera and LiDAR, namely CL-Gait. It employs a two-stream network for feature embedding of both modalities. This poses a challenging recognition task due to the inherent need to match 3D and 2D data that exhibit a significant modality discrepancy. To align the feature spaces of the two modalities, i.e., camera silhouettes and LiDAR points, we propose a contrastive pre-training strategy to mitigate the modality discrepancy. To make up for the absence of paired camera-LiDAR data, we also introduce a strategy for generating data at a large scale. This strategy utilizes monocular depth estimated from single RGB images and virtual cameras to generate pseudo point clouds for contrastive pre-training. Extensive experiments show that cross-modality gait recognition is very challenging but remains feasible and promising with our proposed model and pre-training strategy. To the best of our knowledge, this is the first work to address cross-modality gait recognition. The code and dataset are available at https://github.com/GWxuan/CL-Gait.
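The pseudo point cloud generation step boils down to back-projecting an estimated depth map through a virtual pinhole camera. A minimal sketch is given below; the function name, intrinsics, and optional silhouette mask are illustrative assumptions and do not reflect the authors' exact pipeline.

```python
import numpy as np

def depth_to_pseudo_points(depth, fx, fy, cx, cy, mask=None):
    """Back-project an (H, W) monocular depth map through a virtual pinhole
    camera into a pseudo point cloud; `mask` optionally keeps only
    person-silhouette pixels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    if mask is not None:
        points = points[mask.reshape(-1) > 0]
    return points

# Toy example: a flat 4x4 depth map two metres in front of the virtual camera.
points = depth_to_pseudo_points(np.full((4, 4), 2.0), fx=500, fy=500, cx=2.0, cy=2.0)
```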
TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Autonomous Driving
Cheng Zhao · su sun · Ruoyu Wang · Yuliang Guo · Jun-Jun Wan · Zhou Huang · Xinyu Huang · Yingjie Victor Chen · Liu Ren
Most 3D Gaussian Splatting (3D-GS) based methods for urban scenes initialize 3D Gaussians directly with 3D LiDAR points, which not only underutilizes LiDAR data capabilities but also overlooks the potential advantages of fusing LiDAR with camera data. In this paper, we design a novel tightly coupled LiDAR-Camera Gaussian Splatting (TCLC-GS) to fully leverage the combined strengths of both LiDAR and camera sensors, enabling rapid, high-quality 3D reconstruction and novel view RGB/depth synthesis. TCLC-GS designs a hybrid explicit (colorized 3D mesh) and implicit (hierarchical octree feature) 3D representation derived from LiDAR-camera data to enrich the properties of 3D Gaussians for splatting. The 3D Gaussians' properties are not only initialized in alignment with the 3D mesh, which provides more complete 3D shape and color information, but are also endowed with broader contextual information through retrieved octree implicit features. During the Gaussian Splatting optimization process, the 3D mesh offers dense depth information as supervision, which enhances training by encouraging the learning of robust geometry. Comprehensive evaluations conducted on the Waymo Open Dataset and nuScenes Dataset validate our method's state-of-the-art (SOTA) performance. Utilizing a single NVIDIA RTX 3090 Ti, our method demonstrates fast training and achieves real-time RGB and depth rendering at 90 FPS at a resolution of 1920x1280 (Waymo) and 120 FPS at a resolution of 1600x900 (nuScenes) in urban scenarios.
3D Single-object Tracking in Point Clouds with High Temporal Variation
Qiao Wu · Kun Sun · Pei An · Mathieu Salzmann · Yanning Zhang · Jiaqi Yang
The high temporal variation of point clouds is the key challenge of 3D single-object tracking (3D SOT). Existing approaches rely on the assumption that the shape variation of the point clouds and the motion of the objects across neighboring frames are smooth, and thus fail to cope with data exhibiting high temporal variation. In this paper, we present a novel framework for 3D SOT in point clouds with high temporal variation, called HVTrack. HVTrack proposes three novel components to tackle the challenges in the high temporal variation scenario: 1) a Relative-Pose-Aware Memory module to handle temporal point cloud shape variations; 2) a Base-Expansion Feature Cross-Attention module to deal with similar object distractions in expanded search areas; 3) a Contextual Point Guided Self-Attention module for suppressing heavy background noise. We construct a dataset with high temporal variation (KITTI-HV) by sampling the KITTI dataset at different frame intervals. On KITTI-HV with a frame interval of 5, our HVTrack surpasses the state-of-the-art tracker CXTracker by 11.3%/15.7% in Success/Precision.
LISO: Lidar-only Self-Supervised 3D Object Detection
Stefan Baur · Frank Moosmann · Andreas Geiger
3D object detection is one of the most important components in any self-driving stack, but current state-of-the-art (SOTA) lidar object detectors require costly and slow manual annotation of 3D bounding boxes to perform well. Recently, several methods have emerged to generate pseudo ground truth without human supervision; however, all of these methods have various drawbacks: some require sensor rigs with full camera coverage and accurate calibration, partly supplemented by an auxiliary optical flow engine, while others require expensive high-precision localization to find objects that disappeared over multiple drives. We introduce a novel self-supervised method, which we call trajectory-regularized self-training, to train SOTA lidar object detection networks using only unlabeled sequences of lidar point clouds. It utilizes a SOTA self-supervised lidar scene flow network under the hood to generate, track, and iteratively refine pseudo ground truth. We demonstrate the effectiveness of our approach for multiple SOTA object detection networks across multiple real-world datasets. Code will be released upon acceptance of this paper.
MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection
Youngmin Oh · Hyung-Il Kim · Seong Tae Kim · Jung Uk Kim
Monocular 3D object detection is an important and challenging task in autonomous driving. Existing methods have mainly focused on performing 3D detection in ideal weather conditions, characterized by scenarios with clear and optimal visibility. However, autonomous driving requires the ability to handle changing weather conditions (e.g., fog), not just clear weather. To this end, we propose MonoWAD, a novel weather-robust monocular 3D object detector with a weather-adaptive diffusion model. We introduce two components: (1) a weather codebook that memorizes the knowledge of clear weather and generates a weather-reference feature for any input, and (2) a weather-adaptive diffusion model that enhances the representation of the input feature by incorporating the weather-reference feature. The latter serves as an attention mechanism indicating how much improvement the input feature needs according to the weather conditions. For this purpose, we introduce a weather-adaptive enhancement loss to enhance the feature representation under both clear and foggy weather conditions. Extensive experiments on monocular images under various weather conditions show that MonoWAD achieves weather-robust monocular 3D object detection. Code and dataset will be released after the review process is over.
IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception
Shaohong Wang · Lu Bin · Xinyu Xiao · Zhiyu Xiang · Hangguan Shan · Eryun Liu
Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving in recent years. However, current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images. This severely impedes the development of budget-constrained collaborative systems and the exploitation of the advantages offered by the camera modality. This work proposes an instance-level fusion transformer for visual collaborative perception (IFTR), which enhances the detection performance of camera-only collaborative perception systems through the communication and sharing of visual features. To capture the visual information from multiple agents, we design an instance feature aggregation module that interacts with the visual features of individual agents using predefined grid-shaped bird's eye view (BEV) queries, generating more comprehensive and accurate BEV features. Additionally, we devise a cross-domain query adaptation as a heuristic to fuse 2D priors, implicitly encoding the candidate positions of targets. Furthermore, IFTR optimizes communication efficiency by sending instance-level features, achieving an optimal performance-bandwidth trade-off. We evaluate the proposed IFTR on a real dataset, DAIR-V2X, and two simulated datasets, OPV2V and V2XSet, achieving performance improvements of 57.96%, 9.23% and 12.99% in the AP70 metric compared to the previous SOTAs, respectively. Extensive experiments demonstrate the superiority of IFTR and the effectiveness of its key components. The code and model weights will be released.
MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty
Tim Broedermann · David Brüggemann · Christos Sakaridis · Kevin Ta · Odysseas Liagouris · Jason Corkill · Luc Van Gool
Achieving level-5 driving automation in autonomous vehicles necessitates a robust semantic visual perception system capable of parsing data from different sensors across diverse conditions. However, existing semantic perception datasets often lack important non-camera modalities typically used in autonomous vehicles, or they do not exploit such modalities to aid and improve semantic annotations in challenging conditions. To address this, we introduce MUSES, the MUlti-SEnsor Semantic perception dataset for driving in adverse conditions under increased uncertainty. MUSES includes synchronized multimodal recordings with 2D panoptic annotations for 2500 images captured under diverse weather and illumination. The dataset integrates a frame camera, a lidar, a radar, an event camera, and an IMU/GNSS sensor. Our new two-stage panoptic annotation protocol captures both class-level and instance-level uncertainty in the ground truth and enables the novel task of uncertainty-aware panoptic segmentation we introduce, along with standard semantic and panoptic segmentation. MUSES proves both effective for training and challenging for evaluating models under diverse visual conditions, and it opens new avenues for research in multimodal and uncertainty-aware dense semantic perception. Our dataset and benchmark are publicly available at https://muses.vision.ee.ethz.ch/
Reliability in Semantic Segmentation: Can We Use Synthetic Data?
Thibaut Loiseau · Tuan Hung Vu · Mickael Chen · Patrick Pérez · MATTHIEU CORD
Assessing the robustness of perception models to covariate shifts and their ability to detect out-of-distribution (OOD) inputs is crucial for safety-critical applications such as autonomous vehicles. By nature of such applications, however, the relevant data is difficult to collect and annotate. In this paper, we show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models. By fine-tuning Stable Diffusion with only in-domain data, we perform zero-shot generation of visual scenes in OOD domains or inpainted with OOD objects. This synthetic data is employed to evaluate the robustness of pretrained segmenters, thereby offering insights into their performance when confronted with real edge cases. Through extensive experiments, we demonstrate a high correlation between the performance of models when evaluated on our synthetic OOD data and when evaluated on real OOD inputs, showing the relevance of such virtual testing. Furthermore, we demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control
Yuru Jia · Lukas Hoyer · Shengyu Huang · Tianfu Wang · Luc Van Gool · Konrad Schindler · Anton Obukhov
Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding "yes". We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain. Second, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Third, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently increases the performance of several domain generalization methods compared to the previous state-of-the-art methods. The source code and the generated dataset will be released with the paper.
Fully Sparse 3D Occupancy Prediction
Haisong Liu · Yang Chen · Haiguang Wang · Zetong Yang · Tianyu Li · Jia Zeng · Li Chen · Hongyang Li · Limin Wang
Occupancy prediction plays a pivotal role in autonomous driving. Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and suffering high computational costs. To bridge this gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. SparseOcc initially reconstructs a sparse 3D representation from visual inputs and subsequently predicts semantic/instance occupancy from this sparse representation using sparse queries. A mask-guided sparse sampling is designed to enable sparse queries to interact with 2D features in a fully sparse manner, thereby circumventing costly dense features or global attention. Additionally, we design a thoughtful ray-based evaluation metric, namely RayIoU, to resolve the inconsistent penalty along depth that arises in the traditional voxel-level mIoU criterion. SparseOcc demonstrates its effectiveness by achieving a RayIoU of 34.0 while maintaining a real-time inference speed of 17.3 FPS, with 7 history frames as input. By incorporating more preceding frames (up to 15), SparseOcc continuously improves its performance to 35.1 RayIoU without bells and whistles. Code will be made available.
EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding
Wenhua Wu · Qi Wang · Guangming Wang · Junping Wang · Tiankun Zhao · Yang Liu · Dongchao Gao · Zhe Liu · Hesheng Wang
Road surface reconstruction plays a vital role in autonomous driving systems, enabling road lane perception and high-precision mapping. Recently, neural implicit encoding has achieved remarkable results in scene representation, particularly in the realistic rendering of scene textures. However, it faces challenges in directly representing geometric information for large-scale scenes. To address this, we propose EMIE-MAP, a novel method for large-scale road surface reconstruction based on an explicit mesh and implicit encoding. The road geometry is represented using an explicit mesh, where each vertex stores an implicit encoding representing the color and semantic information. To overcome the difficulty of optimizing road elevation, we introduce a trajectory-based elevation initialization and an elevation residual learning method based on a Multi-Layer Perceptron (MLP). Additionally, by employing implicit encoding and multi-camera color MLP decoding, we achieve separate modeling of scene physical properties and camera characteristics, allowing surround-view reconstruction compatible with different camera models. Our method achieves remarkable road surface reconstruction performance in a variety of real-world challenging scenarios.
Continuity Preserving Online CenterLine Graph Learning
Yunhui Han · Kun Yu · Zhiwei Li
Lane topology, which is usually modeled by a centerline graph, is essential for high-level autonomous driving. For a high-quality graph, both the topological connectivity and the spatial continuity of centerline segments are critical. However, most existing approaches pay more attention to connectivity while neglecting continuity. Such centerline graphs often cause problems for the planning module of autonomous driving. To overcome this problem, we present an end-to-end network, CGNet, with three key modules: 1) a Junction Aware Query Enhancement module, which provides positional priors to accurately predict junction points; 2) a Bézier Space Connection module, which enforces continuity constraints on any two topologically connected segments in a Bézier space; 3) an Iterative Topology Refinement module, which is a graph-based network with memory that iteratively refines the predicted topological connectivity. CGNet achieves state-of-the-art performance on both the nuScenes and Argoverse2 datasets. Our code will be publicly available.
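To make the continuity constraint concrete, the sketch below penalizes both the endpoint gap (C0) and the end-tangent mismatch (C1) between two connected cubic Bézier segments. This is an assumed, generic formulation for illustration; the paper's Bézier Space Connection module may define its constraint differently.

```python
import torch

def bezier_continuity_loss(ctrl_a, ctrl_b):
    """Generic continuity penalty between two connected cubic Bezier segments,
    each given as (4, 2) control points: C0 (shared endpoint) plus
    C1 (matching end tangents)."""
    c0 = torch.norm(ctrl_a[3] - ctrl_b[0])      # segment endpoints must meet
    tangent_out = ctrl_a[3] - ctrl_a[2]         # outgoing tangent of segment A
    tangent_in = ctrl_b[1] - ctrl_b[0]          # incoming tangent of segment B
    c1 = torch.norm(tangent_out - tangent_in)   # tangents must agree
    return c0 + c1

# Example with two hand-made segments sharing an endpoint and a tangent.
a = torch.tensor([[0., 0.], [1., 0.], [2., 0.], [3., 0.]])
b = torch.tensor([[3., 0.], [4., 0.], [5., 1.], [6., 2.]])
loss = bezier_continuity_loss(a, b)   # zero C0 gap and zero C1 mismatch here
```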
FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving
Xingtai Gui · Tengteng Huang · Haonan Shao · Haotian Yao · Chi Zhang
Future instance prediction from a bird's eye view (BEV) perspective is a vital component in autonomous driving, which involves future instance segmentation and instance motion prediction. Existing methods usually rely on a redundant and complex pipeline that requires multiple auxiliary outputs and post-processing procedures. Moreover, estimation errors in each of the auxiliary predictions degrade the final prediction performance. In this paper, we propose a simple yet effective fully end-to-end framework named Future Instance Prediction Transformer (FipTR), which views the task as BEV instance segmentation and prediction for future frames. We propose to adopt instance queries representing specific traffic participants to directly estimate the corresponding future occupied masks, and thus get rid of complex post-processing procedures. Besides, we devise a flow-aware BEV predictor for future BEV feature prediction, composed of flow-aware deformable attention that uses backward flow to guide the offset sampling. A novel future instance matching strategy is also proposed to further improve temporal coherence. Extensive experiments demonstrate the superiority of FipTR and its effectiveness under different temporal BEV encoders. The code will be released.
Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)
Qifeng Li · Xiaosong Jia · Shaobo Wang · Junchi Yan
Real-world autonomous driving (AD), especially urban driving, involves many corner cases. The recently released AD simulator CARLA v2 adds 39 common events to the driving scene and provides a more quasi-realistic testbed than CARLA v1. It poses new challenges to the community, and so far no literature has reported success on the new scenarios in v2, as existing works mostly rely on specific rules for planning that cannot cover the more complex cases in CARLA v2. In this work, we take the initiative of directly training a planner, with the hope of handling corner cases flexibly and effectively, which we believe is also the future of AD. To the best of our knowledge, we develop the first model-based RL method for AD, named Think2Drive, with a world model that learns the transitions of the environment and then acts as a neural simulator to train the planner. This paradigm significantly boosts training efficiency due to the low-dimensional state space and the parallel computation of tensors in the world model. As a result, Think2Drive reaches expert-level proficiency in CARLA v2 within 3 days of training on a single A6000 GPU; to the best of our knowledge, there is no previously reported success (100% route completion) on CARLA v2. We also propose CornerCaseRepo, a benchmark that supports the evaluation of driving models by scenario. Additionally, we propose a new and balanced metric that evaluates performance by route completion, infraction number, and scenario density, so that the driving score gives more information about the actual driving performance.
Solving Motion Planning Tasks with a Scalable Generative Model
Yihan Hu · Siqi Chai · Zhening Yang · Jingyu Qian · Kun Li · Wenxin Shao · Haichao Zhang · Wei Xu · Qiang Liu
As autonomous driving systems are being deployed to millions of vehicles, there is a pressing need to improve the system's scalability and safety and to reduce engineering cost. A realistic, scalable, and practical simulator of the driving world is highly desired. In this paper, we present an efficient solution based on generative models that learns the dynamics of driving scenes. With this model, we can not only simulate the diverse futures of a given driving scenario but also generate a variety of driving scenarios conditioned on various prompts. Our innovative design allows the model to operate in both full-autoregressive and partial-autoregressive modes, significantly improving inference and training speed without sacrificing generative capability. This efficiency makes it ideal to be used as an online reactive environment for reinforcement learning, an evaluator for planning policies, and a high-fidelity simulator for testing. We evaluate our model against two real-world datasets: the Waymo motion dataset and the nuPlan dataset. On the simulation realism and scene generation benchmarks, our model achieves state-of-the-art performance, and in the planning benchmarks, our planner outperforms prior art. We conclude that the proposed generative model may serve as a foundation for a variety of motion planning tasks, including data generation, simulation, planning, and online training. Code will be publicly available.
Enhanced Motion Forecasting with Visual Relation Reasoning
Sungjune Kim · Hadam Baek · Seunggwan Lee · Hyung-gun Chi · Hyerin Lim · Jinkyu Kim · Sangpil Kim
In this work, we emphasize and demonstrate the importance of visual relation learning for the motion forecasting task in autonomous driving (AD). Since exploiting the benefits of RGB images in existing vision-based joint perception and prediction (PnP) networks is limited to the perception stage, we delve into how the explicit utilization of visual semantics in motion forecasting can enhance its performance. Specifically, this work proposes ViRR (Visual Relation Reasoning), which aims to provide the prediction module with complex visual reasoning about relationships among scene agents. To achieve this, we construct a novel visual scene graph, where the pairwise visual relations are first aggregated as each agent's node feature. Then, the relations of the nodes are learned via our proposed higher-order relation reasoning method, which leverages the consecutive powers of the graph adjacency matrix. As a result, the extracted complex visual interrelations between the scene agents enable precise forecasting and provide explainable reasons for the model's predictions. The proposed module is fully differentiable and can thus be easily applied to any existing vision-based PnP network. We evaluate the motion forecasting performance of ViRR on the challenging nuScenes benchmark and demonstrate its necessity.
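As an illustration of reasoning with consecutive adjacency powers, the sketch below aggregates k-hop relations by repeatedly multiplying a row-normalized adjacency matrix with the node features. The normalization and the simple averaging are assumptions for illustration, not ViRR's actual aggregation scheme.

```python
import torch

def higher_order_reasoning(node_feats, adj, order=3):
    """Aggregate k-hop visual relations using consecutive powers of a
    row-normalized adjacency matrix (assumed aggregation for illustration).

    node_feats: (N, D) agent node features
    adj:        (N, N) non-negative visual relation weights
    """
    adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    out = node_feats.clone()
    power = torch.eye(adj.size(0))
    for _ in range(order):
        power = power @ adj              # A, A^2, ..., A^order
        out = out + power @ node_feats   # add k-hop aggregated features
    return out / (order + 1)
```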
OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
Ming Hu · Peng Xia · Lin Wang · Siyuan Yan · Feilong Tang · zhongxing xu · Yimin Luo · Kaimin Song · Jurgen Leitner · Xuelian Cheng · Jun Cheng · Chi Liu · Kaijing Zhou · Ge Zongyuan
Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets face challenges such as small scale, lack of diversity in surgery and phase categories, and absence of time-localized annotations. These limitations impede action understanding and model generalization validation in complex and diverse real-world surgical scenarios. To address this gap, we introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. OphNet features: 1) A diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 fine-grained operations. 2) Sequential and hierarchical annotations for each surgery, phase, and operation, enabling comprehensive understanding and improved interpretability. 3) Time-localized annotations, facilitating temporal localization and prediction tasks within surgical workflows. With approximately 205 hours of surgical videos, OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark. Code and dataset are available at: \url{https://minghu0830.github.io/OphNet-benchmark/}.
Event-Aided Time-To-Collision Estimation for Autonomous Driving
Jinghang Li · Bangyan Liao · Xiuyuan LU · Peidong Liu · Shaojie Shen · Yi Zhou
Predicting a potential collision with leading vehicles is a fundamentally essential functionality of any autonomous/assisted driving system. One bottleneck of existing vision-based solutions is that their updating rate is limited to the frame rate of standard cameras used. In this paper, we present a novel method that estimates the time to collision using a neuromorphic event-based camera, a biologically inspired visual sensor that can sense at exactly the same rate as scene dynamics. The core of the proposed algorithm consists of a two-step approach for efficient and accurate geometric model fitting on event data in a coarse-to-fine manner. The first step is a robust linear solver based on a novel geometric measurement that overcomes the partial observability of event-based normal flow. The second step further refines the resulting model via a spatio-temporal registration process which is formulated as a nonlinear optimization problem. Experiments demonstrate the effectiveness of the proposed method, outperforming other counterparts in terms of efficiency and accuracy.
We tackle the problem of mosaicing bundle adjustment (i.e., simultaneous refinement of camera orientations and scene map) for a purely rotating event camera. We formulate the problem as a regularized non-linear least squares optimization. The objective function is defined using the linearized event generation model in the camera orientations and the panoramic gradient maps of the scene. We show that this BA optimization has an exploitable block-diagonal sparsity structure, so that the problem can be solved efficiently. To the best of our knowledge, this is the first work to leverage such sparsity to speed up the optimization in the context of event-based cameras, without the need to convert events into image-like representations. We evaluate EMBA on both synthetic and real-world datasets to show its effectiveness (50% photometric error decrease), yielding results of unprecedented quality. In addition, we demonstrate EMBA using high spatial resolution event cameras, yielding delicate panoramas in the wild, even without an initial map. We make the source code publicly available.
Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations
Zipeng Wang · yunfan lu · LIN WANG
Reconstructing intensity frames from event data while maintaining high temporal resolution and dynamic range is crucial for bridging the gap between event-based and frame-based computer vision. Previous approaches have depended on supervised learning on synthetic data, which lacks interpretability and risks overfitting to the settings of the event simulator. Recently, self-supervised learning (SSL) based methods, which primarily utilize per-frame optical flow to estimate intensity via photometric constancy, have been actively investigated. However, they are vulnerable to errors when the optical flow is inaccurate. This paper proposes a novel SSL event-to-video reconstruction approach, dubbed EvINR, which eliminates the need for labeled data or optical flow estimation. Our core idea is to reconstruct intensity frames by directly addressing the event generation model, essentially a partial differential equation (PDE) that describes how events are generated from the time-varying brightness signal. Specifically, we utilize an implicit neural representation (INR), which takes in a spatiotemporal coordinate $(x, y, t)$ and predicts an intensity value, to represent the solution of the event generation equation. The INR, parameterized as a fully-connected Multi-layer Perceptron (MLP), can be optimized with its temporal derivatives supervised by events. To make EvINR feasible for online requirements, we propose several acceleration techniques that substantially expedite the training process. Comprehensive experiments demonstrate that our EvINR surpasses previous SSL methods by 38% in mean squared error (MSE) and is comparable or superior to SoTA supervised methods.
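The sketch below illustrates the central idea of supervising an INR's temporal derivative with events: an MLP maps $(x, y, t)$ to an intensity value, and automatic differentiation gives the temporal derivative that should match threshold-scaled event activity. The network size and the exact loss form are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EvINRSketch(nn.Module):
    """Minimal INR taking (x, y, t) and predicting an intensity value."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, coords):                     # coords: (N, 3) = (x, y, t)
        return self.mlp(coords)

def event_supervision_loss(model, coords, event_count, threshold=0.2):
    """Assumed loss form: the temporal derivative dL/dt at each (x, y, t)
    should match the threshold-scaled event count accumulated there."""
    coords = coords.clone().requires_grad_(True)
    intensity = model(coords)                      # (N, 1)
    grads = torch.autograd.grad(intensity.sum(), coords, create_graph=True)[0]
    dL_dt = grads[:, 2:3]                          # derivative w.r.t. t
    return ((dL_dt - threshold * event_count) ** 2).mean()
```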
AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation
Lorenzo Mur Labadia · Ruben Martinez-Cantin · Jose J Guerrero · Giovanni Maria Farinella · Antonino Furnari
Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants or human-robot interaction to understand the user’s goals, but there is still room for improvement to perform STA in a precise and reliable way. In this work, we improve the performance of STA predictions with two contributions: 1) We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-video input pair; 2) We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hand and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant relative Overall Top-5 mAP improvements of up to +45% on Ego4D and +42% on a novel set of curated EPIC-Kitchens STA labels. We will release the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.
Learning-based Axial Video Motion Magnification
Kwon Byung-Ki · HYUNBIN OH · Kim Jun-Seong · Hyunwoo Ha · Tae-Hyun Oh
Video motion magnification amplifies invisible small motions to be perceptible, which provides humans with a spatially dense and holistic understanding of small motions in the scene of interest. This is based on the premise that magnifying small motions enhances the legibility of motions. In the real world, however, vibrating objects often possess convoluted systems that have complex natural frequencies, modes, and directions. Existing motion magnification often fails to improve legibility since the intricate motions still retain complex characteristics even after being magnified, which likely distracts us from analyzing them. In this work, we focus on improving legibility by proposing a new concept, axial video motion magnification, which magnifies decomposed motions along the user-specified direction. Axial video motion magnification can be applied to various applications where motions of specific axes are critical, by providing simplified and easily readable motion information. To achieve this, we propose a novel Motion Separation Module that enables the disentangling and magnifying of motion representation along axes of interest. Furthermore, we build a new synthetic training dataset for our task that is generalized to real data. Our proposed method improves the legibility of resulting motions along certain axes by adding a new feature: user controllability. In addition, axial video motion magnification is a more generalized concept; thus, our method can be directly adapted to the generic motion magnification and achieves favorable performance against competing methods. The code and dataset are available on our project page: https://axial-momag.github.io/axial momag/.
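To give a concrete picture of the core idea, the sketch below magnifies only the component of a dense motion field that lies along a user-specified axis. It is a purely geometric illustration under assumed inputs; the paper's learned Motion Separation Module operates on motion representations rather than an explicit flow field.

```python
import numpy as np

def magnify_along_axis(flow, axis, alpha=10.0):
    """Amplify only the component of a dense motion field that lies along a
    user-specified axis (geometric illustration of axial magnification).

    flow: (H, W, 2) per-pixel motion vectors; axis: 2-vector of any length
    """
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    along = (flow @ axis)[..., None] * axis   # projection onto the chosen axis
    across = flow - along                     # orthogonal component, untouched
    return across + alpha * along
```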
Motion Keyframe Interpolation for Any Human Skeleton using Point Cloud-based Human Motion Data Homogenisation
Clinton Mo · Kun Hu · Chengjiang Long · Dong Yuan · Zhiyong Wang
In the realm of character animation workflows, learned keyframe interpolation algorithms have been extensively researched, which necessitates the availability of large motion datasets. However, owing to differing hierarchical skeletal structures, such datasets often lack cross-compatibility between their native motion skeleton and the skeletons desired for different applications. To reconfigure motion data for new skeletons, motion re-targeting is essential. Yet, conventional re-targeting methods are incompatible with concurrent animation workflows, while learned methods require pre-established datasets for new skeletons. In this paper, we propose the first unsupervised learning approach, namely Point Cloud-based Motion Representation Learning (PC-MRL), for re-targeting motions from human motion datasets to any human skeleton for motion keyframe interpolation. PC-MRL consists of point cloud obfuscation with skeletal sampling and unsupervised skeleton reconstruction. The point cloud space is geometry-independent in representing 3D pose and motion data, and effectively obscures any skeletal configuration, ensuring cross-skeleton consistency. In this space, a cross-skeleton K-nearest neighbors loss is devised for unsupervised learning. Moreover, a first-frame offset quaternion is devised to represent rotations with relative roll for motion interpolation. Comprehensive experiments demonstrate the effectiveness of PC-MRL in motion interpolation without using target skeletal motion data. We also achieve superior reconstruction metrics for re-targeting.
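A K-nearest-neighbors loss between sampled body point clouds can be written compactly as below. This is an assumed, Chamfer-style symmetric form shown only to illustrate the kind of objective used in the geometry-independent point-cloud space; it is not necessarily PC-MRL's exact loss.

```python
import torch

def cross_skeleton_knn_loss(pred_points, target_points, k=1):
    """Symmetric K-nearest-neighbour loss between two body point clouds
    (assumed Chamfer-style form for illustration).

    pred_points: (N, 3), target_points: (M, 3)
    """
    dists = torch.cdist(pred_points, target_points)               # (N, M)
    forward = dists.topk(k, dim=1, largest=False).values.mean()   # pred -> target
    backward = dists.topk(k, dim=0, largest=False).values.mean()  # target -> pred
    return forward + backward
```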
Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs
Aayam Shrestha · Pan Liu · German Ros · Kai Yuan · Alan Fern
This work focuses on generating realistic, physically-based human behaviors from multi-modal inputs, which may only partially specify the desired motion. For example, the input may come from a VR controller providing arm motion and body velocity, partial key-point animation, computer vision applied to videos, or even higher-level motion goals. This requires a versatile low-level humanoid controller that can handle such sparse, under-specified guidance, seamlessly switch between skills, and recover from failures. Current approaches for learning humanoid controllers from demonstration data capture some of these characteristics, but none achieve them all. To this end, we introduce the Masked Humanoid Controller (MHC), a novel approach that applies multi-objective imitation learning on augmented and selectively masked motion demonstrations. The training methodology results in an MHC that exhibits the key capabilities of catching up to out-of-sync input commands, combining elements from multiple motion sequences, and completing unspecified parts of motions from sparse multimodal input. We demonstrate these key capabilities for an MHC learned over a dataset of 87 diverse skills and showcase different multi-modal use cases, including integration with planning frameworks to highlight MHC’s ability to solve new user-defined tasks without any finetuning.
Scalable Group Choreography via Variational Phase Manifold Learning
Nhat Le · Khoa Do · Xuan Bui · Tuong Do · Erman Tjiputra · Quang Tran · Anh Nguyen
Generating group dance motion from music is a challenging task with several industrial applications. Although several methods have been proposed to tackle this problem, most of them prioritize optimizing the fidelity of the dancing movement, constrained by the predetermined dancer counts in datasets. This limitation impedes adaptability to real-world applications. Our study addresses the scalability problem in group choreography while preserving naturalness and synchronization. In particular, we propose a phase-based variational generative model for group dance generation that learns a generative phase manifold. Our method achieves high-fidelity group dance motion and enables generation with an unlimited number of dancers while consuming only a minimal and constant amount of memory. Intensive experiments on two public datasets show that our proposed method outperforms recent state-of-the-art approaches by a large margin and is scalable to a number of dancers far beyond the training data.
FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
Zhikai Zhang · Yitang Li · Haofeng Huang · Mingxian Lin · Li Yi
Human motion synthesis is a fundamental task in computer animation. Despite recent progress in this field utilizing deep learning and motion capture data, existing methods are always limited to specific motion categories, environments, and styles. This poor generalizability can be partially attributed to the difficulty and expense of collecting large-scale, high-quality motion data. At the same time, foundation models trained with internet-scale image and text data have demonstrated surprising world knowledge and reasoning ability for various downstream tasks. Utilizing these foundation models may help with human motion synthesis, which some recent works have only superficially explored. However, these methods did not fully unveil the foundation models' potential for this task and only support several simple actions and environments. In this paper, we explore, for the first time and without any motion data, open-set human motion synthesis across arbitrary motion tasks and environments, using natural language instructions as user control signals and multimodal large language models (MLLMs) as the backbone. Our framework can be split into two stages: 1) sequential keyframe generation, utilizing MLLMs as a keyframe designer and animator; 2) motion filling between keyframes through interpolation and motion tracking. Our method can achieve general human motion synthesis for many downstream tasks. The promising results demonstrate the value of mocap-free human motion synthesis aided by MLLMs and pave the way for future research.
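Stage 2's motion filling relies on standard interpolation primitives between keyframe poses. Below is a minimal spherical linear interpolation (slerp) helper for joint rotations, shown purely to illustrate that step; the paper additionally uses motion tracking, which is not sketched here.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions at t in [0, 1]."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.clip(np.dot(q0, q1), -1.0, 1.0)
    if dot < 0.0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

# Halfway between the identity rotation and a 90-degree rotation about z.
q_mid = slerp(np.array([1.0, 0, 0, 0]), np.array([0.7071, 0, 0, 0.7071]), 0.5)
```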
Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation
Jinpeng Liu · Wenxun Dai · Chunyu Wang · Yiji Cheng · Yansong Tang · Xin Tong
Conventional text-to-motion generation methods are usually trained on limited text-motion pairs, making them hard to generalize to open-vocabulary scenarios. Some works use the CLIP model to align the motion space and the text space, aiming to enable motion generation from natural language motion descriptions. However, they are still constrained to generating limited and unrealistic in-place motions. To address these issues, we present a divide-and-conquer framework named PRO-Motion (Plan, postuRe and gO for text-to-Motion generation), which consists of three modules: a motion planner, a posture-diffuser, and a go-diffuser. The motion planner instructs Large Language Models (LLMs) to generate a sequence of scripts describing the key postures in the target motion. Differing from natural language, the scripts can describe all possible postures following very simple text templates. This significantly reduces the complexity of the posture-diffuser, which transforms a script into a posture, paving the way for open-vocabulary text-to-motion generation. Finally, the go-diffuser, implemented as another diffusion model, not only increases the number of motion frames but also estimates whole-body translations and rotations for all postures, resulting in more dynamic motions. Experimental results show the superiority of our method over other counterparts and demonstrate its capability of generating diverse and realistic motions from complex open-vocabulary prompts such as “Experiencing a profound sense of joy”.
Drag Anything: Motion Control for Anything using Entity Representation
Weijia Wu · Zhuang Li · Yuchao Gu · Rui Zhao · Yefei He · Junhao Zhang · Mike Zheng Shou · Yan Li · Tingting Gao · Zhang Di
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation. Compared to existing motion control methods, DragAnything offers several advantages. Firstly, trajectory-based interaction is more user-friendly, since acquiring other guidance signals (e.g., masks, depth maps) is labor-intensive; users only need to draw a line (trajectory) during interaction. Secondly, our entity representation serves as an open-domain embedding capable of representing any object, enabling motion control for diverse entities, including the background. Lastly, our entity representation allows simultaneous and distinct motion control for multiple objects. Extensive experiments demonstrate that our DragAnything enables precise and flexible control over the motion of any object in video generation.
Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores
Lucas Goncalves · Prashant Mathur · Chandrashekhar Lavania · Metehan Cekic · Marcello Federico · Kyu Han
Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks; universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately, there is a lack of metrics that offer a quantitative and interpretable measure of audio-visual synchronization for videos "in the wild". To address this gap, we first created a large-scale human-annotated dataset (100+ hrs) representing nine types of synchronization errors in audio-visual content and how humans perceive them. We then developed PEAVS (Perceptual Evaluation of Audio-Visual Synchrony), a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization. We validate PEAVS using a newly generated dataset, achieving a Pearson correlation of 0.79 at the set level and 0.54 at the clip level when compared to human labels. In our experiments, we observe a relative gain of 50% over a natural extension of Fréchet-based metrics for audio-visual synchrony, confirming PEAVS’ efficacy in objectively modeling subjective perceptions of audio-visual synchronization for videos "in the wild".
Current visual generation methods can produce high-quality videos guided by texts. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue to generate temporally-synchronized image animations. We introduce Audio-Synchronized Visual Animation (ASVA), a task animating a static image to demonstrate motion dynamics, temporally guided by audio clips across multiple classes. To this end, we present AVSync15, a dataset curated from VGGSound with videos featuring synchronized audio-visual events across 15 categories. We also present a diffusion model, AVSyncD, capable of generating dynamic animations guided by audios. Extensive evaluations validate AVSync15 as a reliable benchmark for synchronized generation and demonstrate our model's superior performance. We further explore AVSyncD's potential in a variety of audio-synchronized generation tasks, from generating full videos without a base image to controlling object motions with various sounds. We hope our established benchmark can open new avenues for controllable visual generation.
E.T. the Exceptional Trajectory: Text-to-camera-trajectory generation with character awareness
Robin Courant · Nicolas Dufour · Xi WANG · Marc Christie · Vicky Kalogeiton
Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, and in particular camera placement and movements over time. Crafting compelling camera motion trajectories remains a complex iterative process even for skilful artists. While recent work has been focusing on speeding up the process, current solutions remain limited in describing and generating complex and content-aware camera trajectories, especially for movies, where the story evolves around moving characters and hence the camera follows them. To alleviate this, in this paper, we propose a diffusion-based approach, named DIRECTOR, which generates complex compositions of camera trajectories from high-level textual inputs that describe the relation and synchronisation between the camera and characters. Furthermore, we propose a dataset called Exceptional Trajectories (E.T.) with camera motion trajectories along with textual captions with descriptive information of camera and character motion. To our knowledge, this is the first movie dataset of its kind. For proper evaluation, we also provide a robust and accurate language trajectory feature representation. Extensive experiments and analysis show that DIRECTOR successfully leverages both the caption and camera trajectories and sets the new state of the art on this task. Our work represents a significant advancement in democratizing the art of cinematography for amateur and experienced users.
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
Rui Zhao · Yuchao Gu · Jay Zhangjie Wu · Junhao Zhang · Jiawei Liu · Weijia Wu · Jussi Keppo · Mike Zheng Shou
Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generation. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. Adaptation methods have been developed for customizing appearance, such as subject or style, yet remain under-explored for motion. It is straightforward to extend mainstream adaptation methods to motion customization, including full model tuning and Low-Rank Adaptations (LoRAs). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRA architecture to decouple the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show that the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as mixing different videos with their respective appearance and motion, and animating a single image with customized motions. Our code and model weights will be released.
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
Yuwei Guo · Ceyuan Yang · Anyi Rao · Maneesh Agrawala · Dahua Lin · Bo Dai
The development of text-to-video (T2V) generation, i.e., generating videos from a given text prompt, has advanced significantly in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, but collecting them accordingly increases the burden at inference time. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and enabling applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available.
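One way to picture feeding temporally sparse signals to a condition encoder is sketched below: the few provided condition frames are placed at their temporal positions, the rest are zeroed, and a binary mask channel marks which frames carry a signal. This input formatting is an assumption for illustration, not SparseCtrl's actual encoder interface.

```python
import torch

def build_sparse_condition(cond_frames, cond_indices, video_len):
    """Assemble a (T, C+1, H, W) sparse condition tensor from a handful of
    condition frames (each (C, H, W)) plus a binary validity mask channel.
    Assumed formatting for illustration only."""
    c, h, w = cond_frames[0].shape
    cond = torch.zeros(video_len, c, h, w)
    mask = torch.zeros(video_len, 1, h, w)
    for frame, idx in zip(cond_frames, cond_indices):
        cond[idx] = frame
        mask[idx] = 1.0
    return torch.cat([cond, mask], dim=1)

# Example: condition only the first and last of 16 frames with RGB keyframes.
keyframes = [torch.rand(3, 64, 64), torch.rand(3, 64, 64)]
sparse_cond = build_sparse_condition(keyframes, cond_indices=[0, 15], video_len=16)
```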
Object-Centric Diffusion for Efficient Video Editing
Kumara Kahatapitiya · Adil Karjauv · Davide Abati · Fatih Porikli · Yuki M Asano · Amirhossein Habibian
Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we analyze such inefficiencies and suggest simple yet effective modifications that allow significant speed-ups while maintaining quality. Moreover, we introduce Object-Centric Diffusion to fix generation artifacts and further reduce latency by allocating more computation towards foreground edited regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient and background regions and spends most on the former, and ii) Object-Centric Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x for comparable synthesis quality.
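A much-simplified stand-in for object-centric token merging is sketched below: all foreground tokens are kept, while background tokens are collapsed into a smaller set of averaged representatives before cross-frame attention. The chunk-and-average rule and the keep ratio are assumptions for illustration, not the paper's merging procedure.

```python
import torch

def merge_background_tokens(tokens, fg_mask, keep_ratio=0.25):
    """Keep all foreground tokens and replace background tokens with a reduced
    set of averaged representatives (assumed simplification for illustration).

    tokens:  (N, D) frame tokens
    fg_mask: (N,) boolean, True for tokens in edited/foreground regions
    """
    fg = tokens[fg_mask]
    bg = tokens[~fg_mask]
    if bg.shape[0] == 0:
        return fg
    n_keep = max(1, int(keep_ratio * bg.shape[0]))
    merged_bg = torch.stack([chunk.mean(dim=0) for chunk in torch.chunk(bg, n_keep)])
    return torch.cat([fg, merged_bg], dim=0)
```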
GroupDiff: Diffusion-based Group Portrait Editing
Yuming Jiang · Nanxuan Zhao · Qing Liu · Krishna Kumar Singh · Shuai Yang · Chen Change Loy · Ziwei Liu
Group portrait editing is highly desirable since users constantly want to add a person, delete a person, or manipulate existing persons. It is also challenging due to the intricate dynamics of human interactions and the diverse gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there are no labeled data for group photo editing, we create a data engine to generate paired data for training. The training data engine covers the diverse needs of group portrait editing. 2) Appearance Preservation: To keep the appearance consistent after editing, we inject the images of persons from the group photo into the attention modules and employ skeletons to provide intra-person guidance. 3) Control Flexibility: Bounding boxes indicating the locations of each person are used to reweight the attention matrix so that the features of each person can be injected into the correct places. This inter-person guidance provides flexible manners for manipulation. Extensive experiments demonstrate that GroupDiff exhibits state-of-the-art editing performance compared to existing methods. Notably, GroupDiff offers excellent controllability for editing as well as maintains the fidelity of the original group photos.
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
Ruibin Li · Ruihuang Li · Song Guo · Yabin Zhang
Text-driven diffusion models have significantly advanced image editing performance by using text prompts as inputs. One crucial step in text-driven image editing is to invert the original image into a latent noise code conditioned on the source prompt. While previous methods have achieved promising results by refactoring the image synthesis process, the inverted latent noise code is tightly coupled with the source prompt, limiting editability under target text prompts. To address this issue, we propose a novel method called Source Prompt Disentangled Inversion (SPDInv), which aims to reduce the impact of the source prompt and thereby enhance text-driven image editing with diffusion models. To make the inverted noise code as independent of the given source prompt as possible, we observe that the iterative inversion process should satisfy a fixed-point constraint. Consequently, we transform the inversion problem into a search for the fixed-point solution and utilize pre-trained diffusion models to facilitate the search. The experimental results show that our proposed SPDInv method can effectively mitigate the conflicts between the target editing prompt and the source prompt, leading to a significant reduction in editing artifacts. Furthermore, beyond text-driven image editing, SPDInv allows us to easily adapt customized image generation methods to localized editing tasks with promising performance.
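The fixed-point idea can be pictured as iterating a single inversion step until its output stops changing, as sketched below. The update function, iteration count, and overall structure are assumptions for illustration; the paper's search procedure may differ.

```python
import torch

def fixed_point_inversion_step(z_prev, inv_step_fn, n_iters=5):
    """Refine one inversion step by fixed-point iteration (assumed form):
    find z_t such that z_t == inv_step_fn(z_t, z_prev), where inv_step_fn is a
    hypothetical DDIM-inversion update conditioned on the source prompt."""
    z_t = z_prev.clone()
    for _ in range(n_iters):
        z_t = inv_step_fn(z_t, z_prev)   # iterate towards the fixed point
    return z_t

# Toy example with a contraction map standing in for the real update.
toy_update = lambda z, z_prev: 0.5 * z + 0.5 * z_prev
z = fixed_point_inversion_step(torch.randn(4), toy_update)
```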
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
Vadim Titov · Madina Khalmatova · Alexandra Ivanova · Dmitry P Vetrov · Aibek Alanov
Despite recent advances in large-scale text-to-image generative models, manipulating real images with these models remains a challenging problem. The main limitations of existing editing methods are that they either fail to perform with consistent quality on a wide range of image edits, or require time-consuming hyperparameter tuning or fine-tuning of the diffusion model to preserve the image-specific appearance of the input image. Most of these approaches utilize source-image information via intermediate feature caching, inserting the cached features directly into the generation process. However, this technique produces feature misalignment in the model, which leads to inconsistent results. We propose a novel approach built upon a modified diffusion sampling process with a guidance mechanism. In this work, we explore a self-guidance technique to preserve the overall structure of the input image and the appearance of its local regions that should not be edited. In particular, we explicitly introduce layout-preserving energy functions that aim to preserve the local and global structures of the source image. Additionally, we propose a noise rescaling mechanism that preserves the noise distribution by balancing the norms of the classifier-free guidance term and our proposed guiders during generation, which leads to more consistent and better editing results. This guidance approach requires neither fine-tuning of the diffusion model nor an exact inversion process. As a result, the proposed method provides a fast and high-quality editing mechanism. In our experiments, we show through human evaluation and quantitative analysis that the proposed method produces the desired edits, which are preferred by human raters, and achieves a better trade-off between editing quality and preservation of the original image.
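One simple way to balance a guider against classifier-free guidance is to rescale the guider to the norm of the CFG term before adding it to the noise prediction, as sketched below. This combination rule is an assumption for illustration and is not the paper's exact rescaling formula.

```python
import torch

def guided_noise(eps_uncond, eps_cond, guider_grad, cfg_scale=7.5, g_scale=1.0):
    """Combine classifier-free guidance with an extra self-guidance term whose
    norm is rescaled to that of the CFG term (assumed rule for illustration)."""
    cfg_term = cfg_scale * (eps_cond - eps_uncond)
    # Rescale the guider so it neither overwhelms nor vanishes next to CFG.
    scale = cfg_term.norm() / (guider_grad.norm() + 1e-8)
    return eps_uncond + cfg_term + g_scale * scale * guider_grad
```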
Towards compact reversible image representations for neural style transfer
Xiyao Liu · Siyu Yang · Jian Zhang · Gerald Schaefer · Jiya Li · Xunli FAN · Songtao Wu · Hui Fang
Arbitrary neural style transfer aims to stylise a content image by referencing a provided style image. Despite various efforts to achieve both content preservation and style transferability, learning effective representations for this task remains challenging since the redundancy of content and style features leads to unpleasant image artefacts. In this paper, we learn compact neural representations for style transfer motivated from an information theoretical perspective. In particular, we enforce compressive representations across sequential modules of a reversible flow network in order to reduce feature redundancy without losing content preservation capability. We use a Barlow twins loss to reduce channel dependency and thus to provide better content expressiveness, and optimise the Jensen-Shannon divergence of style representations between reference and target images to avoid under- and over-stylisation. We demonstrate the effectiveness of our proposed method in comparison to other state-of-the-art style transfer methods.
InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser
Xing Cui · Zekun Li · Peipei Li · Huaibo Huang · Xuannan Liu · Zhaofeng He
Stylized text-to-image generation focuses on creating images from textual descriptions while adhering to a style specified by a few reference images. However, subtle style variations within different reference images can hinder the model from accurately learning the target style. In this paper, we propose InstaStyle, a novel approach that excels in generating high-fidelity stylized images with only a single reference image. Our approach is based on the finding that the inversion noise from a stylized reference image inherently carries the style signal, as evidenced by their non-zero signal-to-noise ratio. We employ DDIM inversion to extract this noise from the reference image and leverage a diffusion model to generate new stylized images from the ``style'' noise. Additionally, the inherent ambiguity and bias of textual prompts impede the precise conveying of style during image inversion. To address this, we devise prompt refinement, which learns a style token assisted by human feedback. Qualitative and quantitative experimental results demonstrate that InstaStyle achieves superior performance compared to current benchmarks. Furthermore, our approach also showcases its capability in the creative task of style combination with mixed inversion noise. We will release the code upon publication.
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing
Jing Gu · Nanxuan Zhao · Wei Xiong · Qing Liu · Zhifei Zhang · He Zhang · Jianming Zhang · HyunJoon Jung · Yilin Wang · Xin Eric Wang
Effective editing of personal content holds a pivotal role in enabling individuals to express their creativity, weave captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, while keeping the context unchanged. Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control of arbitrary objects and parts rather than the main subject, (2) more faithful preservation of context pixels, (3) better adaptation of the personalized concept to the image. First, we propose targeted variable swapping to apply region control over latent feature maps and swap masked variables for faithful context preservation and initial semantic concept swapping. Then, we introduce appearance adaptation, to seamlessly adapt the semantic concept into the original image in terms of target location, shape, style, and content during the image generation process. Extensive results on both human and automatic evaluation demonstrate significant improvements of our approach over baseline methods on personalized swapping. Furthermore, SwapAnything shows its precise and faithful swapping abilities across single object, multiple objects, partial object, and cross-domain swapping tasks.
When and How do negative prompts take effect?
Yuanhao Ban · Ruochen Wang · Tianyi Zhou · Minhao Cheng · Boqing Gong · Cho-Jui Hsieh
The concept of negative prompts, emerging from conditional generation models like Stable Diffusion, allows users to specify what to exclude from the generated images. Despite the widespread use of negative prompts, their intrinsic mechanisms remain largely unexplored. This paper presents the first comprehensive study to uncover how and when negative prompts take effect. Our extensive empirical analysis identifies two primary behaviors of negative prompts. Delayed Effect: The impact of negative prompts is observed after positive prompts render the corresponding content. Deletion Through Neutralization: Negative prompts delete concepts from the generated image through a mutual cancellation effect in latent space with positive prompts. These insights reveal significant potential for real-world applications; for example, we demonstrate that negative prompts can facilitate object inpainting with minimal alterations to the background via a simple adaptive algorithm. We believe our findings will offer valuable insights for the community in capitalizing on the potential of negative prompts.
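In standard pipelines the negative prompt simply replaces the unconditional branch of classifier-free guidance; the sketch below adds a step threshold as a rough illustration of the observed delayed effect. The threshold value, the default scale, and the function name are assumptions for exposition, not the paper's algorithm.

```python
def guided_noise(eps_pos, eps_neg, eps_uncond, step, start_step=5, scale=7.5):
    """Classifier-free guidance with a negative prompt: the negative-prompt
    prediction stands in for the unconditional branch, pushing the sample away
    from the negative concept. Skipping it for the first few steps mimics the
    delayed effect, since the positive content has not been rendered yet."""
    base = eps_neg if step >= start_step else eps_uncond
    return base + scale * (eps_pos - base)
```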
SPIRE: Semantic Prompt-Driven Image Restoration
Chenyang Qi · Zhengzhong Tu · Keren Ye · Mauricio Delbracio · Peyman Milanfar · Qifeng Chen · Hossein TAlebi
Text-driven diffusion models have become increasingly popular for various image editing tasks, including inpainting, stylization, and object replacement. However, it remains an open research problem to adopt this language-vision paradigm for more fine-level image processing tasks, such as denoising, super-resolution, deblurring, and compression artifact removal. In this paper, we develop SPIRE, a Text-driven Image Restoration framework that leverages natural language as a user-friendly interface to control the image restoration process. We consider the capacity of text information in two dimensions. First, we use content-related prompts to enhance the semantic alignment, effectively alleviating identity ambiguity in the restoration outcomes. Second, our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength, without the need for explicit task-specific design. In addition, we introduce a novel fusion mechanism that augments the existing ControlNet architecture by learning to rescale the generative prior, thereby achieving better restoration fidelity. Our extensive experiments demonstrate the superior restoration performance of SPIRE compared to the state of the art, alongside offering the flexibility of text-based control over the restoration effects.
LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
Runhui Huang · Kaixin Cai · Jianhua Han · Xiaodan Liang · Renjing Pei · Guansong Lu · Xu Songcen · Wei Zhang · Hang Xu
Although diffusion-based generative models succeed in generating high-quality images given any text prompts, prior work directly generates entire images and cannot provide object-wise manipulation capability. To support broader real-world applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. In this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and associated mask layers for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the generation of specific content for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of layer-wise control applications.
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
Yiming Zhao · Zhouhui Lian
Text-to-Image (T2I) generation methods based on diffusion models have garnered significant attention in the last few years. Although these image synthesis methods produce visually appealing results, they frequently exhibit spelling errors when rendering text within the generated images. Such errors manifest as missing, incorrect or extraneous characters, thereby severely constraining the performance of text image generation based on diffusion models. To address the aforementioned issue, this paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion [27]). Our approach involves the design and training of a lightweight character-level text encoder, which replaces the original CLIP encoder and provides more robust text embeddings as conditional guidance. Then, we fine-tune the diffusion model using a large-scale dataset, incorporating local attention control under the supervision of character-level segmentation maps. Finally, by employing an inference-stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images. Both qualitative and quantitative results demonstrate the superiority of our method over the state of the art. Furthermore, we showcase several potential applications of the proposed UDiffText, including text-centric image synthesis, scene text inpainting, etc. Code and model will be available at *.
Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
Luozhou Wang · Guibao Shen · Wenhang Ge · Guangyong Chen · Yijun Li · Yingcong Chen
Text-to-image diffusion models have advanced towards more controllable generation by supporting various additional conditions (e.g., depth map, bounding box) beyond text. However, these models are learned based on the premise of perfect alignment between the text and extra conditions. If this alignment is not satisfied, the final output could be either dominated by one condition, or ambiguity may arise, failing to meet user expectations. To address this issue, we present a training-free approach called Text-Anchored Score Composition (TASC) to further improve the controllability of existing models when provided with partially aligned conditions. TASC first separates conditions based on pairwise relationships, computing the result individually for each pair. This ensures that each pair no longer has conflicting conditions. Then we propose an attention realignment operation to realign these independently calculated results via a cross-attention mechanism to avoid new conflicts when combining them back. Both qualitative and quantitative results demonstrate the effectiveness of our approach in handling unaligned conditions, which performs favorably against recent methods and, more importantly, adds flexibility to the controllable image generation process.
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models
Yang Zhang · Tze Tzun Teoh · Wei Hern Lim · Kenji Kawaguchi
Recent advancements in diffusion models have notably improved the perceptual quality of generated images in text-to-image synthesis tasks. However, diffusion models often struggle to produce images that accurately reflect the intended semantics of the associated text prompts. We examine cross-attention layers in diffusion models and observe a propensity for these layers to disproportionately focus on certain tokens during the generation process, thereby undermining semantic fidelity. To address the issue of dominant attention, we introduce attention regulation, a computationally efficient, on-the-fly optimization approach applied at inference time to align attention maps with the input text prompt. Notably, our method requires no additional training or fine-tuning and serves as a plug-in module for the model. Hence, the generation capacity of the original model is fully preserved. We compare our approach with alternative approaches across various datasets, evaluation metrics, and diffusion models. Experimental results show that our method consistently outperforms other baselines, yielding images that more faithfully reflect the desired concepts with reduced computational overhead.
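A heavily simplified sketch of what an inference-time attention regulation step could look like is given below; the `attn_fn` hook, the balance loss, the step size, and the iteration count are assumptions for illustration, not the paper's objective.

```python
import torch

def attention_regulation_loss(attn, target_token_ids):
    """attn: [batch, heads, image_tokens, text_tokens] cross-attention weights.
    Penalizes the gap between the best-attended target token and the others,
    discouraging any single token from dominating the attention maps."""
    mass = attn.mean(dim=1).sum(dim=1)[:, target_token_ids]   # [batch, n_targets]
    return (mass.max(dim=1, keepdim=True).values - mass).mean()

def regulate_latent(latent, attn_fn, target_token_ids, lr=0.1, n_steps=3):
    """One illustrative on-the-fly adjustment of the latent at a denoising step.
    attn_fn is a hypothetical differentiable hook returning the model's
    cross-attention maps for a given latent."""
    latent = latent.detach().requires_grad_(True)
    for _ in range(n_steps):
        loss = attention_regulation_loss(attn_fn(latent), target_token_ids)
        (grad,) = torch.autograd.grad(loss, latent)
        latent = (latent - lr * grad).detach().requires_grad_(True)
    return latent.detach()
```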
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
Shihao Zhao · Shaozhe Hao · Bojia Zi · Huaizhe Xu · Kwan-Yee K. Wong
Text-to-image generation has made significant advancements with the introduction of text-to-image diffusion models. These models typically consist of a language model that interprets user prompts and a vision model that generates corresponding images. As language and vision models continue to progress in their respective domains, there is a great potential in exploring the replacement of components in text-to-image diffusion models with more advanced counterparts. A broader research objective would therefore be to investigate the integration of any two unrelated language and generative vision models for text-to-image generation. In this paper, we explore this objective and propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and plug-and-play approach without requiring modifications to the original weights of the language and vision models. Our pipeline is compatible with various language models and generative vision models, accommodating different structures. Within this framework, we demonstrate that incorporating superior modules, such as more advanced language models or generative vision models, results in notable improvements in capabilities like text alignment or image quality. Extensive evaluations have been conducted to verify the effectiveness of LaVi-Bridge.
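As a rough picture of the bridging idea, the sketch below maps hidden states of an arbitrary language model into the conditioning space of a generative vision model; the dimensions, module layout, and class name are assumptions, and the LoRA layers the paper injects into both frozen backbones are omitted.

```python
import torch
import torch.nn as nn

class TextFeatureAdapter(nn.Module):
    """Illustrative bridge: projects hidden states from an arbitrary language
    model into the conditioning space expected by a generative vision model.
    Only the adapter (plus optional LoRA layers, not shown) would be trained;
    both backbones stay frozen."""
    def __init__(self, lm_dim=4096, cond_dim=768, hidden=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(lm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, cond_dim),
        )

    def forward(self, lm_hidden_states):       # [batch, seq, lm_dim]
        return self.proj(lm_hidden_states)     # [batch, seq, cond_dim]
```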
Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
Saman Motamed · Danda Pani Paudel · Luc Van Gool
Diffusion models have revolutionized generative content creation, and text-to-image (T2I) diffusion models in particular have increased the creative freedom of users by allowing scene synthesis using natural language. T2I models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the desired concept and enable synthesizing it in new scenes. However, inverting personalized concepts that go beyond object appearance and style (adjectives and verbs) through natural language remains a challenge. Two key characteristics of these concepts contribute to the limitations of current inversion methods. 1) Adjectives and verbs are entangled with nouns (subject) and can hinder appearance-based inversion methods, where the subject appearance leaks into the concept embedding, and 2) describing such concepts often extends beyond single-word embeddings (``being frozen in ice'', ``walking on a tightrope'', etc.) that current methods do not handle. In this study, we introduce Lego, a textual inversion method designed to invert subject-entangled concepts from a few example images. Lego disentangles concepts from their associated subjects using a simple yet effective Subject Separation step and employs a Context Loss that guides the inversion of single/multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over 70% of the time when compared to the baseline. Additionally, visual question answering using a large language model suggested Lego-generated concepts are better aligned with the text description of the concept.
LogoSticker: Inserting Logos into Diffusion Models for Customized Generation
Mingkang Zhu · Xi Chen · Zhongdao Wang · Hengshuang ZHAO · Jiaya Jia
Recent advances in text-to-image model customization have underscored the importance of integrating new concepts with a few examples. Yet, this progress is largely confined to widely recognized subjects, which can be learned with relative ease thanks to the models' adequate shared prior knowledge. In contrast, logos, characterized by unique patterns and textual elements, have little shared knowledge established within diffusion models, thus presenting a unique challenge. To bridge this gap, we introduce the task of logo insertion. Our goal is to insert logo identities into diffusion models and enable their seamless synthesis in varied contexts. We present LogoSticker, a novel two-phase pipeline, to tackle this task. First, we propose the actor-critic relation pre-training algorithm, which addresses the nontrivial gaps in models' understanding of the potential spatial positioning of logos and interactions with other objects. Second, we propose a decoupled identity learning algorithm, which enables precise localization and identity extraction of logos. LogoSticker can generate logos accurately and harmoniously in diverse contexts. We comprehensively validate the effectiveness of LogoSticker over customization methods and large models such as DALL·E 3. Project page: https://mingkangz.github.io/logosticker.
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
Chaofeng Chen · Annan Wang · Haoning Wu · Liang Liao · Wenxiu Sun · Qiong Yan · Weisi Lin
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which presents challenges in meeting specific requirements for downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of these methods overlook the importance of the text encoder, which is typically pretrained and fixed during training. In this paper, we demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it still suffers from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to finetune the text encoder based on task-specific rewards, referred to as TexForce. We first show that finetuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can be simply combined with existing finetuned U-Net models to obtain much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
Trung Dao · Thuan Nguyen · Thanh Van Le · Duc H Vu · Khoi Nguyen · Cuong Pham · Anh Tran
In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart. Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modifications in the training methodology, including better weight initialization and efficient LoRA training. Moreover, our introduction of a novel clamped CLIP loss enhances image-text alignment and results in improved image quality. Remarkably, by combining the weights of models trained with efficient LoRA and full training, we achieve a new state-of-the-art one-step diffusion model, achieving an FID of 8.14 and surpassing all GAN-based and multi-step Stable Diffusion models.
EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models
Ruoxi Chen · Haibo Jin · Yixin Liu · Jinyin Chen · Haohan Wang · Lichao Sun
Text-to-image diffusion models have emerged as an evolutionary technology for producing creative content in image synthesis. Based on the impressive generation abilities of these models, instruction-guided diffusion models can edit images with simple instructions and input images. While they empower users to obtain their desired edited images with ease, they have raised concerns about unauthorized image manipulation. Prior research has delved into the unauthorized use of personalized diffusion models; however, this problem for instruction-guided diffusion models remains largely unexplored. In this paper, we propose a protection method, EditShield, against unauthorized modifications by such models. Specifically, EditShield works by adding imperceptible perturbations that can shift the latent representation used in the diffusion process, forcing models to generate unrealistic images with mismatched subjects. Our extensive experiments demonstrate EditShield's effectiveness on synthetic and real-world datasets. Moreover, we find that EditShield performs robustly against various manipulation settings across editing types and synonymous instruction phrases.
Implicit Concept Removal of Diffusion Models
Zhili LIU · Kai Chen · Yifan Zhang · Jianhua Han · Lanqing Hong · Hang Xu · ZHENGUO LI · Dit-Yan Yeung · James Kwok
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images. These concepts, termed ``implicit concepts'', could be unintentionally learned during training and then be generated uncontrollably during inference. Existing removal methods still struggle to eliminate implicit concepts primarily due to their dependency on the model's ability to recognize concepts it actually cannot discern. To address this, we utilize the intrinsic geometric characteristics of implicit concepts and present Geom-Erasing, a novel concept removal method based on geometry-driven control. Specifically, once an unwanted implicit concept is identified, we integrate the existence and geometric information of the concept into text prompts with the help of an accessible classifier or detector model. Subsequently, the model is optimized to identify and disentangle this information, which is adopted as negative prompts for generation. Moreover, we introduce the Implicit Concept Dataset (ICD), a novel image-text dataset imbued with three typical implicit concepts (i.e., QR codes, watermarks, and text), reflecting real-life situations where implicit concepts are easily injected. Geom-Erasing effectively mitigates the generation of implicit concepts, achieving state-of-the-art results on the Inappropriate Image Prompts (I2P) and our challenging Implicit Concept Dataset (ICD) benchmarks.
NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image
Yoonwoo Jeong · Jinwoo Lee · Chiheon Kim · Minsu Cho · Doyup Lee
Recent advancements in Novel View Synthesis (NVS) from a single image have produced impressive results by leveraging the generation capabilities of pre-trained Text-to-Image (T2I) models. However, previous NVS approaches require extra optimization to use other plug-and-play image generation modules such as ControlNet and LoRA, as they fine-tune the T2I parameters. In this study, we propose an efficient plug-and-play adaptation module, NVS-Adapter, that is compatible with existing plug-and-play modules without extensive fine-tuning. We introduce target views and reference view alignment to improve the geometric consistency of multi-view predictions. Experimental results demonstrate the compatibility of our NVS-Adapter with existing plug-and-play modules. Moreover, our NVS-Adapter shows superior performance over state-of-the-art methods on NVS benchmarks although it does not fine-tune billions of parameters of the pre-trained T2I models.
Despite increasing progress in the development of methods for generating visual counterfactual explanations, especially with the recent rise of Denoising Diffusion Probabilistic Models, previous works treat them as an entirely local technique. In this work, we take a first step toward globalizing them. Specifically, we discover that the latent space of Diffusion Autoencoders encodes the inference process of a given classifier in the form of global directions. We propose a novel proxy-based approach that discovers two types of these directions using only a single image in an entirely black-box manner. Precisely, g-directions allow for flipping the decision of a given classifier on an entire dataset of images, while h-directions further increase the diversity of explanations. We refer to them in general as Global Counterfactual Directions (GCDs). Moreover, we show that GCDs can be naturally combined with Latent Integrated Gradients, resulting in a new black-box attribution method, while simultaneously enhancing the understanding of counterfactual explanations. We validate our approach on existing benchmarks and show that it generalizes to real-world use-cases.
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
Donghoon Ahn · Hyoungwon Cho · Jaewon Min · Jungwoo Kim · Wooseok Jang · SeonHwa Kim · Hyun Hee Park · Kyong Hwan Jin · Seungryong Kim
Diffusion models can generate high-quality samples, but their quality is highly reliant on guidance techniques such as classifier guidance (CG) and classifier-free guidance (CFG), which are inapplicable in unconditional generation. Inspired by the semantic awareness capabilities of self-attention mechanisms, we present Perturbed-Attention Guidance (PAG), a method that enhances the structure of generated samples. This is done by creating a degraded output, obtained by substituting the self-attention map with an identity matrix, so that the sampling process can be guided using those degraded samples. As a result, in both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios without additional training. Moreover, PAG significantly improves the performance in downstream tasks where existing guidance cannot be fully utilized, such as inverse problems (super-resolution, deblurring, etc.) and ControlNet with empty prompts.
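The guidance rule itself is a one-liner; the sketch below is an illustration with an assumed default scale, and the identity-attention substitution it refers to would be implemented inside the network's self-attention layers.

```python
def perturbed_attention_guidance(eps_normal, eps_perturbed, pag_scale=3.0):
    """Guidance from a deliberately degraded prediction: eps_perturbed comes
    from the same network with its self-attention maps replaced by the
    identity matrix (each query attends only to itself, so attention reduces
    to the value projection). Steering away from the degraded prediction
    sharpens structure with or without class/text conditioning."""
    return eps_normal + pag_scale * (eps_normal - eps_perturbed)
```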
Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization
Tao Yang · Rongyuan Wu · Peiran Ren · Xuansong Xie · Yabin Zhang
Diffusion models have demonstrated impressive performance in various image generation, editing, enhancement and translation tasks. In particular, the pre-trained text-to-image stable diffusion models provide a potential solution to the challenging realistic image super-resolution (Real-ISR) and image stylization problems with their strong generative priors. However, the existing methods along this line often fail to keep faithful pixel-wise image structures. If extra skip connections are used to reproduce details, additional training in image space will be required, limiting the application to tasks in latent space such as image stylization. In this work, we propose a pixel-aware stable diffusion (PASD) network to achieve robust Real-ISR and personalized image stylization. Specifically, a pixel-aware cross-attention module is introduced to enable diffusion models to perceive local image structures at the pixel level, while a degradation removal module is used to extract degradation-insensitive features to guide the diffusion process together with high-level image information. An adjustable noise schedule is introduced to further improve the image restoration results. By simply replacing the base diffusion model with a stylized one, PASD can generate diverse stylized images without collecting paired training data, and by shifting the base model with an aesthetic one, PASD can bring old photos back to life. Extensive experiments in a variety of image enhancement and stylization tasks demonstrate the effectiveness of our proposed PASD approach.
AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation
Zanlin Ni · Yulin Wang · Renping Zhou · Rui Lu · Jiayi Guo · Jinyi Hu · Zhiyuan Liu · Yuan Yao · Gao Huang
Recent studies have demonstrated the effectiveness of token-based methods for visual content generation. As a representative work, non-autoregressive Transformers (NATs) are able to synthesize images with decent quality in a small number of steps. However, NATs usually necessitate configuring a complicated generation policy comprising multiple manually-designed scheduling rules. These heuristic-driven rules are prone to sub-optimality and require expert knowledge and labor-intensive effort. Moreover, their one-size-fits-all nature cannot flexibly adapt to the diversified characteristics of each individual sample. To address these issues, we propose AdaNAT, a learnable approach that automatically configures a suitable policy tailored for every sample to be generated. Specifically, we formulate the determination of generation policies as a Markov decision process. Under this framework, a lightweight policy network for generation can be learned via reinforcement learning. Importantly, we demonstrate that simple reward designs, such as FID or pre-trained reward models, may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of policy networks effectively. Comprehensive experiments on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M, validate the effectiveness of AdaNAT. All the code and models will be released after acceptance.
Beta-Tuned Timestep Diffusion Model
Tianyi Zheng · Peng-Tao Jiang · Ben Wan · Hao Zhang · Jinwei Chen · Jia Wang · Bo Li
Diffusion models have received a lot of attention in the field of generation due to their ability to produce high-quality samples. However, several recent studies indicate that treating all distributions equally in diffusion model training is sub-optimal. In this paper, we conduct an in-depth theoretical analysis of the forward process of diffusion models. Our findings reveal that the distribution variations are non-uniform throughout the diffusion process, and the most drastic variations in distribution occur in the initial stages. Consequently, a simple uniform timestep sampling strategy fails to align with these properties, potentially leading to sub-optimal training of diffusion models. To address this, we propose the Beta-Tuned Timestep Diffusion Model (B-TTDM), which devises a timestep sampling strategy based on the beta distribution. By choosing the correct parameters, B-TTDM aligns the timestep sampling distribution with the properties of the forward diffusion process. Extensive experiments on different benchmark datasets validate the effectiveness of B-TTDM.
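A minimal sketch of beta-distributed timestep sampling is shown below; the concrete shape parameters are placeholders rather than the paper's tuned values, chosen only so that the density concentrates on the early, rapidly changing part of the forward process.

```python
import torch

def sample_timesteps_beta(batch_size, num_train_steps=1000, alpha=0.5, beta=1.5):
    """Illustrative beta-distributed timestep sampling. With alpha < 1 the
    density concentrates near t = 0, i.e. the early forward-process steps
    where the data distribution changes most rapidly."""
    u = torch.distributions.Beta(alpha, beta).sample((batch_size,))
    return (u * (num_train_steps - 1)).long()
```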
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
Lanqing Guo · Yingqing He · Haoxin Chen · Menghan Xia · Xiaodong Cun · Yufei Wang · Siyu Huang · Yong Zhang · Xintao Wang · Qifeng Chen · Ying Shan · Bihan Wen
Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a 5× training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation
Yeongtak Oh · Jonghyun Lee · Jooyoung Choi · Dahuin Jung · Uiwon Hwang · Sungroh Yoon
Test-time adaptation (TTA) addresses the unforeseen distribution shifts occurring during test time. In TTA, performance as well as memory and time consumption are crucial considerations. A recent diffusion-based TTA approach for restoring corrupted images involves image-level updates. However, using pixel-space diffusion significantly increases resource requirements compared to conventional model-updating TTA approaches, revealing limitations as a TTA method. To address this, we propose a novel TTA method by leveraging a latent diffusion model (LDM)-based image editing model and fine-tuning it with our newly introduced corruption modeling scheme. This scheme enhances the robustness of the diffusion model against distribution shifts by creating (clean, corrupted) image pairs and fine-tuning the model to edit corrupted images into clean ones. Moreover, we introduce a distilled variant to accelerate the model for corruption editing using only 4 network function evaluations (NFEs). We extensively validated our method across various architectures and datasets including image and video domains. Our model achieves the best performance with a 100 times faster runtime than that of a diffusion-based baseline. Furthermore, it runs three times faster than the data-augmentation-based model-updating TTA method, rendering an image-level updating approach more practical.
InstructIR: High-Quality Image Restoration Following Human Instructions
Marcos Conde · Gregor Geigle · Radu Timofte
Image restoration is a fundamental problem that involves recovering a high-quality clean image from its degraded observation. All-In-One image restoration models can effectively restore images from various types and levels of degradation using degradation-specific information as prompts to guide the restoration model. In this work, we present the first approach that uses human-written instructions to guide the image restoration model. Given natural language prompts, our model can recover high-quality images from their degraded counterparts, considering multiple degradation types. Our method, InstructIR, achieves state-of-the-art results on several restoration tasks including image denoising, deraining, deblurring, dehazing, and (low-light) image enhancement. InstructIR improves +1dB over previous all-in-one restoration methods. Moreover, our dataset and results represent a novel benchmark for new research on text-guided image restoration and enhancement.
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
Xuan JU · Xian Liu · Xintao Wang · Yuxuan Bian · Ying Shan · Qiang Xu
Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs). Despite these advancements, current DM adaptations for inpainting, which involve modifications to the sampling strategy or the development of inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. Addressing these challenges, our work introduces a novel paradigm: the division of masked image features and noisy latent into separate branches. This division dramatically diminishes the model's learning load, facilitating a nuanced incorporation of essential masked image information in a hierarchical fashion. Herein, we present BrushNet, a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM, guaranteeing coherent and enhanced image inpainting outcomes. Additionally, we introduce BrushData and BrushBench to facilitate segmentation-based inpainting training and performance assessment. Our extensive experimental analysis demonstrates BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence.
Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models
Jiaqi Xu · Mengyang Wu · Xiaowei Hu · Chi-Wing Fu · Qi Dou · Pheng-Ann Heng
This paper addresses the limitations of existing adverse weather image restoration methods trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework utilizing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach involves assessing image clarity and providing semantics using vision-language models on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real-world data, employing a dual-step strategy with pseudo-labels generated by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an efficient training strategy to alleviate computational burdens. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art approaches.
OneRestore: A Universal Restoration Framework for Composite Degradation
Yu Guo · Yuan Gao · Yuxu Lu · Huilin Zhu · Wen Liu · Shengfeng He
In real-world scenarios, image impairments often manifest as composite degradations, presenting a complex interplay of elements such as low light, haze, rain, and snow. Despite this reality, existing restoration methods typically target isolated degradation types, thereby falling short in environments where multiple degrading factors coexist. To bridge this gap, our study proposes a versatile imaging model that consolidates four physical corruption paradigms to accurately represent complex, composite degradation scenarios. In this context, we propose OneRestore, a novel transformer-based framework designed for adaptive, controllable scene restoration. The proposed framework leverages a unique cross-attention mechanism, merging degraded scene descriptors with image features, allowing for nuanced restoration. Our model allows versatile input scene descriptors, ranging from manual text embeddings to automatic extractions based on visual attributes. Our methodology is further enhanced through a composite degradation restoration loss, using extra degraded images as negative samples to fortify model constraints. Comparative results on synthetic and real-world datasets demonstrate OneRestore as a superior solution, significantly advancing the state-of-the-art in addressing complex, composite degradations.
UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt
Xin Li · Bingchen Li · Yeying Jin · Cuiling Lan · Hanxin Zhu · Yulin Ren · Zhibo Chen
Compressed Image Super-resolution (CSR) aims to simultaneously super-resolve the compressed images and tackle the challenging hybrid distortions caused by compression. However, existing works on CSR usually focus on a single compression codec, i.e., JPEG, ignoring the diverse traditional or learning-based codecs used in practical applications, e.g., HEVC, VVC, HIFIC, etc. In this work, we propose the first universal CSR framework, dubbed UCIP, with dynamic prompt learning, intending to jointly support the CSR distortions of any compression codecs/modes. Particularly, an efficient dynamic prompt strategy is proposed to mine the content/spatial-aware task-adaptive contextual information for the universal CSR task, using only a small number of prompts with spatial size 1x1. To simplify contextual information mining, we introduce a novel MLP-like backbone for our UCIP by adapting the Active Token Mixer (ATM) to CSR tasks for the first time, where global information modeling is performed only in the horizontal and vertical directions with offset prediction. We also build an all-in-one benchmark dataset for the CSR task by collecting datasets covering 6 popular traditional and learning-based codecs, including JPEG, HEVC, VVC, HIFIC, etc., resulting in 23 common degradations. Extensive experiments have shown the consistent and excellent performance of our UCIP on universal CSR tasks.
Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution
Yuehan Zhang · Seungjun Lee · Angela Yao
Standard single-image super-resolution relies on creating paired training data from high-resolution images through static downsampling kernels. However, real-world super-resolution (RWSR) faces unknown degradations in the low-resolution inputs, while paired training data is lacking. Existing methods approach this problem by learning blind general models through complex synthetic augmentation on training inputs; they sacrifice performance for broader generalization to many possible degradations. We address unsupervised RWSR from a distillation perspective and introduce a novel pairwise distance distillation framework. Our framework adapts a model specialized in specific synthetic degradations to target real-world degradations by distilling intra- and inter-model distances across the specialized model and an auxiliary generalized model. Experiments on diverse datasets demonstrate that our method significantly enhances fidelity and perceptual quality, surpassing state-of-the-art approaches in RWSR.
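One generic way to realize a pairwise distance distillation term is sketched below; the feature shapes, the L1 matching, and the use of a frozen generalized model are assumptions made for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def pairwise_distance_distillation(feat_specialized, feat_generalized):
    """Match pairwise feature distances of the adapted (specialized) model to
    those of a frozen generalized model, so relative relations among samples
    are preserved while the specialized model adapts to real degradations.
    feat_*: [batch, dim] feature vectors from the two models."""
    def pdist(f):
        return torch.cdist(f, f, p=2)          # [batch, batch] pairwise distances
    return F.l1_loss(pdist(feat_specialized), pdist(feat_generalized).detach())
```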
When Fast Fourier Transform Meets Transformer for Image Restoration
xingyu jiang · Xiuhui Zhang · Ning Gao · Yue Deng
Natural images can suffer from various degradation phenomena caused by adverse atmospheric conditions or unique degradation mechanisms. Such diversity makes it challenging to design a universal framework for various restoration tasks. Instead of exploring the commonality across different degradation phenomena, existing image restoration methods focus on the modification of network architecture under limited restoration priors. In this work, we first review various degradation phenomena from a frequency perspective as prior. Based on this, we propose an efficient image restoration framework, dubbed SFHformer, which incorporates the Fast Fourier Transform mechanism into the Transformer architecture. Specifically, we design a dual-domain hybrid structure for multi-scale receptive field modeling, in which the spatial domain and the frequency domain focus on local modeling and global modeling, respectively. Moreover, we design unique positional encoding and frequency dynamic convolution for each frequency component to extract rich frequency-domain features. Extensive experiments on twenty-three restoration datasets for a range of eight restoration tasks such as deraining, dehazing, deblurring, desnowing and underwater/low-light enhancement demonstrate that our SFHformer surpasses the state-of-the-art approaches and achieves a favorable trade-off between performance, parameter size and computational cost. We will release the code after acceptance.
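The sketch below shows a generic frequency-domain mixing block in the spirit of pairing an FFT branch with Transformer-style processing; the fixed input resolution, the per-bin complex weights, and the module name are assumptions, not the SFHformer design.

```python
import torch
import torch.nn as nn

class FrequencyMixer(nn.Module):
    """Global modeling in the frequency domain: real 2D FFT, per-bin learned
    complex modulation (global receptive field), inverse FFT. A parallel
    convolutional branch would handle local, spatial-domain modeling. The
    spatial resolution is fixed at construction for simplicity."""
    def __init__(self, channels, height, width):
        super().__init__()
        w_freq = width // 2 + 1                 # rfft2 keeps non-negative x-frequencies
        self.weight = nn.Parameter(torch.randn(channels, height, w_freq, 2) * 0.02)

    def forward(self, x):                       # x: [B, C, H, W] real features
        spec = torch.fft.rfft2(x, norm="ortho") # complex [B, C, H, W//2+1]
        spec = spec * torch.view_as_complex(self.weight)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```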
Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation
Yushun Tang · Shuoshuo Chen · Zhihe Lu · Xinchao Wang · Zhihai He
Transformer-based methods have achieved remarkable success in various machine learning tasks. How to design efficient test-time adaptation methods for transformer models becomes an important research task. In this work, motivated by the dual-subband wavelet lifting scheme developed in multi-scale signal processing which is able to efficiently separate the input signals into principal components and noise components, we introduce a dual-path token lifting for domain shift correction in test time adaptation. Specifically, we introduce an extra token, referred to as \textit{domain shift token}, at each layer of the transformer network. We then perform dual-path lifting with interleaved token prediction and update between the path of domain shift tokens and the path of class tokens at all network layers. The prediction and update networks are learned in an adversarial manner. Specifically, the task of the prediction network is to learn the residual noise of domain shift which should be largely invariant across all classes and all samples in the target domain. In other words, the predicted domain shift noise should be indistinguishable between all sample classes. On the other hand, the task of the update network is to update the class tokens by removing the domain shift from the input image samples so that input samples become more discriminative between different classes in the feature space. To effectively learn the prediction and update networks with two adversarial tasks, both theoretically and practically, we demonstrate that it is necessary to use smooth optimization for the update network but non-smooth optimization for the prediction network. Experimental results on the benchmark datasets demonstrate that our proposed method significantly improves the online fully test-time domain adaptation performance.
SuperGaussian: Repurposing Video Models for 3D Super Resolution
Yuan Shen · Duygu Ceylan · Paul Guerrero · Zexiang Xu · Niloy Mitra · Shenlong Wang · Anna Fruehstueck
We present a simple, modular, and generic method that upsamples coarse 3D models by adding geometric and appearance details. While generative 3D models now exist, they do not yet match the quality of their counterparts in image and video domains. We demonstrate that it is possible to directly repurpose existing (pre-trained) video models for 3D super-resolution and thus sidestep the problem of the shortage of large repositories of high-quality 3D training models. We describe how to repurpose video upsampling models -- which are not 3D consistent -- and combine them with 3D consolidation to produce 3D-consistent results. As output, we produce high-quality Gaussian Splat models, which are object-centric and effective. Our method is category-agnostic and can be easily incorporated into existing 3D workflows. We evaluate our proposed SuperGaussian on a variety of 3D inputs, which are diverse both in terms of complexity and representation (e.g., Gaussian Splats or NeRFs), and demonstrate that our simple method significantly improves the fidelity of the final 3D models.
Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image Denoisers
Zixuan Fu · Lanqing Guo · Chong Wang · Yufei Wang · Zhihao Li · Bihan Wen
Recent advancements in deep learning have shown impressive results in image and video denoising, leveraging extensive pairs of noisy and noise-free data for supervision. However, the challenge of acquiring paired videos for dynamic scenes hampers the practical deployment of deep video denoising techniques. In contrast, this obstacle is less pronounced in image denoising, where paired data is more readily available. In this paper, we propose a novel unsupervised video denoising framework, named ``Temporal As a Plugin'' (TAP), which integrates tunable temporal modules into a pre-trained image denoiser. By incorporating the plug-and-play strategy, our TAP model can harness temporal information across noisy frames, complementing its power of spatial denoising. Furthermore, we introduce a progressive fine-tuning strategy that refines each temporal module using the generated pseudo-clean video frames, progressively enhancing the network's denoising performance. Compared to other unsupervised video denoising methods, our framework demonstrates superior performance on both sRGB and raw video denoising datasets.
Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients
Dohyung Kim · Junghyup Lee · Jeimin Jeon · JAEHYEON MOON · BUMSUB HAM
Network quantization generally converts full-precision weights and/or activations into low-bit fixed-point values in order to accelerate the inference process. Recent approaches to network quantization further discretize the gradients into low-bit fixed-point values, enabling efficient training. They typically set a quantization interval using a min-max range of the gradients or adjust the interval such that the quantization error over all gradients is minimized. In this paper, we analyze the quantization error of gradients for low-bit fixed-point training, and show that lowering the error for large-magnitude gradients boosts the quantization performance significantly. Based on this, we derive an upper bound of quantization error for the large gradients in terms of the quantization interval, and obtain an optimal condition for the interval minimizing the quantization error for large gradients. We also introduce an interval update algorithm that adjusts the quantization interval adaptively to maintain a small quantization error for large gradients. Experimental results demonstrate the effectiveness of our quantization method for various combinations of network architectures and bit-widths on various tasks, including image classification, object detection, and super-resolution. For comparison, our code will be made publicly available at the time of publication.
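A minimal sketch of interval-based low-bit gradient quantization is given below; the symmetric uniform quantizer and the way the interval would be adapted online are illustrative assumptions rather than the paper's derived update rule.

```python
import torch

def quantize_gradient(grad, interval, num_bits=4):
    """Symmetric uniform fixed-point quantization of gradients with a clipping
    interval: values inside [-interval, interval] are rounded to the grid,
    values outside are clipped. Adapting the interval online keeps the
    quantization error of large-magnitude gradients small."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 7 for signed INT4
    step = interval / qmax
    q = torch.clamp(torch.round(grad / step), -qmax - 1, qmax)
    return q * step                             # dequantized (simulated) gradient
```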
Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems
Ziyuan Luo · Boxin Shi · Haoliang Li · Renjie Wan
Electromagnetic Inverse Scattering Problems (EISP) have gained wide applications in computational imaging. By solving EISP, the internal relative permittivity of the scatterer can be non-invasively determined based on the scattered electromagnetic fields. Despite previous efforts to address EISP, achieving better solutions to this problem has remained elusive, due to the challenges posed by inversion and discretization. This paper tackles those challenges in EISP via an implicit approach. By representing the scatterer's relative permittivity as a continuous implicit representation, our method is able to address the low-resolution problems arising from discretization. Further, optimizing this implicit representation within a forward framework allows us to conveniently circumvent the challenges posed by inverse estimation. Our approach outperforms existing methods on standard benchmark datasets.
Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network
Chenhao Zhang · WEI GAO
Neural Video Compression (NVC) has achieved remarkable performance in recent years. However, precise rate control remains a challenge due to the inherent limitations of learning-based codecs. To solve this issue, we propose a dynamic video compression framework designed for variable bitrate scenarios. First, to implement variable bitrates, we propose the Dynamic-Route Autoencoder (DRA) with variable coding routes, each occupying part of the computational complexity of the whole network and targeting a distinct RD trade-off. Second, to approach the target bitrate, the Rate Control Agent estimates the bitrate of each route and adjusts the coding route of the DRA at run time. To encompass a broad spectrum of variable bitrates while preserving overall RD performance, we employ the Joint-Routes Optimization strategy, achieving collaborative training of various routes. Extensive experiments on the HEVC and UVG datasets show that the proposed method achieves an average BD-Rate reduction of 14.8% and BD-PSNR gain of 0.47dB over state-of-the-art methods while maintaining an average bitrate error of 1.66%, achieving Rate-Distortion-Complexity Optimization (RDCO) for various bitrate and bitrate-constrained applications.
Spike-Temporal Latent Representation for Energy-Efficient Event-to-Video Reconstruction
Jianxiong Tang · Jian-Huang Lai · Lingxiao Yang · Xiaohua Xie
Event-to-Video (E2V) reconstruction aims to recover grayscale video from neuromorphic event streams, and Spiking Neural Networks (SNNs) are potential energy-efficient models for solving the reconstruction problem. Event voxels are an efficient representation for compressing event streams for E2V reconstruction, but their temporal latent representation is rarely considered in SNN-based reconstruction. In this paper, we propose a spike-temporal latent representation (STLR) model for SNN-based E2V reconstruction. The STLR solves the temporal latent coding of event voxels for video frame reconstruction. It is composed of two cascaded SNNs: a) a Spike-based Voxel Temporal Encoder (SVT) and b) a U-shape SNN Decoder. The SVT is a spike-driven spatial unfolding network with a specially designed coding dynamic. It encodes the event voxel into layer-wise spiking features for latent coding, approximating the fixed point of the Iterative Shrinkage-Thresholding Algorithm. Then, the U-shape SNN decoder reconstructs the video based on the encoded spikes. Experimental results show that the STLR achieves comparable performance to the popular SNNs on IJRR, HQF, and MVSEC, and significantly improves energy efficiency. For example, the STLR achieves comparable performance with only 13.20% of the parameters and 3.33%-5.03% of the energy cost of PA-EVSNN.
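For reference, the ISTA iteration whose fixed point the spiking encoder is said to approximate can be written as follows; the dictionary `A`, the sparsity weight, and the step size are generic placeholders, not values from the paper.

```python
import torch

def ista(x, A, lam=0.1, num_iters=50):
    """Reference ISTA iteration: computes a sparse code z minimizing
    0.5 * ||A z - x||^2 + lam * ||z||_1.
    x: [m] observation, A: [m, n] dictionary."""
    step = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2     # 1 / Lipschitz constant
    z = torch.zeros(A.shape[1], dtype=x.dtype, device=x.device)
    for _ in range(num_iters):
        r = z - step * (A.t() @ (A @ z - x))                 # gradient step on the data term
        z = torch.sign(r) * torch.clamp(r.abs() - step * lam, min=0.0)  # soft threshold
    return z
```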
Exploring Vulnerabilities in Spiking Neural Networks: Direct Adversarial Attacks on Raw Event Data
Yanmeng Yao · Xiaohan Zhao · Bin Gu
In the field of computer vision, event-based Dynamic Vision Sensors (DVSs) have emerged as a significant complement to traditional pixel-based imaging due to their low power consumption and high temporal resolution. These sensors, particularly when combined with spiking neural networks (SNNs), offer a promising direction for energy-efficient and fast-reacting vision systems. Typically, DVS data are converted into grid-based formats for processing with SNNs, with this transformation process often being an opaque step in the pipeline. As a result, the grid representation becomes an intermediate yet inaccessible stage during the implementation of attacks, highlighting the importance of attacking raw event data. Existing attack methodologies predominantly target grid-based representations, hindered by the complexity of three-valued optimization and the broad optimization space associated with raw event data. Our study addresses this gap by introducing a novel adversarial attack approach that directly targets raw event data. We tackle the inherent challenges of three-valued optimization and the need to preserve data sparsity through a strategic amalgamation of methods: 1) Treating Discrete Event Values as Probabilistic Samples: This allows for continuous optimization by considering discrete event values as probabilistic space samples. 2) Focusing on Specific Event Positions: We prioritize specific event positions that merge original data with additional target label data, enhancing attack precision. 3) Employing a Sparsity Norm: To retain the original data's sparsity, a sparsity norm is utilized, ensuring the adversarial data's comparability. Our empirical findings demonstrate the effectiveness of our combined approach, achieving noteworthy success in targeted attacks and highlighting vulnerabilities in models based on raw event data.
A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks
Feiyu CHEN · Wei Lin · Ziquan Liu · Antoni Chan
Imperceptible watermarks are essential in safeguarding the content authenticity and the rights of creators in imagery. Recently, several leading approaches, notably zero-bit watermarking, have demonstrated impressive imperceptibility and robustness in image watermarking. However, these methods have security weaknesses, e.g., the risk of counterfeiting and the ease of erasing an existing watermark with another watermark, while also lacking a statistical guarantee regarding the detection performance. To address these issues, we propose a novel framework to train a secret key network (SKN), which serves as a non-duplicable safeguard for securing the embedded watermark. The SKN is trained so that its output for natural images obeys a standard multivariate normal distribution. To embed a watermark, we apply an adversarial attack (a modified PGD attack) on the image such that the SKN produces a secret key signature (SKS) with a longer length. We then derive two hypothesis tests to detect the presence of the watermark in an image via the SKN response magnitude and the SKS angle, which offer a statistical guarantee of the false positive rate. Our extensive empirical study demonstrates that our framework maintains robustness comparable to existing methods and excels in security and imperceptibility.
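The magnitude test can be illustrated with elementary statistics: if the SKN output for natural images follows a standard multivariate normal, its squared norm is chi-square distributed, which yields a p-value and hence a false-positive-rate bound. The sketch below covers only this magnitude test (the angle test on the SKS is omitted) and the threshold is an assumed value.

```python
import numpy as np
from scipy.stats import chi2

def magnitude_test(skn_output, alpha=1e-4):
    """Under the no-watermark hypothesis the SKN output follows N(0, I_d), so
    its squared norm follows a chi-square distribution with d degrees of
    freedom; the tail probability gives a p-value, and rejecting when
    p < alpha bounds the false positive rate by alpha."""
    y = np.asarray(skn_output, dtype=np.float64).ravel()
    p_value = chi2.sf(float(np.dot(y, y)), df=y.size)   # P(chi2_d >= ||y||^2)
    return p_value < alpha, p_value
```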
Skeleton Recall Loss for Connectivity Conserving and Resource Efficient Segmentation of Thin Tubular Structures
Yannick Kirchhoff · Maximilian Rokuss · Saikat Roy · Balint Kovacs · Constantin Ulrich · Tassilo Wald · Maximilian Zenk · Philipp Vollmuth · Jens Kleesiek · Fabian Isensee · Klaus H. Maier-Hein
Accurately segmenting thin tubular structures, such as vessels, nerves, roads or concrete cracks, is a crucial task in computer vision. Standard deep learning-based segmentation loss functions, such as Dice or Cross-Entropy, focus on volumetric overlap, often at the expense of preserving structural connectivity or topology. This can lead to segmentation errors that adversely affect downstream tasks, including flow calculation, navigation, and structural inspection. Although current topology-focused losses mark an improvement, they introduce significant computational and memory overheads. This is particularly relevant for 3D data, rendering these losses infeasible for larger volumes as well as increasingly important multi-class segmentation problems. To mitigate this, we propose a novel Skeleton Recall Loss, which effectively addresses these challenges by circumventing intensive GPU-based calculations with inexpensive CPU operations. It demonstrates overall superior performance to current state-of-the-art approaches on five public datasets for topology-preserving segmentation, while substantially reducing computational overheads by more than 90%. In doing so, we introduce the first multi-class capable loss function for thin structure segmentation, excelling in both efficiency and efficacy for topology-preservation. Our code is available to the community, providing a foundation for further advancements, at: www.github.com/anonymous.
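The abstract's key efficiency claim is that connectivity supervision can be moved to cheap CPU operations. A minimal sketch of that idea, assuming a 2D, single-foreground-class setting: skeletonize the ground-truth mask on the CPU with scikit-image, then compute a soft recall of that skeleton under the predicted probabilities. The exact loss in the paper (e.g., skeleton tubing, multi-class handling) may differ.

```python
import torch
from skimage.morphology import skeletonize

def skeleton_recall_loss(pred_prob, gt_mask, eps=1e-6):
    """Soft recall of the ground-truth skeleton under predicted foreground probabilities.
    `pred_prob`: (H, W) torch tensor of probabilities; `gt_mask`: (H, W) numpy binary mask.
    Skeletonization runs on the CPU; only cheap element-wise ops touch the prediction tensor."""
    skel = skeletonize(gt_mask.astype(bool))                         # thin the tubular GT to a 1-px skeleton
    skel_t = torch.from_numpy(skel).to(pred_prob.device).float()
    recall = (pred_prob * skel_t).sum() / (skel_t.sum() + eps)       # fraction of skeleton covered
    return 1.0 - recall                                              # minimize to keep the skeleton covered
```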
Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection
Christos Koutlis · Symeon Papadopoulos
The recently developed and publicly available synthetic image generation methods and services make it possible to create extremely realistic imagery on demand, raising great risks for the integrity and safety of online information. State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence on the advantages of feature extraction from pre-trained foundation models. However, such extracted features mostly encapsulate high-level visual semantics instead of fine-grained details, which are more important for the SID task. In contrast, shallow layers encode low-level visual information. In this work, we leverage the image representations extracted by intermediate Transformer blocks of CLIP's image-encoder via a lightweight network that maps them to a learnable forgery-aware vector space capable of generalizing exceptionally well. We also employ a trainable module to incorporate the importance of each Transformer block to the final prediction. Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement. Notably, the best-performing models require just a single epoch for training (~8 minutes). Code is provided as supplementary material for the review process and will be made publicly available upon acceptance.
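A hedged sketch of the overall recipe described above: collect features from several intermediate Transformer blocks, weight them with a small trainable importance module, and map the fused representation to a forgery score. The module below assumes the per-block features are already extracted as (B, D) tensors (e.g., CLS tokens collected via forward hooks); the projection and head sizes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class IntermediateBlockDetector(nn.Module):
    """Fuse features from several intermediate ViT blocks with learnable per-block
    importance weights, then map the result to a single real/fake logit."""

    def __init__(self, dim, n_blocks):
        super().__init__()
        self.block_logits = nn.Parameter(torch.zeros(n_blocks))  # learnable block importance
        self.proj = nn.Linear(dim, dim)                          # shared lightweight projection
        self.head = nn.Linear(dim, 1)

    def forward(self, block_feats):
        # block_feats: list of (B, D) tensors, one per intermediate block
        w = torch.softmax(self.block_logits, dim=0)
        fused = sum(wi * self.proj(f) for wi, f in zip(w, block_feats))
        return self.head(torch.relu(fused)).squeeze(-1)          # forgery logit per image
```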
Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing
SI-QI LIU · Qirui Wang · Pong Chi Yuen
Face anti-spoofing (FAS), which plays an important role in securing face recognition systems, has been attracting increasing attention. Recently, the popular vision-language model CLIP has been proven effective for FAS, where outstanding performance can be achieved by simply transferring the class label into a textual prompt. In this work, we aim to improve the generalization ability of CLIP-based FAS from a prompt learning perspective. Specifically, we propose a Bottom-Up Domain Prompt Tuning method (BUDoPT) that covers different levels of domain variance, including the domain of recording settings and the domain of attack types. To handle domain discrepancies in recording settings, we design a context-aware adversarial domain-generalized prompt learning strategy that can learn domain-invariant prompts. For the spoofing domain with different attack types, we construct a fine-grained textual prompt that guides CLIP to look through the subtle details of different attack instruments. Extensive experiments are conducted on five FAS datasets with a large number of variations (camera types, resolutions, image qualities, lighting conditions, and recording environments). The effectiveness of our proposed method is evaluated with different amounts of source domains from multiple angles, where we improve generalizability over the state of the art with multiple training datasets or with only one dataset.
Real Appearance Modeling for More General Deepfake Detection
Jiahe Tian · Yu Cai · Xi Wang · Peng Chen · Zihao Xiao · Jiao Dai · Yesheng Chai · Jizhong Han
Recent studies in deepfake detection have shown promising results when detecting deepfakes of the same type as those present in training. However, their ability to generalize to unseen deepfakes remains limited. This work improves generalizable deepfake detection based on a simple principle: an ideal detector classifies any face that contains anomalies not found in real faces as fake. Namely, detectors should learn consistent real appearances rather than fake patterns in the training set that may not apply to unseen deepfakes. Guided by this principle, we propose a learning task named Real Appearance Modeling (RAM) that guides the model to learn real appearances by recovering original faces from slightly disturbed faces. We further propose Face Disturbance to produce disturbed faces while preserving original information that enables recovery, which aids the model in learning the fine-grained appearance of real faces. Extensive experiments demonstrate the effectiveness of modeling real appearances to spot a wider range of deepfakes. Our method surpasses existing state-of-the-art methods by a large margin on multiple popular deepfake datasets.
SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder
Jaeseong Lee · Junha Hyung · Sohyun Jeong · Choo Jaegul
Face swapping has gained significant attention for its varied applications. Most previous face swapping approaches have relied on the seesaw game training scheme, also known as the target-oriented approach. However, this often leads to instability in model training and results in undesired samples with blended identities due to the target identity leakage problem. Source-oriented methods achieve more stable training with self-reconstruction objective but often fail to accurately reflect target image's skin color and illumination. This paper introduces the Shape Agnostic Masked AutoEncoder (SAMAE) training scheme, a novel self-supervised approach that combines the strengths of both target-oriented and source-oriented approaches. Our training scheme addresses the limitations of traditional training methods by circumventing the conventional seesaw game and introducing clear ground truth through its self-reconstruction training regime. Our model effectively mitigates identity leakage and reflects target albedo and illumination through learned disentangled identity and non-identity features. Additionally, we closely tackle the shape misalignment and volume discrepancy problems with new techniques, including perforation confusion and random mesh scaling. SAMAE establishes a new state-of-the-art, surpassing other baseline methods, preserving both identity and non-identity attributes without sacrificing on either aspect.
Norface: Improving Facial Expression Analysis by Identity Normalization
Hanwei Liu · Rudong An · Zhimeng Zhang · Bowen Ma · Wei Zhang · Yan Song · Yujing Hu · Chen Wei · Yu Ding
Facial Expression Analysis (FEA) remains a challenging task due to unexpected task-irrelevant noise, such as identity, head pose, and background. To address this issue, this paper proposes a novel framework, called Norface, that is unified for both Action Unit (AU) analysis and Facial Emotion Recognition (FER) tasks. Norface consists of a normalization network and a classification network. First, the carefully designed normalization network directly removes the above task-irrelevant noise by maintaining facial expression consistency while normalizing all original images to a common identity with consistent pose and background. These normalized images are then fed into the classification network. Due to the consistent identity and other factors (e.g., head pose, background, etc.), the normalized images enable the classification network to extract useful expression information more effectively. Additionally, the classification network incorporates a Mixture of Experts to refine the latent representation, including handling the input of facial representations and the output of multiple (AU or emotion) labels. Extensive experiments validate the carefully designed framework with the insight of identity normalization. The proposed method outperforms existing SOTA methods in multiple facial expression analysis tasks, including AU detection, AU intensity estimation, and FER tasks, as well as their cross-dataset counterparts. For the normalized datasets and code, please visit the project page.
Open-Set Biometrics: Beyond Good Closed-Set Models
Yiyang Su · Minchul Kim · Feng Liu · Anil Jain · Xiaoming Liu
Biometric recognition has primarily addressed closed-set identification, assuming all probe subjects are known to be in the gallery. However, most practical applications involve open-set biometrics, where probe subjects may or may not be present in the gallery. This poses distinct challenges in effectively distinguishing individuals in the gallery while minimizing false detections. Despite the assumption that powerful biometric models can excel in both closed- and open-set scenarios, existing loss functions are inconsistent with open-set evaluation: they treat genuine (mated) and imposter (non-mated) similarity scores symmetrically and neglect the relative magnitudes of imposter scores. To address these issues, we introduce novel loss functions: (1) the identification-detection loss, optimized for open-set performance under selective thresholds, and (2) relative threshold minimization, which reduces the maximum negative score for each probe. Across diverse biometric tasks, including face recognition, gait recognition, and person re-identification, our experiments demonstrate the effectiveness of the proposed loss functions, significantly enhancing open-set performance while positively impacting closed-set performance. Upon publication, we will release our code and models.
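As an illustration of the "reduce the maximum negative score for each probe" idea, here is a hedged PyTorch sketch of a hinge on the hardest imposter per probe, given a probe-gallery similarity matrix. The margin value and the exact combination with the identification-detection loss are placeholders; this is not the paper's formulation.

```python
import torch

def hardest_imposter_loss(sim, probe_labels, gallery_labels, margin=0.2):
    """Push each probe's maximum non-mated (imposter) similarity below its mated
    similarity by a margin. `sim` is a (P, G) probe-gallery similarity matrix;
    assumes every probe has at least one mated and one non-mated gallery entry."""
    mated = probe_labels.unsqueeze(1).eq(gallery_labels.unsqueeze(0))   # (P, G) genuine pairs
    neg_inf = torch.finfo(sim.dtype).min
    hardest_imposter = sim.masked_fill(mated, neg_inf).max(dim=1).values
    genuine = sim.masked_fill(~mated, neg_inf).max(dim=1).values
    return torch.relu(hardest_imposter - genuine + margin).mean()       # hinge on the hardest negative
```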
Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals
Camilo Fosco · Benjamin Lahner · Bowen Pan · Alex Andonian · Emilie L Josephs · Alex Lascelles · Aude Oliva
The field of brain-to-stimuli reconstruction has seen significant progress in the last few years, but techniques continue to be subject-specific and are usually tested on a single dataset. In this work, we present a novel technique to reconstruct videos from functional Magnetic Resonance Imaging (fMRI) signals designed for performance across datasets and across human participants. Our pipeline accurately generates 2 and 3-second video clips from brain activity coming from distinct participants and different datasets by leveraging multi-dataset and multi-subject training. This helps us regress key latent and conditioning vectors for pretrained text-to-video and video-to-video models to reconstruct accurate videos that match the original stimuli observed by the participant. Key to our pipeline is the introduction of a 3-stage approach that first aligns fMRI signals to semantic embeddings, then regresses important vectors, and finally generates videos with those estimations. Our method demonstrates state-of-the-art reconstruction capabilities verified by qualitative and quantitative analyses, including crowd-sourced human evaluation. We showcase performance improvements across two datasets, as well as in multi-subject setups. Our ablation studies shed light on how different alignment strategies and data scaling decisions impact reconstruction performance, and we hint at a future for zero-shot reconstruction by analyzing how performance evolves as more subject data is leveraged.
PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion
Runsong Zhu · Shi Qiu · Qianyi Wu · Ka-Hei Hui · Pheng-Ann Heng · Chi-Wing Fu
Panoptic lifting is an effective technique to address the 3D panoptic segmentation task by unprojecting 2D panoptic segmentations from multiple views to the 3D scene. However, the quality of its result largely depends on the 2D segmentations, which could be noisy and error-prone, so its performance often drops significantly for complex scenes. In this work, we design a new pipeline coined PCF-Lift, based on our Probabilistic Contrastive Fusion (PCF), which learns and embeds probabilistic features throughout the pipeline to actively account for inaccurate segmentations and inconsistent instance IDs. Technically, we first model the probabilistic feature embeddings through multivariate Gaussian distributions. To fuse the probabilistic features, we incorporate the probability product kernel into the contrastive loss formulation and design a cross-view constraint to enhance feature consistency across different views. For inference, we introduce a new probabilistic clustering method to effectively associate prototype features with the underlying 3D object instances for the generation of consistent panoptic segmentation results. Further, we provide a theoretical analysis to justify the superiority of the proposed probabilistic solution. Through extensive experiments, our PCF-Lift not only significantly outperforms the state-of-the-art methods on widely used benchmarks including the ScanNet dataset and the challenging Messy Room dataset (4.4% improvement in scene-level PQ), but also demonstrates strong robustness when incorporated with various 2D segmentation models or different levels of hand-crafted noise.
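For reference, the probability product kernel mentioned above has a closed form for Gaussians. Below is a small sketch for diagonal Gaussian embeddings (rho = 1, the expected-likelihood kernel), which could serve as the similarity inside a contrastive loss; how the paper weights and normalizes this term is not reproduced here.

```python
import math
import torch

def gaussian_ppk_log(mu1, var1, mu2, var2):
    """Log probability product kernel (rho = 1) between diagonal Gaussians:
    K = integral of N(x; mu1, S1) * N(x; mu2, S2) dx = N(mu1; mu2, S1 + S2).
    mu*, var*: (..., D) tensors of means and per-dimension variances."""
    var = var1 + var2
    return -0.5 * (((mu1 - mu2) ** 2) / var
                   + torch.log(var)
                   + math.log(2.0 * math.pi)).sum(dim=-1)
```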
Enhancing Tracking Robustness with Auxiliary Adversarial Defense Networks
Zhewei Wu · Ruilong Yu · Qihe Liu · Shuying Cheng · Shilin Qiu · Shijie Zhou
Adversarial attacks in visual object tracking have significantly degraded the performance of advanced trackers by introducing imperceptible perturbations into images. However, there is still a lack of research on designing adversarial defense methods for object tracking. To address these issues, we propose an effective auxiliary pre-processing defense network, AADN, which performs defensive transformations on the input images before feeding them into the tracker. Moreover, it can be seamlessly integrated with other visual trackers as a plug-and-play module without parameter adjustments. We train AADN using adversarial training, specifically employing Dua-Loss to generate adversarial samples that simultaneously attack the classification and regression branches of the tracker. Extensive experiments conducted on the OTB100, LaSOT, and VOT2018 benchmarks demonstrate that AADN maintains excellent defense robustness against adversarial attack methods in both adaptive and non-adaptive attack scenarios. Moreover, when transferring the defense network to heterogeneous trackers, it exhibits reliable transferability. Finally, AADN achieves a processing time of up to 5ms/frame, allowing seamless integration with existing high-speed trackers without introducing significant computational overhead. We will make our code publicly available soon.
SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking
Siyuan Li · Lei Ke · Yung-Hsu Yang · Luigi Piccinelli · Mattia Segu · Martin Danelljan · Luc Van Gool
Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns in the large-vocabulary scenarios and unstable classification of the novel objects, the motion and semantics cues are either ignored or applied based on heuristics in the final matching steps by existing methods. In this paper, we present a unified framework SLAck that jointly considers location/motion, semantics and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods significantly for novel classes tracking on the Open-vocabulary MOT and TAO TETA benchmarks. Our code and models will be released.
Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition
Haijun Xiong · Bin Feng · Xinggang Wang · Wenyu Liu
Gait recognition, a biometric technology, aims to distinguish individuals by their walking patterns. However, we reveal that discriminative identity features often become entangled with non-identity clues, posing a challenge for extracting identity features effectively and efficiently in previous methods. To address this challenge, we propose CLTD, a causality-inspired discriminative feature learning module designed to effectively eliminate the influence of confounders in triple domains, i.e., the spatial, temporal, and spectral domains. Specifically, we utilize the Cross Pixel-wise Attention Generator (CPAG) to generate attention distributions for factual and counterfactual features in spatial and temporal domains. Then, we introduce the Fourier Projection Head (FPH) to project spatial features into the spectral space, preserving essential information while reducing computational costs. Furthermore, we employ an optimization method with contrastive learning to enforce semantic consistency constraints across sequences from the same subject. The significant performance improvements on challenging datasets demonstrate the effectiveness of our method. In addition, our method can seamlessly integrate into existing gait recognition methods.
VSViG: Real-time Video-based Seizure Detection via Skeleton-based Spatiotemporal ViG
Yankun Xu · Junzhe Wang · Yun-Hsuan Chen · Jie Yang · Wenjie Ming · Shuang Wang · Mohamad Sawan
An accurate and efficient epileptic seizure onset detection can significantly benefit patients. Traditional diagnostic methods, primarily relying on electroencephalograms (EEGs), often result in cumbersome and non-portable solutions, making continuous patient monitoring challenging. A video-based seizure detection system is expected to free patients from the constraints of scalp or implanted EEG devices and enable remote monitoring in residential settings. Previous video-based methods neither enable all-day monitoring nor provide short detection latency due to insufficient resources and ineffective patient action recognition techniques. Additionally, skeleton-based action recognition approaches still have limitations in identifying subtle seizure-related actions. To address these challenges, we propose a novel Video-based Seizure detection model via a skeleton-based spatiotemporal Vision Graph neural network (VSViG), designed for efficient, accurate, and timely detection in real-time scenarios. Our experimental results indicate VSViG outperforms previous state-of-the-art action recognition models on our collected patients' video data with higher accuracy (5.9% error), lower FLOPs (0.4G), and smaller model size (1.4M). Furthermore, by integrating a decision-making rule that combines output probabilities and an accumulative function, we achieve a 5.1 s detection latency after EEG onset, a 13.1 s detection advance before clinical onset, and a zero false detection rate.
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation
Haoyu Ji · Bowen Chen · Xinglong Xu · Weihong Ren · Zhiyong Wang · Honghai Liu
Skeleton-based Temporal Action Segmentation (STAS) aims at densely segmenting and classifying human actions in long untrimmed skeletal motion sequences. Existing STAS methods primarily model the spatial dependencies among joints and the temporal relationships among frames to generate frame-level one-hot classifications. However, these studies overlook the deep mining of semantic relations among joints as well as actions at a linguistic level, which limits the comprehensiveness of skeleton action understanding. In this work, we propose a Language-assisted Skeleton Action Understanding (LaSA) method, leveraging Large-scale Language Models (LLMs) to assist in learning semantic relationships among joints and actions. Specifically, in terms of joint relationships, the Joint Relationships Establishment (JRE) module establishes correlations among joints in the feature sequence through attention between joint texts and embeds joint texts as position embeddings to differentiate distinct joints. Regarding action relationships, the Action Relationships Supervision (ARS) module enhances the discrimination across action classes through contrastive learning of single-class action-text pairs and temporally models the semantic associations of adjacent actions by contrasting mixed-class clip-text pairs. Performance evaluation on five public datasets demonstrates that LaSA achieves state-of-the-art performance.
Elucidating the Hierarchical Nature of Behavior with Masked Autoencoders
Lucas Stoffl · Andy Bonnetto · Stéphane D'Ascoli · Alexander Mathis
Natural behavior is hierarchical. Yet, there is a paucity of benchmarks addressing this aspect. Recognizing the scarcity of large-scale hierarchical behavioral benchmarks, we create a novel synthetic basketball playing benchmark (Shot7M2). Beyond synthetic data, we extend BABEL into a hierarchical action segmentation benchmark (hBABEL). Then, we develop a masked autoencoder framework (hBehaveMAE) to elucidate the hierarchical nature of motion capture data in an unsupervised fashion. We find that hBehaveMAE learns interpretable latents on Shot7M2 and hBABEL, where lower encoder levels show a superior ability to represent fine-grained movements, while higher encoder levels capture complex actions and activities. Additionally, we evaluate hBehaveMAE on MABe22, a representation learning benchmark with short and long-term behavioral states. hBehaveMAE achieves state-of-the-art performance without domain-specific feature extraction. Together, these components synergistically contribute towards unveiling the hierarchical organization of natural behavior. Models and benchmarks are available at https://github.com/amathislab/BehaveMAE
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition
Ishan Rajendrakumar Dave · Mamshad Nayeem Rizve · Shah Mubarak
Real-life applications of action recognition often require a fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annotate, existing semi-supervised action recognition has mainly focused on coarse-grained action recognition. Since fine-grained actions are more challenging due to the absence of scene bias, classifying these actions requires an understanding of action-phases. Hence, existing coarse-grained semi-supervised methods do not work effectively. In this work, we for the first time thoroughly investigate semi-supervised fine-grained action recognition (FGAR). We observe that alignment distances like dynamic time warping (DTW) provide a suitable action-phase-aware measure for comparing fine-grained actions, a concept previously unexploited in FGAR. However, since regular DTW distance is pairwise and assumes strict alignment between pairs, it is not directly suitable for classifying fine-grained actions. To utilize such alignment distances in a limited-label setting, we propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs. Our learnable alignability score provides a better phase-aware measure, which we use to refine the pseudo-labels of the primary video encoder. Our collaborative pseudo-labeling-based framework 'FinePseudo' significantly outperforms prior methods on four fine-grained action recognition datasets: Diving48, FineGym99, FineGym288, and FineDiving, and shows improvement on existing coarse-grained datasets: Kinetics400 and Something-SomethingV2. We also demonstrate the robustness of our collaborative pseudo-labeling in handling novel unlabeled classes in open-world semi-supervised setups.
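For readers unfamiliar with the alignment distance the method builds on, here is a plain NumPy implementation of classic dynamic time warping (DTW) between two feature sequences. It is the textbook O(T1*T2) recursion, not the paper's learnable alignability score.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping cost between sequences a (T1, D) and b (T2, D)."""
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])        # local frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],           # insertion
                                 cost[i, j - 1],           # deletion
                                 cost[i - 1, j - 1])       # match
    return cost[T1, T2]

# toy usage: two short 1-D "action phase" trajectories
print(dtw_distance(np.array([[0.0], [1.0], [2.0]]),
                   np.array([[0.0], [0.5], [1.0], [2.0]])))
```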
Bayesian Evidential Deep Learning for Online Action Detection
Hongji Guo · Hanjing Wang · Qiang Ji
Online action detection aims at identifying the ongoing action in a streaming video without seeing the future. A timely and reliable response is critical for real-world applications. In this paper, we introduce Bayesian Evidential Deep Learning (BEDL), an efficient and generalizable framework for online action detection and uncertainty quantification. Specifically, we combine Bayesian neural networks and evidential deep learning through a teacher-student architecture. The teacher model is built in a Bayesian manner and transfers its mutual information and distribution to the student model through evidential deep learning. In this way, the student model can make accurate online inferences while efficiently quantifying the uncertainty. Compared to existing evidential deep learning methods, BEDL estimates uncertainty more accurately by leveraging the Bayesian teacher model. In addition, we design an attention module for BEDL that can select important features based on the Bayesian mutual information for online inference. We evaluated BEDL on benchmark datasets including THUMOS'14, TVSeries, and HDD. BEDL achieves competitive performance while keeping inference efficient. Extensive ablation studies demonstrate the effectiveness of each component, and the uncertainty quantification is verified by online anomaly detection experiments using the student model.
This paper introduces a self-supervised learning framework designed for pre-training neural networks tailored to dense prediction tasks using event camera data. Our approach utilizes solely event data for training. Transferring achievements from dense RGB pre-training directly to event camera data yields subpar performance. This is attributed to the spatial sparsity inherent in an event image (converted from event data), where many pixels do not contain information. To mitigate this sparsity issue, we encode an event image into event patch features, automatically mine contextual similarity relationships among patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features. For training our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns. Transfer learning performance on downstream dense prediction tasks illustrates the superiority of our method over state-of-the-art approaches.
Unsupervised Moving Object Segmentation with Atmospheric Turbulence
Dehao Qin · Ripon Saha · Woojeh Chung · Suren Jayasuriya · Jinwei Ye · Nianyi Li
Moving object segmentation in the presence of atmospheric turbulence is highly challenging due to turbulence-induced irregular and time-varying distortions. In this paper, we present an unsupervised approach for segmenting moving objects in videos degraded by atmospheric turbulence. Our key approach is to adopt a detect-then-grow scheme: we first identify a small set of pixels that belong to moving objects with high confidence, then gradually grow a foreground mask from those seeds to segment all moving objects in the scene. In order to disentangle different types of motions, we check the rigid geometric consistency among video frames. We then use the Sampson distance to initialize the seedling pixels. After growing per-frame foreground masks, we use a spatial grouping loss and a temporal consistency loss to further refine the masks in order to ensure their spatio-temporal consistency. Our method is unsupervised and does not require training on labeled data. For validation, we collect and release the first real-captured long-range turbulent video dataset with ground truth masks for moving objects. We evaluate our method both qualitatively and quantitatively on our real dataset. Results show that our method achieves good accuracy in segmenting moving objects and is robust for long-range videos with various turbulence strengths.
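The Sampson distance used to seed the foreground mask has a standard closed form given a fundamental matrix F and point correspondences. Below is a NumPy sketch; it assumes the correspondences are already estimated (e.g., by optical flow or feature matching), which is outside this snippet.

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """Sampson distance of correspondences under a fundamental matrix F.
    F: (3, 3); x1, x2: (N, 2) pixel coordinates in the two frames."""
    ones = np.ones((len(x1), 1))
    p1 = np.hstack([x1, ones])                 # homogeneous coordinates in frame 1
    p2 = np.hstack([x2, ones])                 # homogeneous coordinates in frame 2
    Fx1 = p1 @ F.T                             # rows are F @ x1
    Ftx2 = p2 @ F                              # rows are F^T @ x2
    num = np.sum(p2 * Fx1, axis=1) ** 2        # (x2^T F x1)^2, the epipolar residual
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den                           # small values = consistent with rigid background
```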
Beyond MOT: Semantic Multi-Object Tracking
Yunhao Li · Qin Li · Hao Wang · Xue Ma · Jiali Yao · Shaohua Dong · Heng Fan · Libo Zhang
Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., "where") in videos. Yet, knowing merely "where" is insufficient in many crucial applications. In comparison, semantic understanding such as fine-grained behaviors, interactions, and overall summarized captions (i.e., "what") from videos, associated with "where", is highly desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), which aims to estimate object trajectories and meanwhile understand semantic details of the associated trajectories, including instance captions, instance interactions, and overall video captions, integrating "where" and "what" for tracking. In order to foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for semantic tracking of humans. BenSMOT provides annotations for the trajectories of targets, along with associated instance captions in natural language, instance interactions, and an overall caption for each video sequence. To the best of our knowledge, BenSMOT is the first publicly available benchmark for SMOT. Besides, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting "where" and "what" for SMOT, opening up a new direction in tracking for video understanding. Our BenSMOT and SMOTer will be released.
MRSP: Learn Multi-Representations of Single Primitive for Compositional Zero-Shot Learning
Dongyao Jiang · Hui Chen · Haodong Jing · Yongqiang Ma · Nanning Zheng
Compositional Zero-Shot Learning (CZSL) aims to classify unseen state-object compositions using seen primitives. Previous methods commonly map an identical primitive from different compositions to the same area within embedding space, aiming to establish primitive representation or assess decoding proficiency. However, relying solely on the intersection area of primitive concepts might overlook nuanced semantics due to conditional variance, thereby limiting the model's capacity to generalize to unseen compositions. In contrast, our approach constructs primitive representations by considering the union area of primitives. We propose a Multiple Representation of Single Primitive learning framework (termed MRSP) for CZSL, which captures composition-relevant features through a state-object-composition three-branch cross-attention architecture. Specifically, the input image feature cross-attends to multiple state, object, and composition features and the prediction scores are adaptively adjusted by combining the output of each branch. Extensive experiments on three benchmarks in both closed-world and open-world settings showcase the superior effectiveness of MRSP.
Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition
Shreyank Narayana Gowda · Anurag Arnab · Jonathan Huang
In this paper, we address the challenges posed by the substantial training time and memory consumption associated with video transformers, focusing on the ViViT (Video Vision Transformer) model, in particular the Factorised Encoder version, as our baseline for action recognition tasks. The Factorised Encoder variant follows the late-fusion approach adopted by many state-of-the-art approaches. Despite standing out for its favorable speed/accuracy tradeoffs among the different variants of ViViT, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. This leads to a low-accuracy model if done naively. But we show that by (1) appropriately initializing the temporal transformer (a module responsible for processing temporal information) and (2) introducing a compact adapter model connecting frozen spatial representations (a module that selectively focuses on regions of the input image) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation over 6 benchmarks, we demonstrate that our proposed training strategy significantly reduces training costs (by ) and memory consumption while maintaining or slightly improving performance by up to 1.79% compared to the baseline model. Our approach additionally unlocks the capability to utilize larger image transformer models as our spatial transformer and to access more frames with the same memory consumption. We also show the generalization of this approach to other factorized encoder models. The advancements made in this work have the potential to advance research in the video understanding domain and provide valuable insights for researchers and practitioners with limited resources, paving the way for more efficient and scalable alternatives in the action recognition field.
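A minimal PyTorch sketch of the training recipe described above: freeze the spatial transformer and insert a small trainable bottleneck adapter in front of the temporal transformer. The adapter shape and placement are assumptions for illustration; the paper's exact adapter design and temporal-transformer initialization are not shown.

```python
import torch.nn as nn

def freeze_spatial_add_adapter(spatial_encoder, dim, bottleneck=64):
    """Freeze all parameters of the spatial transformer and return a compact,
    trainable bottleneck adapter meant to sit between the frozen spatial
    representations and the temporal transformer."""
    for p in spatial_encoder.parameters():
        p.requires_grad = False              # no gradients or optimizer state for the spatial tower
    adapter = nn.Sequential(                 # small trainable bridge (dim -> bottleneck -> dim)
        nn.Linear(dim, bottleneck),
        nn.GELU(),
        nn.Linear(bottleneck, dim),
    )
    return adapter
```

Only the adapter, the temporal transformer, and the classification head would receive gradients under this scheme, which is where the training-time and memory savings come from.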
Open Vocabulary Multi-Label Video Classification
Rohit Gupta · Mamshad Nayeem Rizve · Jayakrishnan Unnikrishnan · Ashish Tawari · Son Tran · Shah Mubarak · Benjamin Yao · Trishul A Chilimbi
Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.
R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
Ye Liu · Jixuan He · Wanhua Li · Junsik Kim · Donglai Wei · Hanspeter Pfister · Chang Wen Chen
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning (R^2-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight R^2 Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, R^2 Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioning on the given query, resulting in a coarse-to-fine scheme. R^2-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code will be publicly available.
Leveraging temporal contextualization for video action recognition
Minji Kim · Dongyoon Han · Taekyung Kim · Bohyung Han
Pretrained vision-language models (VLMs) have shown effectiveness in video understanding. However, recent studies have not sufficiently leveraged essential temporal information from videos, simply averaging frame-wise representations or referencing consecutive frames. We introduce Temporally Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding that effectively and efficiently leverages comprehensive video information. We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize it into context tokens, and ultimately leverages the context tokens during the feature encoding process. Furthermore, our Video-conditional Prompting (VP) module leverages the context tokens to generate informative prompts in the text modality. We conduct extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised settings to validate the superiority of our TC-CLIP. Ablation studies for TC and VP support our design choices. Our code will be publicly available.
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan · Xiaojian Ma · Rujie Wu · yuntao du · Jiaqi Li · Zhi Gao · Qing Li
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, improving over baselines by 6.6% on NExT-QA and 26.0% on EgoSchema on average. The code will be released to the public.
KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval
Xianwei Zhuang · Hongxiang Li · Xuxin Cheng · Zhihong Zhu · Yuxin Xie · Yuexian Zou
Existing video-text retrieval methods predominantly focus on designing diverse cross-modal interaction mechanisms between captions and videos. However, those approaches diverge from human learning paradigms, where humans possess the capability to seek and associate knowledge from an open set, rather than rote memorizing all text-video instances. Motivated by this, we attempt to decouple knowledge from retrieval models through multi-grained knowledge stores and identify two significant benefits of our knowledge-decoupling strategy: (1) it ensures a harmonious balance between knowledge memorization and retrieval optimization, thereby improving retrieval performance; and (2) it can promote incorporating diverse open-world knowledge to augment video-text retrieval. To efficiently integrate information from knowledge stores, we further introduce a novel retrieval framework termed KDProR, which utilizes our proposed Expectation-Knowledge-Maximization (EKM) algorithm for optimization. Specifically, in E-step, KDProR obtains relevant contextual semantics from knowledge stores and achieves efficient knowledge injection through interpolation and alignment correction. During the K-step, KDProR calculates the knowledge KNN distribution by indexing the Top-K acquired knowledge to refine the retrieval distribution, and in M-step, KDProR optimizes the retrieval model by maximizing the likelihood of the objective. Extensive experiments on various benchmarks prove that KDProR significantly outperforms previous state-of-the-art methods across all metrics. Remarkably, KDProR can uniformly and efficiently incorporate diverse open-world knowledge and is compatible with different interaction mechanisms and architectures.
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
Yi Wang · Kunchang Li · Xinhao Li · Jiashuo Yu · Yinan He · Guo Chen · Baoqi Pei · Rongkun Zheng · Jilan Xu · Zun Wang · Yansong Shi · Tianxiang Jiang · SongZe Li · hongjie Zhang · Yifei Huang · Yu Qiao · Yali Wang · Limin Wang
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art results in action recognition, video-text tasks, and video-centric dialogue. Our system design includes a progressive approach that unifies the learning of masked video token reconstruction, crossmodal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate state-of-the-art performance on over 60 out of 74 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models will be released.
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Nina Shvetsova · Anna Kukleva · Xudong Hong · Christian Rupprecht · Bernt Schiele · Hilde Kuehne
Instructional videos are a common source for learning text-video or even multimodal representations by leveraging subtitles extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale. Specifically, we prompt an LLM to create plausible video captions based on ASR subtitles of instructional videos. To this end, we introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence. We further prompt the LLM to generate timestamps for each produced caption based on the timestamps of the subtitles and finally align the generated captions to the video temporally. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve the performance over many different benchmark datasets for zero-shot text-video retrieval and video captioning, but also lead to a disentangling of textual narration from the audio, boosting the performance in text-video-audio tasks.
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
Jinxing Zhou · Dan Guo · Yuxin Mao · Yiran Zhong · Xiaojun Chang · Meng Wang
The Audio-Visual Video Parsing (AVVP) task aims to detect and temporally locate events within the audio and visual modalities. Multiple events can overlap in the timeline, making identification challenging. While traditional methods usually focus on improving the early audio-visual encoders to embed more effective features, the decoding phase, which is crucial for final event classification, often receives less attention. We aim to advance the decoding phase and improve its interpretability. Specifically, we introduce a new decoding paradigm, label semantic-based projection (LEAP), that employs the label texts of event categories, each bearing distinct and explicit semantics, for parsing potentially overlapping events. LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings. This process, enriched by modeling cross-modal (audio/visual-label) interactions, gradually disentangles event semantics within video segments to refine relevant label embeddings, guaranteeing a more discriminative and interpretable decoding process. To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function. This function leverages the Intersection over Union of audio and visual events (EIoU) as a novel metric to calibrate audio-visual similarities at the feature level, accommodating the varied event densities across modalities. Extensive experiments demonstrate the superiority of our method, achieving new state-of-the-art performance for AVVP and also enhancing the relevant audio-visual event localization task.
Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
Juncheng Ma · Peiwen Sun · Yaoting Wang · Di Hu
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes. However, since the AVSS task requires establishing audio-visual correspondence and semantic understanding simultaneously, we observe that previous methods have struggled to handle this mashup of objectives in end-to-end training, resulting in insufficient learning and sub-optimal results. Therefore, we propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simple subtasks, from localization to semantic understanding, each fully optimized in its own stage to achieve step-by-step global optimization. This training strategy has also proved its generalization and effectiveness on existing methods. To further improve the performance of AVS tasks, we propose a novel framework, Adaptive Audio Visual Segmentation, in which we incorporate an adaptive audio query generator and integrate masked attention into the transformer decoder, facilitating the adaptive fusion of visual and audio features. Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three AVS benchmarks. The code will be released soon.
Uncertainty-aware sign language video retrieval with probability distribution modeling
Xuan Wu · Hongxiang Li · yuanjiang luo · Xuxin Cheng · Xianwei Zhuang · Meng Cao · Keren Fu
Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude the direct application of these techniques. Previous methods achieve the mapping between sign language video and text through fine-grained modal alignment. However, due to the scarcity of fine-grained annotation, the uncertainty inherent in sign language video is underestimated, limiting the further development of sign language retrieval tasks. To address this challenge, we propose Uncertainty-aware Probability Distribution Retrieval (UPRet), a novel approach that conceptualizes the mapping process between sign language video and text in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%).
NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition
Chenyu Liu · Jia Pan · Jinshui Hu · Baocai Yin · Bing Yin · Mingjun Chen · Cong Liu · Jun Du · Qingfeng Liu
Recently, Handwritten Mathematical Expression Recognition (HMER) has gained considerable attention in pattern recognition for its diverse applications in document understanding. Current methods typically approach HMER as an image-to-sequence generation task within an autoregressive (AR) encoder-decoder framework. However, these approaches suffer from several drawbacks: 1) a lack of overall language context, limiting information utilization beyond the current decoding step; 2) error accumulation during AR decoding; and 3) slow decoding speed. To tackle these problems, this paper makes the first attempt to build a novel bottom-up Non-AutoRegressive Modeling approach for HMER, called NAMER. NAMER comprises a Visual Aware Tokenizer (VAT) and a Parallel Graph Decoder (PGD). Initially, the VAT tokenizes visible symbols and local relations at a coarse level. Subsequently, the PGD refines all tokens and establishes connectivities in parallel, leveraging comprehensive visual and linguistic contexts. Experiments on the CROHME 2014/2016/2019 and HME100K datasets demonstrate that NAMER not only outperforms the current state-of-the-art (SOTA) methods on ExpRate by 1.93%/2.35%/1.49%/0.62%, but also achieves significant speedups of 13.7x in decoding time and 6.7x in overall FPS, proving the effectiveness and efficiency of NAMER.
Domain Shifting: A Generalized Solution for Heterogeneous Cross-Modality Person Re-Identification
Yan Jiang · Xu Cheng · Hao Yu · Xingyu Liu · Haoyu Chen · Guoying Zhao
Cross-modality person re-identification (ReID) is a challenging task that aims to match cross-modality pedestrian images across multiple camera views. Existing methods are tailored to specific tasks and perform well for visible-infrared or visible-sketch ReID. However, the performance exhibits a notable decline when the same method is utilized for multiple cross-modality ReIDs, limiting its generalization and applicability. To address this issue, we propose a generalized domain shifting method (DNS) for cross-modality ReID, which can address the generalization and perform well in both visible-infrared and visible-sketch modalities. Specifically, we propose the heterogeneous space shifting and common space shifting modules to augment specific and shared representations in heterogeneous space and common space, respectively, thereby regulating the model to learn the consistency between modalities. Further, a domain alignment loss is developed to alleviate the cross-modality discrepancies by aligning the patterns across modalities. In addition, a domain distillation loss is designed to distill identity-invariant knowledge by learning the distribution of different modalities. Extensive experiments on two cross-modality ReID tasks (i.e., visible-infrared ReID, visible-sketch ReID) demonstrate that the proposed method outperforms the state-of-the-art methods by a large margin. The source code will be made available.
HyTAS: A Hyperspectral Image Transformer Architecture Search Benchmark and Analysis
Fangqin Zhou · Mert Kilickaya · Joaquin Vanschoren · Ran Piao
Hyperspectral Imaging (HSI) plays an increasingly critical role in precise vision tasks within remote sensing, capturing a wide spectrum of visual data. Transformer architectures have significantly enhanced HSI task performance, while advancements in Transformer Architecture Search (TAS) have improved model discovery. To harness these advancements for HSI classification, we make the following contributions: i) We propose HyTAS, the first benchmark on transformer architecture search for Hyperspectral imaging, ii) We comprehensively evaluate 12 different methods to identify the optimal transformer over 5 different datasets, iii) We perform an extensive factor analysis on the Hyperspectral transformer search performance, greatly motivating future research in this direction. All benchmark materials included in our supplementary will be publicly available upon publication.
VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition
Ahmad Khaliq · Ming Xu · Stephen Hausler · Michael J Milford · Sourav Garg
Visual Place Recognition (VPR) is a crucial component of many visual localization pipelines for embodied agents. VPR is often achieved by jointly learning local features and an aggregation method. The current state-of-the-art VPR methods rely on VLAD aggregation, which can be trained to learn a weighted contribution of features through their soft assignment to cluster centers. However, this process has two key limitations. Firstly, the feature-to-cluster weighting does not account for over-represented repetitive structures within a cluster, e.g., shadows or window panes; this phenomenon is also referred to as the "burstiness" problem, classically solved by discounting repetitive features before aggregation. Secondly, feature-to-cluster comparisons are compute-intensive for state-of-the-art image encoders with high-dimensional local features. This paper addresses these limitations by introducing VLAD-BuFF with two novel contributions: i) a self-similarity based feature discounting mechanism to learn Burst-aware features within end-to-end VPR training, and ii) Fast Feature aggregation by reducing local feature dimensions through a learnable projection initialized through a PCA transform. We benchmark our method on 9 public datasets, where VLAD-BuFF sets a new state of the art and achieves perfect recall on St Lucia for the first time in VPR research. Our method is able to maintain its high recall even for 12x reduced local feature dimensions, thus enabling fast feature aggregation without compromising on recall. Through additional qualitative studies, we show how our proposed weighting method effectively downweights the non-distinctive features. We will make the source code publicly available.
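To illustrate the burstiness intuition (not the paper's learned mechanism), here is a heuristic PyTorch sketch that discounts local features with many near-duplicates in the same image via soft self-similarity, producing per-feature weights that could precede VLAD soft assignment. The temperature and clamp floor are arbitrary choices.

```python
import torch

def burst_discount(feats, tau=0.1):
    """Heuristic self-similarity discounting: local features that have many
    near-duplicates in the same image ('bursts') receive smaller aggregation weights.
    feats: (N, D) local descriptors from one image."""
    f = torch.nn.functional.normalize(feats, dim=-1)
    sim = (f @ f.t() / tau).softmax(dim=-1)        # (N, N) soft self-similarity (row-normalized)
    burstiness = sim.sum(dim=0)                    # features hit by many rows are repetitive
    return 1.0 / burstiness.clamp(min=1.0)         # per-feature discount weight in (0, 1]
```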
Embodied Understanding of Driving Scenarios
Yunsong Zhou · Linyan Huang · Qingwen Bu · Jia Zeng · Tianyu Li · Hang Qiu · Hongzi Zhu · Minyi Guo · Yu Qiao · Hongyang Li
Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios. Such understanding is typically founded upon Vision-Language Models (VLMs). Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial awareness and long-horizon extrapolation proficiencies. We revisit the key aspects of autonomous driving and formulate appropriate rubrics. Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans. ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities. Besides, the model employs time-aware token selection to accurately inquire about temporal cues. We instantiate ELM on the reformulated multi-faced benchmark, and it surpasses previous state-of-the-art approaches in all aspects. All code, data, and models are accessible.
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Jingkang Yang · Yuhao Dong · Shuai Liu · Bo Li · Ziyue Wang · ChenCheng Jiang · Haoran Tan · Jiamu Kang · Yuanhan Zhang · Kaiyang Zhou · Ziwei Liu
Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. When integrated into an embodied agent, existing embodied VLM works either output detailed action sequences at the manipulation level or only provide plans at an abstract level, leaving a gap between high-level planning and real-world manipulation. To bridge this gap, we introduce Octopus, an embodied vision-language programmer that uses executable code generation as a medium to connect planning and manipulation. Octopus is designed to 1) proficiently comprehend an agent's visual and textual task objectives, 2) formulate intricate action sequences, and 3) generate executable code. To facilitate Octopus model development, we introduce OctoVerse: a suite of environments tailored for benchmarking vision-based code generators on a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games such as Grand Theft Auto (GTA) and Minecraft. To train Octopus, we leverage GPT-4 to control an explorative agent that generates training data, i.e., action blueprints and corresponding executable code. We also collect feedback that enables an enhanced training scheme called Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments, we demonstrate Octopus's functionality and present compelling results, showing that the proposed RLEF refines the agent's decision-making. By open-sourcing our simulation environments, dataset, and model architecture, we aspire to ignite further innovation and foster collaborative applications within the broader embodied AI community.
Finding Visual Task Vectors
Alberto Hojel · Yutong Bai · Trevor Darrell · Amir Globerson · Amir Bar
Visual Prompting is a technique for teaching models to perform a visual task via in-context examples, and without any additional training. In this work, we analyze the activations of MAE-VQGAN, a recent Visual Prompting model, and find Task Vectors, activations that encode task-specific information. Equipped with this insight, we demonstrate that it is possible to identify the task vectors and use them to guide the network towards performing different tasks without providing any input-output example(s). We propose a two-step approach to identifying task vectors. First, we rank the model activations by a relevance score, then apply a simple greedy search algorithm to select the task vectors from the top-scoring activations. Surprisingly, by patching the resulting task vectors it is possible to control the desired task output and achieve performance that is competitive with the original model across multiple tasks while reducing the need for input-output example(s).
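A toy version of the two-step selection could look like the following; `evaluate` is a hypothetical callback that patches a candidate set of activations into the model and returns task performance, and the ranking by relevance score is assumed to have happened already.

```python
def greedy_select_task_vectors(ranked_candidates, evaluate, max_vectors=10, pool=50):
    """Greedy search over pre-ranked activations (illustrative sketch only).

    ranked_candidates: activation identifiers sorted by a relevance score.
    evaluate(selection): hypothetical callback that patches the given
    activations and returns a task score (higher is better).
    """
    selected, best = [], evaluate([])
    for cand in ranked_candidates[:pool]:            # search only the top-scoring pool
        score = evaluate(selected + [cand])
        if score > best:                             # keep a candidate only if it helps
            selected, best = selected + [cand], score
        if len(selected) == max_vectors:
            break
    return selected, best

# Tiny demo with a synthetic scoring function (activation names are made up).
target = {"L7.H3", "L9.H1"}
demo_eval = lambda sel: len(target & set(sel)) - 0.01 * len(sel)
print(greedy_select_task_vectors(["L7.H3", "L2.H5", "L9.H1"], demo_eval))
```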
ControlLLM: Augment Language Models with Tools by Searching on Graphs
Zhaoyang Liu · Zeqiang Lai · Zhangwei Gao · erfei cui · Ziheng Li · Xizhou Zhu · Lewei Lu · Qifeng Chen · Yu Qiao · Jifeng Dai · Wenhai Wang
We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable performance of LLMs, they still struggle with tool invocation due to ambiguous user prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. To overcome these challenges, our framework comprises three key components: (1) a task decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among different tools; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools efficiently on different computational devices. We evaluate our framework on diverse tasks involving image, audio, and video processing, demonstrating its superior accuracy, efficiency, and versatility compared to existing methods.
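For intuition only, a bare-bones search over a toy tool-dependency graph might look like this; the real Thoughts-on-Graph search also reasons over tool parameters and LLM-scored branches, so the tool names and graph format here are invented placeholders.

```python
from collections import deque

def search_tool_path(tool_graph, available, goal):
    """Breadth-first search for a chain of tools that produces `goal`.

    tool_graph: {tool_name: (required_resource_types, produced_resource_type)}
    available:  set of resource types already at hand (e.g. {"image"})
    goal:       resource type we want to produce
    """
    queue = deque([(frozenset(available), [])])
    seen = {frozenset(available)}
    while queue:
        have, path = queue.popleft()
        if goal in have:
            return path
        for tool, (needs, out) in tool_graph.items():
            if set(needs) <= have and out not in have:
                nxt = frozenset(have | {out})
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [tool]))
    return None  # no tool chain reaches the goal

graph = {
    "ocr":       (["image"], "text"),
    "tts":       (["text"], "audio"),
    "captioner": (["image"], "text"),
}
print(search_tool_path(graph, {"image"}, "audio"))  # ['ocr', 'tts']
```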
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
CHENMING ZHU · Tai Wang · Wenwei Zhang · Kai Chen · Xihui Liu
Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs from five reasoning types that require the synergy of reasoning and grounding. This benchmark challenges models to conduct joint reasoning on questions and the 3D environment before predicting the 3D locations of target objects. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back at the enhanced geometry and fine-grained details of the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach. Our code will be released to the community.
Uni3DL: A Unified Model for 3D Vision-Language Understanding
Xiang Li · Jian Ding · Zhaoyang Chen · Mohamed Elhoseiny
In this work, we present Uni3DL, a unified model for 3D Vision-Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D and language understanding.
CrossScore: A Multi-View Approach to Image Evaluation and Scoring
Zirui Wang · Wenjing Bian · Victor Adrian Prisacariu
We introduce a novel cross-reference image quality assessment method that effectively fills a gap in the image assessment landscape, complementing the array of established evaluation schemes -- ranging from full-reference metrics like SSIM and no-reference metrics such as NIQE, to general-reference metrics including FID and multi-modal-reference metrics, e.g., CLIPScore. Utilising a neural network with the cross-attention mechanism and a unique data collection pipeline from NVS optimisation, our method enables accurate image quality assessment without requiring ground truth references. By comparing a query image against multiple views of the same scene, our method addresses the limitations of existing metrics in novel view synthesis (NVS) and similar tasks where direct reference images are unavailable. Experimental results show that our method correlates closely with the full-reference metric SSIM, while not requiring ground truth references. Our code will be publicly available.
Compositional Substitutivity of Visual Reasoning for Visual Question Answering
Chuanhao Li · Zhen Li · Chenchen Jing · Yuwei Wu · Mingliang Zhai · Yunde Jia
Compositional generalization has received much attention in vision-and-language and visual reasoning recently. Substitutivity, the capability to generalize to novel compositions with synonymous primitives such as words and visual entities, is an essential factor in evaluating the compositional generalization ability but remains largely unexplored. In this paper, we explore the compositional substitutivity of visual reasoning in the context of visual question answering (VQA). We propose a training framework for VQA models to maintain compositional substitutivity. The basic idea is to learn invariant representations for synonymous primitives via support-sets. Specifically, for each question-image pair, we construct a support question set and a support image set, and both sets contain questions/images that share synonymous primitives with the original question/image. By enforcing a VQA model to reconstruct the original question/image with the sets, the model is able to identify which primitives are synonymous. To quantitatively evaluate the substitutivity of VQA models, we introduce two datasets: GQA-SPS and VQA-SPS v2, by performing three types of substitutions using synonymous primitives including words, visual entities, and referents. Experimental results demonstrate the effectiveness of our framework. We release GQA-SPS and VQA-SPS v2 at https://github.com/NeverMoreLCH/CG-SPS.
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang · yiming ren · Haowen Luo · Tiantong Li · Chenxiang Yan · Zhe Chen · Wenhai Wang · Qingyun Li · Lewei Lu · Xizhou Zhu · Yu Qiao · Jifeng Dai
We present the All-Seeing Project V2: a new model and dataset designed for understanding object relations in images. Specifically, we propose the All-Seeing Model V2 (ASMv2) that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation (ReC) task. Leveraging this unified task, our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them, diminishing the relation hallucination often encountered by Multi-modal Large Language Models (MLLMs). To facilitate training and evaluation of MLLMs in relation understanding, we created the first high-quality ReC dataset (AS-V2) which is aligned with the format of standard instruction tuning data. In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs. Notably, our ASMv2 achieves an overall accuracy of 52.04 on this relation-aware benchmark, surpassing the 43.14 of LLaVA-1.5 by a large margin. We hope that our work can inspire more future research and contribute to the evolution towards artificial general intelligence. Data, model, and code shall be released.
X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning
Artemis Panagopoulou · Le Xue · Ning Yu · LI JUNNAN · DONGXU LI · Shafiq Joty · Ran Xu · Silvio Savarese · Caiming Xiong · Juan Carlos Niebles
Recent research has achieved significant advancements in visual reasoning tasks through learning image-to-language projections and leveraging the impressive reasoning abilities of Large Language Models (LLMs). This paper introduces an efficient and effective framework that integrates multiple modalities (images, 3D, audio, and video) into a frozen LLM and unveils an emergent ability for cross-modal reasoning (2+ modality inputs). Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs). Through extensive experimentation across all four modalities on 16 benchmarks, we explore both methods and assess their adaptability in integrated and separate cross-modal reasoning. The Q-Former projection demonstrates superior performance in single-modality scenarios and adaptability in joint versus discriminative reasoning involving two or more modalities. However, it exhibits lower generalization capabilities than linear projection in contexts where task-modality data are limited. To enable this framework, we design a scalable pipeline that automatically generates high-quality instruction-tuning datasets from readily available captioning data across different modalities, and contribute 24K QA data for audio and 250K QA data for 3D. To facilitate further research in cross-modal reasoning, we contribute the Discriminative Cross-modal Reasoning (DisCRn) benchmark comprising 9K audio-video QA samples and 28K image-3D QA samples that require the model to reason discriminatively across disparate input modalities. Datasets, models, and the code will be released.
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
Siming Yan · Min Bai · Weifeng Chen · Xiong Zhou · Qixing Huang · Li Erran Li
By combining natural language understanding, generation capabilities, and the breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce a novel framework, ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is efficiently achieved using much cheaper human evaluations instead of full supervision, as well as automated methods. We show the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we released our human annotation (https://github.com/amazon-science/vigor) comprising 15,440 images and generated text pairs with fine-grained evaluations to contribute to related research in the community.
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
Yunhao Gou · Kai Chen · Zhili LIU · Lanqing Hong · Hang Xu · ZHENGUO LI · Dit-Yan Yeung · James Kwok · Yu Zhang
Multimodal large language models (MLLMs) have shown impressive reasoning abilities. However, they are also more vulnerable to jailbreak attacks than their LLM predecessors. We observe that, although MLLMs remain capable of detecting unsafe responses, the safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed due to the introduction of image features. To construct safe MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protecting approach that exploits the inherent safety awareness of MLLMs, and generates safer responses by adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of the pre-aligned LLMs in MLLMs. Experiments with five state-of-the-art (SOTA) MLLMs demonstrate that ECSO significantly enhances model safety (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR), and 71.3% on VLSafe for LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we demonstrate that ECSO can be used as a data engine to generate supervised-finetuning (SFT) data for the alignment of MLLMs without extra human intervention.
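At a high level, the training-free protection can be read as a three-call pipeline; the sketch below assumes a hypothetical `mllm` object with `generate` and `is_unsafe` helpers, which are placeholder names rather than the authors' API.

```python
def eyes_closed_answer(mllm, image, prompt):
    """Training-free safety sketch: answer normally, and if the draft is judged
    unsafe, replace the image with a text description ("close the eyes") so the
    pre-aligned LLM's intrinsic safety alignment can take over.

    `mllm.generate(prompt, image=None)` and `mllm.is_unsafe(text)` are
    hypothetical helpers used only for illustration.
    """
    draft = mllm.generate(prompt, image=image)
    if not mllm.is_unsafe(draft):
        return draft                                   # normal multimodal answer
    caption = mllm.generate(
        "Describe the parts of the image relevant to the question.", image=image)
    # Re-answer from the caption alone, with no image features attached.
    return mllm.generate(f"Context: {caption}\nQuestion: {prompt}")
```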
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models
Hao Cheng · Erjia Xiao · Jindong Gu · Le Yang · Jinhao Duan · Jize Zhang · Jiahang Cao · Kaidi Xu · Renjing Xu
Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, typographic attacks, which disrupt Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), are also expected to pose a security threat to LVLMs. Firstly, we verify typographic attacks on current well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Secondly, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only considers the evaluation of typographic attacks under various multi-modal tasks but also evaluates the effects of typographic attacks influenced by texts generated with diverse factors. Based on the evaluation results, we investigate why typographic attacks impact VLMs and LVLMs, leading to three highly insightful discoveries. In the process of further validating these discoveries, we are able to reduce the performance degradation caused by typographic attacks from 42.07% to 13.90%.
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Byung-Kwan Lee · Beomchan Park · Chae Won Kim · Yong Man Ro
The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence—(1) visual features, (2) auxiliary features from the external CV models, and (3) language features—utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR without enlarging the model size or curating extra visual instruction tuning datasets.
Training A Small Emotional Vision Language Model for Visual Art Comprehension
Jing Zhang · Liang Zheng · Meng Wang · Dan Guo
This paper develops small vision language models for visual art understanding: given an artwork, the model aims to identify its emotion category and explain this prediction in natural language. While small models are computationally efficient, their capacity is much more limited than that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) through emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived from a VAD dictionary, and use a VAD head to align the VAD vectors of the predicted emotion explanation with those of the ground truth. This allows the vision language model to better understand and generate emotional texts, compared with using traditional text embeddings alone. On the other hand, we design a contrastive head to pull close the embeddings of the image, its emotion class, and the explanation, which aligns model outputs and inputs. On two public affective explanation datasets, we show that the proposed techniques consistently improve the visual art understanding performance of baseline SEVLMs. Importantly, the proposed model can be trained and evaluated on a single RTX 2080 Ti while exhibiting very strong performance: it not only outperforms state-of-the-art small models but is also competitive with LLaVA 7B after fine-tuning and with GPT-4(V). Code will be made publicly available.
Quantized Prompt for Efficient Generalization of Vision-Language Models
Tianxiang Hao · Xiaohan Ding · Juexiao Feng · Yuhong Yang · Hui Chen · Guiguang Ding
In the past few years, large-scale pre-trained vision-language models like CLIP have achieved tremendous success in various fields. Naturally, how to transfer the rich knowledge in such huge pre-trained models to downstream tasks and datasets has become a hot topic. During downstream adaptation, the most challenging problems are overfitting and catastrophic forgetting, which can cause the model to overly focus on the current data and lose more crucial domain-general knowledge. Existing works use classic regularization techniques to solve these problems, but as the solutions become increasingly complex, the ever-growing storage and inference costs also become a significant problem that urgently needs to be addressed. In this paper, we start from the observation that proper random noise can suppress overfitting and catastrophic forgetting. We then regard quantization error as a kind of noise and explore quantization for regularizing vision-language models, which is both efficient and effective. Furthermore, to improve the model's generalization capability while maintaining its specialization capacity at minimal cost, we deeply analyze the characteristics of the weight distribution in prompts, derive several principles for quantization module design, and follow these principles to create several competitive baselines. The proposed method is significantly efficient due to its inherent lightweight nature, making it possible to adapt on extremely resource-limited devices. Our method can be fruitfully integrated into many existing approaches like MaPLe, enhancing accuracy while reducing storage overhead, making it more powerful yet versatile. Extensive experiments on 11 datasets demonstrate the clear superiority of our method.
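As a loose illustration of "quantization error as noise", the snippet below applies symmetric uniform quantization to a learnable prompt tensor with a straight-through estimator; the bit-width and quantizer design are generic assumptions, not the paper's specific module.

```python
import torch

def quantize_prompt(prompt, num_bits=4):
    """Symmetric uniform quantization of a prompt tensor.

    The quantization error acts as structured noise on the prompt; a
    straight-through estimator keeps the full-precision prompt trainable.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = prompt.abs().max() / qmax
    q = torch.clamp(torch.round(prompt / scale), -qmax - 1, qmax)
    dequant = q * scale
    # Forward pass uses the quantized values; gradients flow to `prompt`.
    return prompt + (dequant - prompt).detach()

ctx = torch.randn(16, 512, requires_grad=True)   # 16 learnable prompt tokens
noisy_ctx = quantize_prompt(ctx, num_bits=4)
print((noisy_ctx - ctx).abs().mean())            # magnitude of the quantization "noise"
```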
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
Ofir Abramovich · Niv Nayman · Sharon Fogel · Inbal Lavi · Ron Litman · Shahar Tsiper · Royee Tichauer · srikar appalaraju · Shai Mazor · R. Manmatha
In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches. Typically, the queries to the model are input exclusively to the language component, necessitating the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and allow highlighting relevant parts of the document, while disregarding others. We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder in place of the prompt, to empower the model with focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on various benchmarks.
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Agneet Chatterjee · Gabriela Ben Melech Stan · Estelle Guez Aflalo · Sayak Paul · Dhruba Ghosh · Tejas Gokhale · Ludwig Schmidt · Hanna Hajishirzi · Vasudev Lal · Chitta R Baral · Yezhou Yang
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only 0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while improving the FID and CMMD scores. Secondly, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We will publicly release all our code, data, and models.
MultiGen: Zero-shot Image Generation from Multi-modal Prompts
Zhi-Fan Wu · Lianghua Huang · Wei Wang · Yanheng Wei · Yu Liu
The field of text-to-image generation has witnessed substantial advancements in the preceding years, allowing the generation of high-quality images based solely on text prompts. However, accurately describing objects through text alone is challenging, necessitating the integration of additional modalities like coordinates and images for more precise image generation. Existing methods often require fine-tuning or only support using a single object as the constraint, leaving zero-shot image generation from multi-object multi-modal prompts as an unresolved challenge. In this paper, we propose MultiGen, a novel method designed to address this problem. Given an image-text pair, we obtain object-level text, coordinates and images, and integrate the information into an "augmented token" for each object. The augmented tokens serve as additional conditions and are trained alongside text prompts in the diffusion model, enabling our model to handle multi-object multi-modal prompts. To manage the absence of modalities during inference, we leverage a coordinate model and a feature model to generate object-level coordinates and features based on text prompts. Consequently, our method can generate images from text prompts alone or from various combinations of multi-modal prompts. Through extensive qualitative and quantitative experiments, we demonstrate that our method not only outperforms existing methods but also enables a wide range of tasks.
Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
Tongkun Guan · Wei Shen · Xue Yang · Xuehui Wang · Xiaokang Yang
Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. In contrast, in this work we propose FreeReal, a real-domain-aligned pre-training paradigm that enables the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge real and synthetic worlds for pre-training, a glyph-based mixing mechanism (GlyphMix) is tailored for text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 1.59%, 1.97%, 3.90%, 3.85%, and 4.56% in improving the performance of DPText, FCENet, PSENet, PANet, and DBNet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code will be released soon.
VeCLIP: Improving CLIP Training via Visual-enriched Captions
Zhengfeng Lai · Haotian Zhang · Bowen Zhang · Wentao Wu · Haoping Bai · Aleksei Timofeev · Xianzhi Du · Zhe Gan · Jiulong Shan · Chen-Nee Chuah · Yinfei Yang · Meng Cao
Large-scale web-crawled datasets are fundamental for the success of pre-training vision-language models, such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving precise image-text alignment. Existing methods utilizing large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M. This study introduces a scalable pipeline for noisy caption rewriting. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Employing this cost-effective pipeline, we effortlessly scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to +25.2% gain in COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, VeCLIP achieves a +3% gain while only using 14% of the data employed in the vanilla CLIP and 11% in ALIGN. We also note that the VeCap data is complementary to other well-curated datasets suited for zero-shot classification tasks. When combining VeCap and DFN, our model can achieve strong performance on both image-text retrieval and zero-shot classification tasks, e.g., 83.1% accuracy@1 on ImageNet zero-shot for an H/14 model.
ControlCap: Controllable Region-level Captioning
Yuzhong Zhao · Liu Yue · Zonghao Guo · Weijia Wu · Chen Gong · Qixiang Ye · Fang Wan
Region-level captioning is challenged by the caption degeneration issue, which refers to the tendency of pre-trained multimodal models to predict the most frequent captions while missing the less frequent ones. In this study, we propose a controllable region-level captioning (ControlCap) approach, which introduces control words to a multimodal model to address the caption degeneration issue. Specifically, ControlCap leverages a discriminative module to generate control words within the caption space to partition it into multiple sub-spaces. The multimodal model is constrained to generate captions within a few sub-spaces containing the control words, which increases the opportunity of hitting less frequent captions, alleviating the caption degeneration issue. Furthermore, interactive control words can be given by either a human or an expert model, which enables captioning beyond the training caption space, enhancing the model's generalization ability. Extensive experiments on the Visual Genome and RefCOCOg datasets show that ControlCap improves the CIDEr score by 21.6 and 2.2 respectively, outperforming the state-of-the-arts by significant margins. The code is enclosed in the supplementary material.
Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models
MENGYU ZHENG · Yehui Tang · Zhiwei Hao · Kai Han · Yunhe Wang · Chang Xu
Multi-modal models such as CLIP possess remarkable zero-shot transfer capabilities, making them highly effective in continual learning tasks. However, this advantage is severely compromised by catastrophic forgetting, which undermines the valuable zero-shot learning abilities of these models. Existing methods predominantly focus on preserving zero-shot capabilities but often fall short in fully exploiting the rich modal information inherent in multi-modal models. In this paper, we propose a strategy to enhance both the zero-shot transfer ability and adaptability to new data distribution. We introduce a novel graph-based multi-modal proximity distillation approach that preserves the intra- and inter-modal information for visual and textual modalities. This approach is further enhanced with a sample re-weighting mechanism, dynamically adjusting the influence of teachers for each individual sample. Experimental results demonstrate a considerable improvement over existing methodologies, which illustrate the effectiveness of the proposed method in the field of continual learning.
Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal · Seoyoung Ahn · Zhibo Yang · Niranjan Balasubramanian · Dimitris Samaras · Gregory Zelinsky · Minh Hoai
For computer systems to effectively interact with humans using spoken language, they need to understand how the words being generated affect the users' moment-by-moment attention. Our study focuses on the incremental prediction of attention as a person is seeing an image and hearing a referring expression defining the object in the scene that should be fixated by gaze. To predict the gaze scanpaths in this incremental object referral task, we developed the Attention in Referral Transformer model or ART, which predicts the human fixations spurred by each word in a referring expression. ART uses a multimodal transformer encoder to jointly learn gaze behavior and its underlying grounding tasks, and an autoregressive transformer decoder to predict, for each word, a variable number of fixations based on fixation history. To train ART, we created RefCOCO-Gaze, a large-scale dataset of 19,738 human gaze scanpaths, corresponding to 2,094 unique image-expression pairs, from 220 participants performing our referral task. In our quantitative and qualitative analyses, ART not only outperforms existing methods in scanpath prediction, but also appears to capture several human attention patterns, such as waiting, scanning, and verification.
Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection
Ting Lei · Shaofeng Yin · Yuxin Peng · Yang Liu
Zero-shot Human-Object Interaction (HOI) detection has emerged as a frontier topic due to its capability to detect HOIs beyond a predefined set of categories. This task entails not only identifying the interactiveness of human-object pairs and localizing them but also recognizing both seen and unseen interaction categories. In this paper, we introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP. This approach enhances the generalization of large foundation models, such as CLIP, when fine-tuned for HOI detection. Different from traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction and generalizable interaction classification, respectively. Specifically, we integrate prior knowledge of different granularity into conditional vision prompts, including an input-conditioned instance-level prior and a global spatial configuration pattern prior. The former encourages the image encoder to treat instances belonging to seen or potentially unseen HOI concepts equally while the latter provides representative plausible spatial configuration of the human and object under interaction. Besides, we employ language-aware prompt learning with a consistency constraint to preserve the knowledge of the large foundation model to enable better generalization in the text branch. Extensive experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
Yabin Zhang · Wenjie Zhu · Chenhang He · Yabin Zhang
Out-of-distribution (OOD) detection is crucial for model reliability, as it identifies samples from unknown classes and reduces errors due to unexpected inputs. Vision-Language Models (VLMs) such as CLIP are emerging as powerful tools for OOD detection by integrating multi-modal information. However, the practical application of such systems is challenged by manual prompt engineering, which demands domain expertise and is sensitive to linguistic nuances. In this paper, we introduce Label-driven Automated Prompt Tuning (LAPT), a novel approach to OOD detection that reduces the need for manual prompt engineering. We develop distribution-aware prompts with in-distribution (ID) class names and negative labels mined automatically. Training samples linked to these class labels are collected autonomously via image synthesis and retrieval methods, allowing for prompt learning without manual effort. We utilize a simple cross-entropy loss for prompt optimization, with cross-modal and cross-distribution mixing strategies to reduce image noise and explore the intermediate space between distributions, respectively. The LAPT framework operates autonomously, requiring only ID class names as input and eliminating the need for manual intervention. With extensive experiments, LAPT consistently outperforms manually crafted prompts, setting a new standard for OOD detection. Moreover, LAPT not only enhances the distinction between ID and OOD samples, but also improves ID classification accuracy and strengthens generalization robustness to covariate shifts, resulting in outstanding performance in challenging full-spectrum OOD detection tasks. Codes will be released.
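A minimal sketch of the kind of cross-modal mixing mentioned above: interpolating image and text embeddings of the same class before computing the prompt-tuning cross-entropy. The Beta-sampled mixing ratio is a generic mixup-style assumption, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def cross_modal_mix(image_feats, text_feats, alpha=1.0):
    """Mix row-aligned image and text embeddings of the same class
    (a generic mixup-style stand-in for a cross-modal mixing strategy).

    image_feats, text_feats: (B, D) embeddings, one row per training sample.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample((image_feats.size(0), 1))
    mixed = lam * image_feats + (1.0 - lam) * text_feats
    return F.normalize(mixed, dim=-1)     # keep features on the unit sphere

img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(cross_modal_mix(img, txt).shape)    # torch.Size([8, 512])
```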
Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy
Hong Zhang · Yixuan Lyu · Qian Yu · Hanyang Liu · Huimin Ma · Yuan Ding · Yifan Yang
In the domain of Camouflaged Object Segmentation (COS), despite continuous improvements in segmentation performance, the underlying mechanisms of effective camouflage remain poorly understood, akin to a black box. To address this gap, we present the first comprehensive study to examine the impact of camouflage attributes on the effectiveness of camouflage patterns, offering a quantitative framework for the evaluation of camouflage designs. To support this analysis, we have compiled the first dataset comprising descriptions of camouflaged objects and their attribute contributions, termed COD-Text And X-attributions (COD-TAX). Moreover, we draw inspiration from the hierarchical process by which humans process information: from high-level textual descriptions of overarching scenarios, through mid-level summaries of local areas, to low-level pixel data for detailed analysis. Building on this, we have developed a robust framework that combines textual and visual information for the task of COS, named Attribution CUe Modeling with Eye-fixation Network (ACUMEN). ACUMEN demonstrates superior performance, outperforming nine leading methods across three widely-used datasets. We conclude by highlighting key insights derived from the attributes identified in our study, and we will make our code publicly available.
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Tim Salzmann · Markus Ryll · Alex Bewley · Matthias Minderer
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation
lei wang · Zejian Yuan · Badong Chen
Current end-to-end Scene Graph Generation (SGG) relies solely on visual representations to separately detect sparse relations and entities in an image. This leads to the issue where the predictions of entities do not contribute to the prediction of relations, necessitating post-processing to assign corresponding subjects and objects to the predicted relations. In this paper, we introduce a sparse relationship matrix that bridges entity detection and relation detection. Our approach not only eliminates the need for relation matching, but also leverages the semantics and positional information of predicted entities to enhance relation prediction. Specifically, a multi-granularity sparse relationship matrix prediction network is proposed, which utilizes three gated pooling modules focusing on filtering negative samples at different granularities, thereby obtaining a sparse relationship matrix containing entity pairs most likely to form relations. Finally, a set of sparse, most probable subject-object pairs can be constructed and used for relation decoding. Experimental results on multiple datasets demonstrate that our method achieves a new state-of-the-art overall performance. Our code is available.
Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection
Xingyu Peng · Yan Bai · Chen Gao · Lirong Yang · Fei Xia · Beipeng Mu · Xiaofei Wang · Si Liu
Open-Vocabulary Detection (OVD) is the task of detecting all interesting objects in a given scene without predefined object classes. Extensive work has been done on OVD for 2D RGB images, but the exploration of 3D OVD is still limited. Intuitively, lidar point clouds provide 3D information at both the object level and the scene level, enabling trustworthy detection results. However, previous lidar-based OVD methods only focus on the usage of object-level features, ignoring the essence of scene-level information. In this paper, we propose a Global-Local Collaborative Scheme (GLIS) for the lidar-based OVD task, which contains a local branch to generate object-level detection results and a global branch to obtain a scene-level global feature. With the global-local information, a Large Language Model (LLM) is applied for chain-of-thought inference, and the detection result can be refined accordingly. We further propose Reflected Pseudo Labels Generation (RPLG) to generate high-quality pseudo labels for supervision and Background-Aware Object Localization (BAOL) to select precise object proposals. Extensive experiments on ScanNetV2 and SUN RGB-D demonstrate the superiority of our methods.
Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation
Pengfei Wang · Yuxi Wang · Shuai Li · Zhaoxiang Zhang · Zhen Lei · Yabin Zhang
The scarcity of large-scale 3D-text paired data poses a great challenge on open vocabulary 3D scene understanding, and hence it is popular to leverage internet-scale 2D data and transfer their open vocabulary capabilities to 3D models through knowledge distillation. However, the existing distillation-based 3D scene understanding approaches rely on the representation capacity of 2D models, disregarding the exploration of geometric priors and inherent representational advantages offered by 3D data. In this paper, we propose an effective approach, namely Geometry Guided Self-Distillation (GGSD), to learn superior 3D representations from 2D pre-trained models. Specifically, we first design a geometry guided distillation module to distill knowledge from 2D models, where we leverage the 3D geometric priors to alleviate inherent noise in 2D models and enhance the representation learning process. Due to the inherent representation advantages of 3D data, the performance of the distilled 3D student model can significantly surpass that of the 2D teacher model. This motivates us to further leverage the representation advantages of 3D data through self-distillation. As a result, our proposed GGSD approach outperforms the existing open vocabulary 3D scene understanding methods by a large margin, as demonstrated by our experiments on both indoor and outdoor benchmark datasets. The code will be released.
SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding
Han Xiao · Wenzhao Zheng · Sicheng Zuo · Peng Gao · Jie Zhou · Jiwen Lu
Vision transformers have demonstrated promising results and are core components in many tasks. While existing works have explored diverse interaction or transformation modules to process image tokens, most of them still focus on context feature extraction, supplemented with the spatial information injected through additional positional embedding. However, the local positional information within each image token hinders effective spatial scene modeling, making the learned representation hard to directly adapt to downstream tasks, especially those that require high-resolution fine-tuning or 3D scene understanding. To solve this challenge, we propose SpatialFormer, an efficient vision transformer architecture designed to facilitate adaptive spatial modeling for generalizable image representation learning. Specifically, we accompany the image tokens with a set of adaptive spatial tokens to represent the context and spatial information respectively. Each spatial token is initialized with its positional encoding, augmented with learnable embeddings to introduce essential spatial priors that enhance the context features. We employ a decoder-only architecture to enable efficient interaction between the two types of tokens. Our approach learns transferable image representation with enhanced abilities for scene understanding. Moreover, the generated spatial tokens can serve as enhanced initial queries for task-specific decoders, facilitating adaptations to downstream tasks. Extensive experiments on standard image classification and downstream 2D and 3D perception tasks demonstrate the efficiency and transferability of the proposed SpatialFormer architecture.
LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers
Ziling Huang · Shin’ichi Satoh
Given an image and a text description, visual grounding finds the target region in the image explained by the text. It has two task settings: referring expression comprehension (REC) to estimate a bounding box and referring expression segmentation (RES) to predict a segmentation mask. Currently, the most promising visual grounding approaches learn REC and RES jointly, given rich ground truth of both the bounding box and the segmentation mask of the target object. However, we argue that a very simple but strong constraint has been overlooked by existing approaches: given an image and a text description, REC and RES refer to the same object. We propose the Location Aware Transformer (LoA-Trans), which makes this constraint explicit through a center prompt: the system first predicts the center of the target object by simple regression and feeds it as a common prompt to both REC and RES. In this way, the system enforces that REC and RES refer to the same object. To mitigate possible inaccuracies in center estimation, we introduce a query selection mechanism. Instead of randomly initialized queries for bounding-box and segmentation-mask decoding, the query selection mechanism generates possible object locations other than the estimated center and uses them as location-aware queries as a remedy for possibly inaccurate center estimation. We also introduce a TaskSyn Network in the decoder for better coordination between REC and RES. Our method achieves state-of-the-art performance on three commonly used datasets: RefCOCO, RefCOCO+, and RefCOCOg. Extensive ablation studies demonstrate the validity of each of the proposed components. We will release the code later.
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
Feng Wang · Jieru Mei · Alan Yuille
Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual and textual features at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to attain favorable pixel-level segmentation results. In this work, we investigate CLIP's spatial reasoning mechanism and identify that its failure in dense prediction is caused by a location misalignment issue in the self-attention process. Based on this observation, we propose a training-free adaptation approach for CLIP's semantic segmentation, which introduces only a very simple modification to CLIP but effectively addresses the issue of location misalignment. Specifically, we reform the self-attention mechanism by leveraging query-to-query and key-to-key similarity to determine attention scores. Remarkably, this minimal modification to CLIP significantly enhances its capability in dense prediction, improving the original CLIP's 14.1% average zero-shot mIoU over eight semantic segmentation benchmarks to 38.2%, and outperforming the existing SoTA's 33.9% by a large margin.
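To make the modification concrete, the reformed attention can be sketched as below: attention scores come from query-to-query and key-to-key similarity rather than query-to-key. The projection weights, token count, and scale here are random stand-ins for the actual CLIP layer, so this is an illustration of the idea rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def correlative_self_attention(x, w_q, w_k, w_v, scale):
    """Self-attention variant using q-q and k-k similarity for the scores.

    x: (N, D) visual tokens; w_q, w_k, w_v: (D, D) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = F.softmax(q @ q.T * scale, dim=-1) + F.softmax(k @ k.T * scale, dim=-1)
    return attn @ v

tokens = torch.randn(196, 768)                       # e.g. 14x14 patch tokens
w = [torch.randn(768, 768) * 0.02 for _ in range(3)]
out = correlative_self_attention(tokens, *w, scale=768 ** -0.5)
print(out.shape)                                     # torch.Size([196, 768])
```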
EAFormer: Scene Text Segmentation with Edge-Aware Transformers
Haiyang Yu · Teng Fu · Bin Li · Xiangyang Xue
Scene text segmentation aims at cropping texts from scene images, which is usually used to help generative models edit or remove texts. The existing text segmentation methods tend to involve various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment texts more accurately, especially at the edge of texts. Specifically, we first design a text edge extractor to detect edges and filter out edges of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly-used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method can perform better than previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCOTS and MLTS) are not accurate enough to fairly evaluate our methods, we have relabeled these datasets. Through experiments, we observe that our method can achieve a higher performance improvement when more accurate annotations are used for training. The code and datasets are available at https://hyangyu.github.io/EAFormer/.
CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation
Monika Wysoczanska · Oriane Siméoni · Michaël Ramamonjisoa · Andrei Bursuc · Tomasz Trzciński · Patrick Pérez
The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can potentially suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations or explicit supervision. In this work, we take the best of both worlds and propose an open-vocabulary semantic segmentation method, which does not require any annotations. We propose to locally improve dense MaskCLIP features, which are computed with a simple modification of CLIP's last pooling layer, by integrating localization priors extracted from self-supervised features. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that the used self-supervised feature properties can directly be learnt from CLIP features. Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference, no extra supervision or extra memory, and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes and ADE20k.
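The pooling-with-localization-priors idea can be caricatured as an affinity-weighted smoothing of dense features, where the affinity would come from self-supervised (DINO-like) patch correlations; the snippet is a generic sketch, not the method's exact layers.

```python
import torch

def smooth_dense_features(clip_feats, affinity):
    """Refine dense patch features with a localization prior: each patch
    becomes an affinity-weighted average of its correlated patches.

    clip_feats: (N, D) dense patch features; affinity: (N, N) non-negative weights.
    """
    weights = affinity / affinity.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return weights @ clip_feats

feats = torch.randn(196, 512)                     # dense MaskCLIP-style features
aff = torch.relu(torch.randn(196, 196))           # placeholder localization prior
print(smooth_dense_features(feats, aff).shape)    # torch.Size([196, 512])
```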
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Byeonghyun Pak · Byeongju Woo · Sunghwan Kim · Dae-hwan Kim · Hoseong Kim
In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from text embeddings of vision-language models. We employ the text embeddings as object queries within a transformer-based segmentation framework (textual object queries). These queries are regarded as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. Additionally, we suggest three regularization losses to improve the efficacy of tqdm by aligning between visual and textual features. By utilizing our method, the model can comprehend inherent semantic information for classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5→Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. Source code will be released.
Attention Decomposition for Cross-Domain Semantic Segmentation
Liqiang He · Sinisa Todorovic
This work addresses cross-domain semantic segmentation. While recent "encoder-heavy" CNNs and transformers led to significant advances, we introduce a new transformer with a lighter encoder and a more complex decoder with query tokens for predicting segmentation masks, called ADFormer. The domain gap between the source and target domains is reduced with two mechanisms. First, we decompose cross-attention in the decoder into domain-independent and domain-specific parts to enforce that the query tokens interact with the domain-independent aspects of the image tokens, shared by the source and target domains, rather than the domain-specific counterparts which induce the domain gap. Second, we use a gradient reverse block to control back-propagation of the gradient, and hence introduce adversarial learning in the decoder of ADFormer. Our results on two benchmark domain shifts -- GTA to Cityscapes and SYNTHIA to Cityscapes -- show that ADFormer outperforms SOTA "encoder-heavy" methods with significantly lower complexity.
SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis
Hanrong Ye · Jason Wen Yong Kuen · Qing Liu · Zhe Lin · Brian Price · Dan Xu
We present SegGen, a new data generation approach that pushes the performance boundaries of state-of-the-art image segmentation models. One major bottleneck of previous data synthesis methods for segmentation is the design of the "segmentation labeler module", which is used to synthesize segmentation masks for images. The segmentation labeler modules, which are segmentation models by themselves, bound the performance of downstream segmentation models trained on the synthetic masks. These methods encounter a "chicken or egg" dilemma and thus fail to outperform existing segmentation models. To address this issue, we propose a novel method that reverses the traditional data generation process: we first (i) generate highly diverse segmentation masks that match the real-world distribution from text prompts, and then (ii) synthesize realistic images conditioned on the segmentation masks. In this way, we avoid the need for any segmentation labeler module. SegGen integrates two data generation strategies, namely MaskSyn and ImgSyn, to largely improve data diversity in synthetic masks and images. Notably, the high quality of our synthetic data enables our method to outperform the previous data synthesis method by +25.2 mIoU on ADE20K when trained with pure synthetic data. On the highly competitive ADE20K and COCO benchmarks, our data generation method markedly improves the performance of state-of-the-art segmentation models in semantic segmentation, panoptic segmentation, and instance segmentation. Moreover, experiments show that training with our synthetic data makes the segmentation models more robust towards unseen data domains, including real-world and AI-generated images.
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting
Wouter Van Gansbeke · Bert De Brabandere
Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to handle the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture which omits these complexities. Our training process consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. The use of a generative model unlocks the exploration of mask completion or inpainting, which has applications in interactive segmentation. The experimental validation on COCO and ADE20k yields strong results for segmentation tasks. Finally, we demonstrate the approach's adaptability to a multi-task setting by introducing learnable task embeddings.
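Step (1) of the recipe, compressing segmentation masks into a latent map that a diffusion model can later denoise, might be sketched as a shallow convolutional autoencoder like the one below; channel counts, strides, and the class count are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ShallowMaskAutoencoder(nn.Module):
    """Toy shallow autoencoder mapping one-hot segmentation masks to a compact
    latent map and back (illustrating the first training step)."""

    def __init__(self, num_classes=134, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, one_hot_masks):
        z = self.encoder(one_hot_masks)   # latent map a diffusion model could denoise
        return self.decoder(z), z

model = ShallowMaskAutoencoder()
masks = torch.zeros(1, 134, 128, 128)     # placeholder one-hot panoptic masks
recon, z = model(masks)
print(z.shape, recon.shape)               # (1, 4, 32, 32) and (1, 134, 128, 128)
```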
MC-PanDA: Mask Confidence for Panoptic Domain Adaptation
Ivan Martinovic · Josip Šarić · Siniša Šegvić
Domain adaptive panoptic segmentation promises to resolve the long tail of corner cases in natural scene understanding. Most approaches involve consistency learning with a Mean Teacher. Previous state of the art extends this baseline with cross-task consistency, careful system-level optimization and heuristic improvement of teacher predictions. In contrast, we propose to build upon the remarkable capability of mask transformers to estimate their own prediction uncertainty. Our method favours training on confident pseudo-labels by leveraging the fine-grained confidence of panoptic teacher predictions. In particular, we modulate the loss with mask-wide confidence and discourage back-propagation in pixels with uncertain mask assignment. Experimental evaluation on standard benchmarks reveals a substantial contribution of the proposed selection techniques. We report 47.4 PQ on SYNTHIA to Cityscapes, which corresponds to an improvement of 6.2 percentage points over the state of the art.
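A minimal sketch of MC-PanDA's idea of modulating the loss with mask-wide confidence while suppressing back-propagation at pixels with uncertain mask assignment; the tensor shapes, threshold, and function name are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def confidence_modulated_loss(student_logits, pseudo_labels, mask_conf, pixel_conf,
                              pixel_thresh=0.7):
    """student_logits: (B, C, H, W); pseudo_labels: (B, H, W) from the teacher;
    mask_conf: (B,) mask-wide confidences; pixel_conf: (B, H, W) per-pixel confidences."""
    per_pixel = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
    keep = (pixel_conf > pixel_thresh).float()     # drop pixels with uncertain mask assignment
    weighted = per_pixel * keep * mask_conf.view(-1, 1, 1)
    return weighted.sum() / keep.sum().clamp(min=1.0)
```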
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing
Pranav Gupta · Rishubh Singh · Pradeep Shenoy · SANTOSH RAVI KIRAN SARVADEVABHATLA
Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of 3.3 (Pascal-Parts-58), 3.5 (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is 4.0. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets.
Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation
Chang Liu · Giulia Rizzoli · Pietro Zanuttigh · Fu Li · Yi Niu
Current weakly-supervised incremental learning for semantic segmentation (WILSS) approaches only consider replacing pixel-level annotations with image-level labels, while the training images are still from well-designed datasets. In this work, we argue that widely available web images can also be considered for the learning of new classes. To achieve this, we first introduce a strategy to select web images which are similar to previously seen examples in the latent space using a Fourier-based domain discriminator. Then, an effective caption-driven rehearsal strategy is proposed to preserve previously learnt classes. To our knowledge, this is the first work to rely solely on web images for both the learning of new concepts and the preservation of the already learned ones in WILSS. Experimental results show that the proposed approach can reach state-of-the-art performance without using manually selected and annotated data in the incremental steps.
Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation
Chongjie Si · Xuehui Wang · Xiaokang Yang · Wei Shen
Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario usually arises where a pixel is concurrently predicted as an old class by the pre-trained segmentation model and a new class by the seed areas. Such a scenario becomes particularly problematic in WILSS, as the lack of pixel-level annotations on new classes makes it intractable to ascertain whether the pixel pertains to the new class or not. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict whilst prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting prediction issue but also effectively mitigates the inherent challenge of incremental learning - catastrophic forgetting. Furthermore, under the auspices of this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, allowing for concurrent execution with model parameter updating via the resolution of a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, resulting in the establishment of new benchmarks and paving the way for further research in this field.
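The prioritized, non-conflicting merge of old-model predictions and new-class seed areas described above can be illustrated with a small PyTorch sketch; the exact tendency-driven rule in the paper may differ, and all names here are hypothetical:

```python
import torch

def merge_pseudo_labels(old_pred, new_seed, ignore_index=255):
    """old_pred: (H, W) labels from the pre-trained model (0 = background);
    new_seed: (H, W) seed-area labels for new classes (0 = no seed).
    At conflicting pixels the old-class prediction is kept, reflecting the tendency to
    preserve old-class predictions; purely illustrative, not the authors' exact rule."""
    pseudo = torch.full_like(old_pred, ignore_index)
    pseudo[old_pred > 0] = old_pred[old_pred > 0]              # old classes win conflicts
    new_only = (old_pred == 0) & (new_seed > 0)
    pseudo[new_only] = new_seed[new_only]                      # non-conflicting new seeds
    pseudo[(old_pred == 0) & (new_seed == 0)] = 0              # agreed background
    return pseudo
```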
Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation
Wei Cong · Yang Cong · Yuyang Liu · Gan Sun
Incremental semantic segmentation endeavors to segment newly encountered classes while maintaining knowledge of old classes. However, existing methods either 1) lack guidance from class-specific knowledge (i.e., old class prototypes), leading to a bias towards new classes, or 2) constrain class-shared knowledge (i.e., old model weights) excessively without discrimination, resulting in a preference for old classes. In this paper, to balance performance between old and new classes, we propose the Class-specific and Class-shared Knowledge (Cs2K) guidance for incremental semantic segmentation. Specifically, from the class-specific knowledge aspect, we design a prototype-guided pseudo labeling that exploits feature proximity from prototypes to correct pseudo labels, thereby overcoming catastrophic forgetting. Meanwhile, we develop a prototype-guided class adaptation that aligns class distribution across datasets via learning old augmented prototypes. Moreover, from the class-shared knowledge aspect, we propose a weight-guided selective consolidation to strengthen old memory while maintaining new memory by integrating old and new model weights based on weight importance relative to old classes. Experiments on public datasets demonstrate that our proposed Cs2K significantly improves segmentation performance and is plug-and-play.
ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation
Yuyuan Liu · Yuanhong Chen · Hu Wang · Vasileios Belagiannis · Ian Reid · Gustavo Carneiro
The costly and time-consuming annotation process to produce large training sets for modelling semantic LiDAR segmentation methods has motivated the development of semi-supervised learning (SSL) methods. However, such SSL approaches often concentrate on employing consistency learning only for individual LiDAR representations. This narrow focus results in limited perturbations that generally fail to enable effective consistency learning. Additionally, these SSL approaches employ contrastive learning based on the sampling from a limited set of positive and negative embedding samples, rather than considering a more effective sampling from a distribution of positive and negative embeddings. This paper introduces a novel semi-supervised LiDAR semantic segmentation framework called ItTakesTwo (IT2). IT2 is designed to ensure consistent predictions from peer LiDAR representations, thereby improving the perturbation effectiveness in consistency learning. Furthermore, our contrastive learning employs informative samples drawn from a distribution of positive and negative embeddings learned from the entire training set. Results on public benchmarks show that our approach achieves remarkable improvements over the previous state-of-the-art (SOTA) methods in the field. Code will be available.
On-the-fly Category Discovery for LiDAR Semantic Segmentation
HYEONSEONG KIM · Sung-Hoon Yoon · Minseok Kim · Kuk-Jin Yoon
LiDAR semantic segmentation is important for understanding the surrounding environment in autonomous driving. Existing methods assume closed-set situations with the same training and testing label space. However, in the real world, unknown classes not encountered during training may appear during testing, making it difficult to apply existing methodologies. In this paper, we propose a novel on-the-fly category discovery method for LiDAR semantic segmentation, aiming to classify and segment both unknown and known classes instantaneously during test time, achieved solely by learning with known classes in training. To embed instant segmentation capability in an inductive setting, we adopt a hash coding-based model with an expandable prediction space as a baseline. Based on this, dual prototypical learning is proposed to enhance the recognition of the known classes by reducing the sensitivity to intra-class variance. Additionally, we propose a novel mixing-based category learning framework based on representation mixing to improve the discovery capability of unknown classes. The proposed mixing-based framework effectively models out-of-distribution representations and learns to semantically group them during training, while distinguishing them from in-distribution representations. Extensive experiments on SemanticKITTI and SemanticPOSS datasets demonstrate the superiority of the proposed methods compared to the baselines. The code will be released.
CONDA: Condensed Deep Association Learning for Co-Salient Object Detection
Long Li · Nian Liu · Dingwen Zhang · Zhongyu Li · Salman Khan · Rao M Anwer · Hisham Cholakkal · Junwei Han · Fahad Shahbaz Khan
Inter-image association modeling is crucial for co-salient object detection. Despite their satisfactory performance, previous methods still have limitations in sufficient inter-image association modeling. This is because most of them focus on image feature optimization under the guidance of heuristically calculated raw inter-image associations. They directly rely on raw associations, which are not reliable in complex scenarios, and their image feature optimization approach is not explicit for inter-image association modeling. To alleviate these limitations, this paper proposes a deep association learning strategy that deploys deep networks on raw associations to explicitly transform them into deep association features. Specifically, we first create hyperassociations to collect dense pixel-pair-wise raw associations and then deploy deep aggregation networks on them. We design a progressive association generation module for this purpose with additional enhancement of the hyperassociation calculation. More importantly, we propose a correspondence-induced association condensation module that introduces a pretext task, i.e., semantic correspondence estimation, to condense the hyperassociations for computational burden reduction and noise elimination. We also design an object-aware cycle consistency loss for high-quality correspondence estimations. Experimental results on three benchmark datasets demonstrate the remarkable effectiveness of our proposed method with various training settings.
General Geometry-aware Weakly Supervised 3D Object Detection
Guowen Zhang · Junsong Fan · Liyi Chen · Zhaoxiang Zhang · Zhen Lei · Yabin Zhang
3D object detection is an indispensable component for scene understanding. However, the annotation of large-scale 3D datasets requires significant human effort. In tackling this problem, many methods adopt weakly supervised 3D object detection that estimates 3D boxes by leveraging 2D boxes and scene/class-specific priors. However, these approaches generally depend on sophisticated manual priors, which results in poor transferability to novel categories and scenes. To this end, we propose a general approach that can be easily transferred to new scenes and/or classes. We propose a unified framework for learning 3D object detectors from weak 2D boxes obtained from the associated RGB images. To solve the ill-posed problem of estimating 3D boxes from 2D boxes, we propose three general components, including the prior injection module, the 2D space projection constraint, and the 3D space geometry constraint. We minimize the discrepancy between the boundaries of projected 3D boxes and their corresponding 2D boxes on the image plane. In addition, we incorporate a semantic ratio loss and a Point-to-Box alignment loss to refine the pose of estimated 3D boxes. Experiments on the KITTI and SUN-RGBD datasets demonstrate that the designed loss can yield surprisingly high-quality 3D bounding boxes with only 2D annotation. Code will be released.
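The 2D space projection constraint can be illustrated by projecting 3D box corners with the camera intrinsics and penalizing the boundary discrepancy with the annotated 2D box; a hedged PyTorch sketch, assuming corners already expressed in camera coordinates, which may differ from the paper's parameterization:

```python
import torch
import torch.nn.functional as F

def projection_constraint_loss(corners_3d, K, boxes_2d):
    """corners_3d: (N, 8, 3) estimated 3D box corners in camera coordinates;
    K: (3, 3) camera intrinsics; boxes_2d: (N, 4) annotated boxes as (x1, y1, x2, y2).
    Projects the corners and penalizes the gap between the projected extent and the
    2D box (an L1 boundary discrepancy, one possible instantiation)."""
    proj = corners_3d @ K.T                                    # (N, 8, 3)
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)        # perspective division
    pred_boxes = torch.cat([uv.min(dim=1).values, uv.max(dim=1).values], dim=1)
    return F.l1_loss(pred_boxes, boxes_2d)
```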
CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection
Xunfa Lai · Zhiyu Yang · Jie Hu · ShengChuan Zhang · liujuan cao · Guannan Jiang · Songan Zhang · zhiyu wang · Rongrong Ji
Existing camouflaged object detection (COD) methods depend heavily on extensive pixel-level annotations. However, acquiring such annotations is laborious due to the inherent camouflage characteristics of the objects. Semi-supervised learning offers a promising solution to this challenge. Yet, its application in COD is hindered by significant pseudo-label noise, at both the pixel level and the instance level. We introduce CamoTeacher, a novel semi-supervised COD framework, utilizing Dual-Rotation Consistency Learning (DRCL) to effectively address these noise issues. Specifically, DRCL minimizes pseudo-label noise by leveraging the consistency of rotated views at the pixel and instance levels. First, it employs Pixel-wise Consistency Learning (PCL) to deal with pixel-level noise by reweighting the different parts within the pseudo-label. Second, Instance-wise Consistency Learning (ICL) is used to adjust weights for pseudo-labels, which handles instance-level noise. Extensive experiments on four COD benchmarks demonstrate that the proposed CamoTeacher not only achieves state-of-the-art performance compared with semi-supervised learning methods, but also rivals established fully-supervised learning methods. Our code will be available soon.
MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks
Sanbao Su · Xin Li · Thang Doan · Sima Behpour · Wenbin He · Liang Gou · Fei Miao · Liu Ren
In this study, we investigate the task of active testing for label-efficient evaluation, which aims to estimate a model's performance on an unlabeled test dataset with a limited annotation budget. Previous approaches relied on deep ensemble models to identify highly informative instances for labeling, but fell short in dense recognition tasks like segmentation and object detection due to their high computational costs. In this work, we present MetaAT, a simple yet effective approach that adapts a Vision Transformer as a Meta Model for active testing. Specifically, we introduce a region loss estimation head to identify challenging regions for more accurate and informative instance acquisition. More importantly, the design of MetaAT allows it to handle annotation granularity at the region level, significantly reducing annotation costs in dense recognition tasks. As a result, our approach demonstrates consistent and substantial performance improvements over five popular benchmarks compared with state-of-the-art methods. Notably, on the CityScapes dataset, MetaAT achieves a 1.36% error rate in performance estimation using only 0.07% of annotations, marking a 10X improvement over existing state-of-the-art methods. To the best of our knowledge, MetaAT represents the first framework for active testing of dense recognition tasks.
Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights
Yan Hao · Florent Forest · Olga Fink
This paper focuses on source-free domain adaptation for object detection in computer vision. This task is challenging and of great practical interest, due to the cost of obtaining annotated datasets for every new domain. Recent research has proposed various solutions for Source-Free Object Detection (SFOD), most being variations of teacher-student architectures with diverse feature alignment, regularization and pseudo-label selection strategies. Our work investigates simpler approaches and their performance compared to more complex SFOD methods in several adaptation scenarios. We highlight the importance of batch normalization layers in the detector backbone, and show that adapting only the batch statistics is a strong baseline for SFOD. We propose a simple extension of a Mean Teacher with strong-weak augmentation in the source-free setting, Source-Free Unbiased Teacher (SF-UT), and show that it actually outperforms most of the previous SFOD methods. Additionally, we showcase that an even simpler strategy consisting of training on a fixed set of pseudo-labels can achieve similar performance to the more complex teacher-student mutual learning, while being computationally efficient and mitigating the major issue of teacher-student collapse. We conduct experiments on several adaptation tasks using benchmark driving datasets including (Foggy)Cityscapes, Sim10k and KITTI, and achieve a notable improvement of 4.7% AP50 on Cityscapes→Foggy-Cityscapes compared with the latest state-of-the-art in SFOD. Source code will be released upon acceptance.
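SF-UT builds on a Mean Teacher, whose core is an exponential-moving-average teacher update; a minimal PyTorch sketch of that update (the momentum value and function name are illustrative, not the paper's exact recipe):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student, parameter by parameter.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```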
Multi-head detectors are widely used in the industry for multi-scale detection. These detectors typically employ a convention of using a features-fused-pyramid-neck. However, the features-fused-pyramid-neck encounters feature misalignment when representations from different level hierarchies are forcibly fused point-to-point. To address this issue, we propose an independent hierarchy pyramid (IHP) architecture to evaluate the effectiveness of the features-unfused-pyramid-neck for multi-head detectors. Subsequently, we introduce a soft nearest neighbor interpolation (SNI) to address feature misalignment, utilizing a weight-downscaling factor to minimize the fusion impact on different level features. Furthermore, we introduce spatial window extension for down-sampling (ESD) to retain spatial features and propose an enhanced lightweight convolutional technique. Finally, building upon the above advancements, we design a secondary features alignment solution (SANs) for real-time detection and achieve state-of-the-art results on the Pascal VOC and MS COCO benchmarks. Code will be released soon.
3D Small Object Detection with Dynamic Spatial Pruning
Xiuwei Xu · Zhihao Sun · Ziwei Wang · Hongmin Liu · Jie Zhou · Jiwen Lu
In this paper, we propose an efficient feature pruning strategy for 3D small object detection. Conventional 3D object detection methods struggle on small objects due to the weak geometric information from a small number of points. Although increasing the spatial resolution of feature representations can improve the detection performance on small objects, the additional computational overhead is unaffordable. Through an in-depth study, we observe that the growth of computation mainly comes from the upsampling operation in the decoder of the 3D detector. Motivated by this, we present a multi-level 3D detector named DSPDet3D which benefits from high spatial resolution to achieve high accuracy on small object detection, while reducing redundant computation by only focusing on small object areas. Specifically, we theoretically derive a dynamic spatial pruning (DSP) strategy to prune the redundant spatial representation of the 3D scene in a cascade manner according to the distribution of objects. Then we design a DSP module following this strategy and construct DSPDet3D with this efficient decoder. On the ScanNet and TO-SCENE datasets, our method achieves leading performance on small object detection. Moreover, DSPDet3D trained with only ScanNet rooms can generalize well to scenes at larger scale. It takes less than 2s to directly process a whole building consisting of more than 4500k points while detecting almost all objects, ranging from cups to beds, on a single RTX 3090 GPU.
Watching it in Dark: A Target-aware Representation Learning Framework for High-Level Vision Tasks in Low Illumination
Yunan LI · Yihao Zhang · Shoude Li · Long Tian · DOU QUAN · Chaoneng Li · Qiguang Miao
Low illumination significantly impacts the performance of learning-based models trained in well-lit conditions. Although current methods alleviate this issue through either image-level enhancement or feature-level adaptation, they often focus solely on the image itself, ignoring how the task-relevant target varies along with different illumination. In this paper, we propose a target-aware representation learning framework designed to improve high-level task performance in low-illumination environments. We first achieve a bi-directional domain alignment from both image appearance and semantic features to bridge data under varying illumination conditions. To better focus on the target itself, we design a target highlighting strategy, incorporated with the saliency mechanism and Temporal Gaussian Mixture Model to emphasize the location and movement of task-relevant targets. We also design a mask token-based representation learning scheme to learn a more robust target-aware feature. Our framework ensures more robust and effective feature representation for high-level vision tasks in low illumination. Extensive experiments conducted on CODaN, ExDark, and ARID datasets validate the effectiveness of our method for both image and video-based tasks, such as classification, detection, and action recognition. The code will be released upon acceptance.
Gradient-Aware for Class-Imbalanced Semi-supervised Medical Image Segmentation
Wenbo Qi · Jiafei Wu · S. C. Chan
Class imbalance poses a significant challenge in semi-supervised medical image segmentation (SSLMIS). Existing techniques face problems such as poor performance on tail classes, instability, and slow convergence speed. We propose a novel Gradient-Aware (GA) method, structured on a clear paradigm: identify extrinsic data bias → analyze intrinsic gradient bias → propose solutions, to address this issue. Through theoretical analysis, we identify the intrinsic gradient bias instigated by extrinsic data bias in class-imbalanced SSLMIS. To combat this, we propose a GA loss, featuring a GADice loss, which leverages a probability-aware gradient for absent classes, and GACE, designed to alleviate gradient bias through class equilibrium and dynamic weight equilibrium. Our proposed method is plug-and-play, simple yet very effective and robust, and exhibits a fast convergence speed. Comprehensive experiments on three public datasets (CT&MRI, 2D&3D) demonstrate our method's superior performance, significantly outperforming other SOTA SSLMIS and class-imbalanced designs (e.g., +17.90% with CPS on 20% labeled Synapse). Code is available at *.
Test-Time Stain Adaptation with Diffusion Models for Histopathology Image Classification
Cheng-Chang Tsai · Yuan-Chih Chen · Chun-Shien Lu
Stain shifts are prevalent in histopathology images and are typically dealt with by normalization or augmentation. Considering that training-time methods are limited in dealing with unseen stains, we propose a test-time stain adaptation method (TT-SaD) with diffusion models that achieves stain adaptation by solving a nonlinear inverse problem during testing. TT-SaD is promising in that it only needs a single domain for training but can adapt well to data from other domains during testing, preventing models from being retrained whenever new data become available. For tumor classification, stain adaptation by TT-SaD outperforms state-of-the-art diffusion model-based test-time methods. Moreover, TT-SaD beats training-time methods when testing on data that are inaccessible during training. To our knowledge, the study of stain adaptation with diffusion models at test time is relatively unexplored.
WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
Pingyi Chen · Chenglu Zhu · Sunyi Zheng · Honglin Li · Lin Yang
Whole slide imaging is routinely adopted for carcinoma diagnosis and prognosis. Abundant experience is required for pathologists to achieve accurate and reliable diagnostic results of whole slide images. The huge size and heterogeneous features of WSIs make the workflow of pathological reading extremely time-consuming. In this paper, we propose a novel framework (WSI-VQA) to interpret WSIs by generative visual question answering. WSI-VQA shows universality by reframing various kinds of slide-level tasks in a question-answering pattern, in which pathologists can achieve immunohistochemical grading, survival prediction, and tumor subtyping following human-machine interaction. Furthermore, we establish a WSI-VQA dataset which contains 8672 slide-level question-answering pairs with 977 WSIs. Besides the ability to deal with different slide-level tasks, our generative model (W2T) outperforms existing discriminative models in medical correctness, which reveals the potential of our model to be applied in the clinical scenario. Additionally, we also visualize the co-attention mapping between word embeddings and WSIs as an intuitive explanation for diagnostic results. The dataset and related code will be public.
ChEX: Interactive Localization and Region Description in Chest X-rays
Philip Müller · Georgios Kaissis · Daniel Rueckert
Report generation models offer fine-grained textual interpretations of medical images like chest X-rays, yet they often lack interactivity (i.e. the ability to steer the generation process through user queries) and localized interpretability (i.e. visually grounding their predictions), which we deem essential for future adoption in clinical practice. While there have been efforts to tackle these issues, they are either limited in their interactivity by not supporting textual queries or fail to also offer localized interpretability. Therefore, we propose a novel multitask architecture and training paradigm integrating textual prompts and bounding boxes for diverse aspects like anatomical regions and pathologies. We call this approach the Chest X-Ray Explainer (ChEX). Evaluations across a heterogeneous set of 9 chest X-ray tasks, including localized image interpretation and report generation, showcase its competitiveness with SOTA models while additional analysis demonstrates ChEX’s interactive capabilities. Code will be made available upon acceptance.
A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization
Qiyu Chen · Huiyuan Luo · Chengkan Lv · Zhengtao Zhang
Anomaly synthesis strategies can effectively enhance unsupervised anomaly detection. However, existing strategies have limitations in the coverage and controllability of anomaly synthesis, particularly for weak defects that are very similar to normal regions. In this paper, we propose Global and Local Anomaly co-Synthesis Strategy (GLASS), a novel unified framework designed to synthesize a broader coverage of anomalies under the manifold and hypersphere distribution constraints of Global Anomaly Synthesis (GAS) at the feature level and Local Anomaly Synthesis (LAS) at the image level. Our method synthesizes near-in-distribution anomalies in a controllable way using Gaussian noise guided by gradient ascent and truncated projection. GLASS achieves state-of-the-art results on the MVTec AD (detection AUROC of 99.9%), VisA, and MPDD datasets and excels in weak defect detection. The effectiveness and efficiency have been further validated in industrial applications for woven fabric defect detection. The code and dataset will be released in the future.
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection
Yuanpeng Tu · Boshen Zhang · Liang Liu · YUXI LI · Jiangning Zhang · Yabiao Wang · Chengjie Wang · cairong zhao
Industrial anomaly detection is generally addressed as an unsupervised task that aims at locating defects with only normal training samples. Recently, numerous 2D anomaly detection methods have been proposed and have achieved promising results; however, using only 2D RGB data as input is not sufficient to identify imperceptible geometric surface anomalies. Hence, in this work, we focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets, i.e., ImageNet, to construct feature databases. We empirically find that directly using these pre-trained models is not optimal: they can either fail to detect subtle defects or mistake abnormal features for normal ones. This may be attributed to the domain gap between the target industrial data and the source data. To address this problem, we propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representations for anomaly detection. Both intra-modal adaptation and cross-modal alignment are optimized from a local-to-global perspective in LSFA to ensure representation quality and consistency in the inference stage. Extensive experiments demonstrate that our method not only brings a significant performance boost to feature embedding based approaches, but also outperforms previous State-of-The-Art (SoTA) methods prominently on both the MVTec-3D AD and Eyecandies datasets; e.g., LSFA achieves 97.1% I-AUROC on MVTec-3D, surpassing the previous SoTA by +3.4%.
Random Walk on Pixel Manifolds for Anomaly Segmentation of Complex Driving Scenes
Zelong Zeng · Kaname Tomite
In anomaly segmentation for complex driving scenes, state-of-the-art approaches utilize anomaly scoring functions to calculate anomaly scores. For these functions, accurately predicting the logits of inlier classes for each pixel is crucial for precisely inferring the anomaly score. However, in real-world driving scenarios, the diversity of scenes often results in distorted manifolds of pixel embeddings in embedding space. This effect is not conducive to directly using the pixel embeddings for the logit prediction during inference, a concern overlooked by existing methods. To address this problem, we propose a novel method called Random Walk on Pixel Manifolds (RWPM). RWPM utilizes random walks to reveal the intrinsic relationships among pixels to refine the pixel embeddings. The refined pixel embeddings alleviate the distortion of manifolds, improving the accuracy of anomaly scores. Our extensive experiments show that RWPM consistently improves the performance of the existing anomaly segmentation methods and achieves the best results.
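A generic random-walk refinement over a pixel affinity matrix, in the spirit of RWPM, can be sketched as follows; the affinity construction, temperature, and number of steps are assumptions, and the quadratic memory cost means practical variants typically operate on downsampled or chunked embeddings:

```python
import torch
import torch.nn.functional as F

def random_walk_refine(embeddings, steps=3, alpha=0.5, tau=0.1):
    """embeddings: (N, D) pixel embeddings. Builds a row-stochastic affinity matrix and
    diffuses the embeddings along it for a few random-walk steps."""
    z = F.normalize(embeddings, dim=1)
    transition = torch.softmax(z @ z.T / tau, dim=1)           # (N, N), rows sum to 1
    refined = embeddings
    for _ in range(steps):
        # Blend the diffused signal with the original embeddings at each step.
        refined = alpha * transition @ refined + (1 - alpha) * embeddings
    return refined
```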
FedVAD: Enhancing Federated Video Anomaly Detection with GPT-Driven Semantic Distillation
Fan Qi · Ruijie Pan · Huaiwen Zhang · Changsheng Xu
The imperative for smart surveillance systems to robustly detect anomalies poses a unique challenge given the sensitivity of visual data and privacy concerns. We propose a novel Federated Learning framework for Video Anomaly Detection that operates under the constraints of data heterogeneity and privacy preservation. We utilize Federated Visual Consistency Clustering to group clients on the server side. Further innovation is realized with an Adaptive Semantic-Enhanced Distillation strategy that infuses public video knowledge into our framework. During this process, Large Language Models are utilized for semantic generation and calibration of public videos. These video-text pairs are then used to fine-tune a multimodal network, which serves as a teacher in updating the global model. This approach not only refines video representations but also increases sensitivity to anomalous events. Our extensive evaluations showcase FedVAD's proficiency in boosting unsupervised and weakly supervised anomaly detection, rivaling centralized training paradigms while preserving privacy. The code will be made available publicly at https://anonymous.4open.science/r/FedVAD-BF51.
Efficient Training of Spiking Neural Networks with Multi-Parallel Implicit Stream Architecture
Zhigao Cao · Meng Li · Xiashuang Wang · Haoyu Wang · Fan Wang · Youjun Li · Zigang Huang
Spiking neural networks (SNNs) are a novel type of bio-plausible neural network with energy efficiency. However, SNNs are non-differentiable and their training memory costs increase with the number of simulation steps. To address these challenges, this work introduces an implicit training method for SNNs inspired by equilibrium models. Our method relies on the multi-parallel implicit stream architecture (MPIS-SNNs). In the forward process, MPIS-SNNs drive multiple fused parallel implicit streams (ISs) to reach an equilibrium state simultaneously. In the backward process, MPIS-SNNs solely rely on a single-time-step simulation of SNNs, avoiding the storage of a large number of activations. Extensive experiments on N-MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 demonstrate that MPIS-SNNs exhibit excellent characteristics such as low latency, low memory cost, low firing rates, and fast convergence speed, and are competitive among the latest efficient training methods for SNNs. Our code is available at an anonymized GitHub repository: https://anonymous.4open.science/r/MPIS-SNNs-3D32.
DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation
Rakshith Subramanyam · Kowshik Thopalli · Vivek Sivaraman Narayanaswamy · Jayaraman J. Thiagarajan
In this paper, we focus on the problem of detecting samples that can lead to model failure under the classification setting. Failures can stem from various sources, such as spurious correlations between image features and labels, class imbalances in the training data, and covariate shifts between training and test distributions. Existing approaches often rely on classifier prediction scores and do not comprehensively identify all failure scenarios. Instead, we pose failure detection as the problem of identifying the discrepancies between the classifier and its enhanced version. We build such an enhanced model by infusing task-agnostic prior knowledge from a vision-language model (e.g., CLIP) that encodes general-purpose visual and semantic relationships. Unlike conventional training, our enhanced model, named the Prior Induced Model (PIM), learns to map the pre-trained model features to the VLM latent space and aligns them with a set of pre-specified, fine-grained class-level attributes, which are later aggregated to estimate the class prediction. We propose that such a training strategy allows the model to concentrate only on the task-specific attributes while making predictions in lieu of the pre-trained model, and also enables human-interpretable explanations for failure. We conduct extensive empirical studies on various benchmark datasets and baselines, observing substantial improvements in failure detection.
SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization
Xixu Hu · Runkai Zheng · Jindong Wang · Cheuk Hang Leung · Qi WU · Xing Xie
Vision Transformers (ViTs) are increasingly used in computer vision due to their high performance, but their vulnerability to adversarial attacks is a concern. Existing methods lack a solid theoretical basis, focusing mainly on empirical training adjustments. This study introduces SpecFormer, tailored to fortify ViTs against adversarial attacks, with theoretical underpinnings. We establish local Lipschitz bounds for the self-attention layer and propose the Maximum Singular Value Penalization (MSVP) to precisely manage these bounds. By incorporating MSVP into ViTs' attention layers, we enhance the model's robustness without compromising training efficiency. SpecFormer, the resulting model, outperforms other state-of-the-art models in defending against adversarial attacks, as demonstrated by experiments on the CIFAR and ImageNet datasets.
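Maximum singular value penalization amounts to estimating the largest singular value of a weight matrix (e.g., an attention projection) and adding it to the training loss; a hedged power-iteration sketch in PyTorch, not SpecFormer's exact regularizer:

```python
import torch
import torch.nn.functional as F

def max_singular_value(weight, iters=5):
    """Estimate the largest singular value of a (possibly reshaped) weight matrix by
    power iteration; adding the returned value to the loss penalizes the layer's
    spectral norm and hence its local Lipschitz bound."""
    w = weight.reshape(weight.shape[0], -1)
    v = torch.randn(w.shape[1], device=w.device)
    for _ in range(iters):
        u = F.normalize(w.detach() @ v, dim=0)
        v = F.normalize(w.detach().T @ u, dim=0)
    return u @ w @ v  # differentiable w.r.t. the weight through this final product

# Illustrative use (layer name hypothetical):
#   loss = task_loss + beta * max_singular_value(attention_proj.weight)
```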
SeiT++: Masked Token Modeling Improves Storage-efficient Training
Minhyun Lee · Song Park · Byeongho Heo · Dongyoon Han · Hyunjung Shim
Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks. However, achieving highly generalizable and high-performing vision models requires expansive datasets, resulting in significant storage requirements. This storage challenge is a critical bottleneck for scaling up models. A recent breakthrough by SeiT proposed the use of Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification. This approach achieved 90% of the performance of a model trained on full-pixel images with only 1% of the storage. While SeiT needs labeled data, its potential in scenarios beyond fully supervised learning remains largely untapped. In this paper, we extend SeiT by integrating Masked Token Modeling (MTM) for self-supervised pre-training. Recognizing that self-supervised approaches often demand more data due to the lack of labels, we introduce TokenAdapt and ColorAdapt. These methods facilitate comprehensive token-friendly data augmentation, effectively addressing the increased data requirements of self-supervised learning. We evaluate our approach across various scenarios, including storage-efficient ImageNet-1k classification, fine-grained classification, ADE-20k semantic segmentation, and robustness benchmarks. Experimental results demonstrate consistent performance improvement in diverse experiments, validating the effectiveness of our method. Our code will be released publicly.
AMD: Automatic Multi-step Distillation of Large-scale Vision Models
Cheng Han · Qifan Wang · Sohail A Dianat · Majid Rabbani · Raghuveer Rao · Yi Fang · Qiang Guan · Lifu Huang · Dongfang Liu
Transformer-based architectures have become the de-facto standard models for diverse vision tasks owing to their superior performance. As the size of the models continues to scale up, model distillation becomes extremely important in various real applications, particularly on devices limited by computational resources. However, prevailing knowledge distillation methods exhibit diminished efficacy when confronted with a large capacity gap between the teacher and the student, e.g., a 10x compression rate. In this paper, we present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. In particular, our distillation process unfolds across multiple steps. Initially, the teacher undergoes distillation to form an intermediate teacher-assistant model, which is subsequently distilled further to the student. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance. We conduct extensive experiments on multiple image classification datasets, including CIFAR-10, CIFAR-100, and ImageNet. The findings consistently reveal that our approach outperforms several established baselines, paving a path for future knowledge distillation methods on large-scale vision models.
Stitched ViTs are Flexible Vision Backbones
Zizheng Pan · Jing Liu · Haoyu He · Jianfei Cai · Bohan Zhuang
Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works utilizing off-the-shelf ViTs are inefficient in terms of training and deployment, because adopting ViTs with individual sizes requires separate trainings and is restricted by fixed performance-efficiency trade-offs. In this paper, we are inspired by stitchable neural networks (SN-Net), which is a new framework that cheaply produces a single model that covers rich subnetworks by stitching pretrained model families, supporting diverse performance-efficiency trade-offs at runtime. Building upon this foundation, we introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation. Specifically, we first propose a two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling strategy that takes into account the underlying FLOPs distributions in the space for better sampling. Finally, we observe that learning stitching layers as a low-rank update plays an essential role in downstream tasks to stabilize training and ensure a good Pareto frontier. With extensive experiments on ImageNet-1K, ADE20K, COCO-Stuff-10K and NYUv2, SN-Netv2 demonstrates superior performance over SN-Netv1 on downstream dense predictions and shows strong ability as a flexible vision backbone, achieving great advantages in both training efficiency and deployment flexibility.
MetaAug: Meta-Data Augmentation for Post-Training Quantization
Cuong Pham · Hoang Anh Dung · Cuong Cao Nguyen · Trung Le · Dinh Q Phung · Gustavo Carneiro · Thanh-Toan Do
Post-Training Quantization (PTQ) has received significant attention because it requires only a small set of calibration data to quantize a full-precision model, which is more practical in real-world applications in which full access to a large training set is not available. However, it often leads to overfitting on the small calibration dataset. Several methods have been proposed to address this issue, yet they still rely on only the calibration set for the quantization and they do not validate the quantized model due to the lack of a validation set. In this work, we propose a novel meta-learning based approach to enhance the performance of post-training quantization. Specifically, to mitigate the overfitting problem, instead of only training the quantized model using the original calibration set without any validation during the learning process as in previous PTQ works, in our approach, we both train and validate the quantized model using two different sets of images. In particular, we propose a meta-learning based approach to jointly optimize a transformation network and a quantized model through bi-level optimization. The transformation network modifies the original calibration data and the modified data will be used as the training set to learn the quantized model with the objective that the quantized model achieves a good performance on the original calibration data. Extensive experiments on the widely used ImageNet dataset with different neural network architectures demonstrate that our approach outperforms the state-of-the-art PTQ methods.
Straightforward Layer-wise Pruning for More Efficient Visual Adaptation
Ruizi Han · Jinglei Tang
Parameter-efficient transfer learning (PETL) aims to adapt large pre-trained models using limited parameters. While most PETL approaches update the added parameters and freeze pre-trained weights during training, the minimal impact of task-specific deep layers on cross-domain data poses a challenge as PETL cannot modify them, resulting in redundant model structures. Structural pruning effectively reduces model redundancy; however, common pruning methods often lead to an excessive increase in stored parameters due to varying pruning structures based on pruning rates and data. Recognizing the storage parameter volume issue, we propose a Straightforward layer-wise pruning method, called SLS, for pruning PETL-transferred models. By evaluating parameters from a feature perspective of each layer and utilizing clustering metrics to assess current parameters based on clustering phenomena in low-dimensional space, SLS facilitates informed pruning decisions. Our study reveals that layer-wise pruning, with a focus on storing pruning indices, addresses storage volume concerns. Notably, mainstream Layer-wise pruning methods may not be suitable for assessing layer importance in PETL-transferred models, where the majority of parameters are pre-trained and have limited relevance to downstream datasets. Comparative analysis against state-of-the-art PETL methods demonstrates that the pruned model achieved a notable balance between model throughput and accuracy. Moreover, SLS effectively reduces storage overhead arising from varying pruned structures while enhancing the accuracy and speed of pruned models compared to conventional pruning methods.
On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition
Zihu Wang · Lingqiao Liu · Scott Ricardo Figueroa Weston · Samuel Tian · Peng Li
Self-Supervised Learning (SSL) has become a prominent approach for acquiring visual representations across various tasks, yet its application in fine-grained visual recognition (FGVR) is challenged by the intricate task of distinguishing subtle differences between categories. To overcome this, we introduce a novel strategy that boosts SSL's ability to extract critical discriminative features vital for FGVR. This approach creates synthesized data pairs to guide the model to focus on discriminative features critical for FGVR during SSL. We start by identifying non-discriminative features using two main criteria: features with low variance that fail to effectively separate data and those deemed less important by Grad-CAM induced from the SSL loss. We then introduce perturbations to these non-discriminative features while preserving discriminative ones. A decoder is employed to reconstruct images from both perturbed and original feature vectors to create data pairs. An encoder is trained on such generated data pairs to become invariant to variations in non-discriminative dimensions while focusing on discriminative features, thereby improving the model's performance in FGVR tasks. We demonstrate the promising FGVR performance of the proposed approach through extensive evaluation on a wide variety of datasets.
Robust Multimodal Learning via Representation Decoupling
Shicai Wei · Yang Luo · Yuji Wang · Chunbo Luo
Multimodal learning robust to missing modality has attracted increasing attention due to its practicality. Existing methods tend to address it by learning a common subspace representation for different modality combinations. However, we reveal that they are sub-optimal due to their implicit constraint on intra-class representation. Specifically, samples with different modalities within the same class are forced to learn representations in the same direction. This hinders the model from capturing modality-specific information, resulting in insufficient learning. To this end, we propose a novel Decoupled Multimodal Representation Network (DMRNet) to assist robust multimodal learning. Specifically, DMRNet models the input from different modality combinations as a probabilistic distribution instead of a fixed point in the latent space, and samples embeddings from the distribution for the prediction module to calculate the task loss. As a result, the direction constraint from the loss minimization is blocked by the sampled representation. This relaxes the constraint on the inference representation and enables the model to capture the specific information for different modality combinations. Furthermore, we introduce a hard combination regularizer to prevent unbalanced training of DMRNet by guiding it to pay more attention to hard modality combinations. Finally, extensive experiments on multimodal classification and segmentation tasks demonstrate that the proposed DMRNet outperforms the state-of-the-art significantly.
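Modeling a modality combination as a distribution rather than a point can be sketched with a Gaussian head and the reparameterization trick; the dimensions and module names below are illustrative, not DMRNet's actual architecture:

```python
import torch
import torch.nn as nn


class ProbabilisticEmbedding(nn.Module):
    """Maps fused multimodal features to a Gaussian in latent space and draws a sample
    with the reparameterization trick, so the task loss sees a stochastic representation
    instead of a fixed point."""

    def __init__(self, in_dim=512, latent_dim=256):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)   # sampled embedding fed to the prediction head
        return z, mu, logvar
```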
SUMix: Mixup with Semantic and Uncertain Information
Huafeng Qin · Xin Jin · Hongyu Zhu · Hongchao Liao · Mounim A. El Yacoubi · Xinbo Gao
Mixup data augmentation approaches have been applied to various deep learning tasks to improve the generalization ability of deep neural networks. Some existing approaches, e.g., CutMix and SaliencyMix, randomly replace a patch in one image with patches from another to generate the mixed image. Similarly, the corresponding labels are linearly combined by a fixed ratio λ. The objects in the two images may overlap during the mixing process, so some semantic information is corrupted in the mixed samples. In this case, the mixed image does not match the mixed label information. Besides, such a label may mislead the deep learning model training, which results in poor performance. To solve this problem, we propose a novel approach named SUMix to learn the mixing ratio as well as the uncertainty for the mixed samples during the training process. First, we design a learnable similarity function to compute an accurate mix ratio. Second, an approach is investigated as a regularization term to model the uncertainty of the mixed samples. We conduct experiments on five image benchmarks, and extensive experimental results imply that our method is capable of improving the performance of classifiers with different cutting-based mixup approaches.
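For context, the fixed-ratio CutMix baseline that SUMix improves upon mixes labels by the pasted-area ratio λ; a minimal PyTorch sketch of that baseline (SUMix itself replaces this fixed ratio with a learned, uncertainty-aware one):

```python
import torch
import torch.nn.functional as F

def cutmix(images, labels, num_classes, alpha=1.0):
    """images: (B, C, H, W); labels: (B,) class indices. Pastes a random patch from a
    shuffled copy of the batch and mixes one-hot labels by the pasted-area ratio."""
    images = images.clone()
    b, _, h, w = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(b)
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    y = torch.randint(0, h - cut_h + 1, (1,)).item()
    x = torch.randint(0, w - cut_w + 1, (1,)).item()
    images[:, :, y:y + cut_h, x:x + cut_w] = images[perm, :, y:y + cut_h, x:x + cut_w]
    lam = 1.0 - (cut_h * cut_w) / (h * w)      # recompute lambda from the actual area
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return images, mixed_labels
```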
Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning
Zijun Long · Lipeng Zhuang · George W Killick · Richard Mccreadie · Gerardo Aragon-Camarasa · Paul Henderson
Human-annotated vision datasets inevitably contain a fraction of human-mislabelled examples. While the detrimental effects of such mislabelling on supervised learning are well-researched, their influence on Supervised Contrastive Learning (SCL) remains largely unexplored. In this paper, we show that human-labelling errors not only differ significantly from synthetic label errors, but also pose unique challenges in SCL, different from those in traditional supervised learning methods. Specifically, our results indicate they adversely impact the learning process in ~99% of the cases where they occur as false positive samples. Existing noise-mitigating methods primarily focus on synthetic label errors and tackle the unrealistic setting of very high synthetic noise rates (40-80%), but they often underperform on common image datasets due to overfitting. To address this issue, we introduce a novel SCL objective with robustness to human-labelling errors, SCL-RHE. SCL-RHE is designed to mitigate the effects of real-world mislabelled examples, typically characterized by much lower noise rates (<5%). We demonstrate that SCL-RHE consistently outperforms state-of-the-art representation learning and noise-mitigating methods across various vision benchmarks, by offering improved resilience against human-labelling errors.
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
Yu-Chu Yu · Chi-Pin Huang · Jr-Jen Chen · Kai-Po Chang · Yung-Hsuan Lai · Fu-En Yang · Yu-Chiang Frank Wang
Large-scale vision-language models (VLMs) have shown a strong zero-shot generalization capability on unseen-domain data. However, when adapting pre-trained VLMs to a sequence of downstream tasks, they are prone to forgetting previously learned knowledge and degrade their zero-shot classification capability. To tackle this problem, we propose a unique Selective Dual-Teacher Knowledge Transfer framework that leverages the most recent fine-tuned and the original pre-trained VLMs as dual teachers to preserve the previously learned knowledge and zero-shot capabilities, respectively. With only access to an unlabeled reference dataset, our proposed framework performs a selective knowledge distillation mechanism by measuring the feature discrepancy from the dual teacher VLMs. Consequently, our selective dual-teacher knowledge distillation would mitigate catastrophic forgetting of previously learned knowledge while preserving the zero-shot capabilities from pre-trained VLMs. Through extensive experiments on benchmark datasets, we show that our proposed framework is favorable against state-of-the-art continual learning approaches for preventing catastrophic forgetting and zero-shot degradation.
SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
Bac Nguyen · Stefan Uhlich · Fabien Cardinaux · Lukas Mauch · Marzieh Edraki · Aaron Courville
Handling distribution shifts from training data, known as out-of-distribution (OOD) generalization, poses a significant challenge in the field of machine learning. While a pre-trained vision-language model like CLIP has demonstrated remarkable zero-shot performance, further adaptation of the model to downstream tasks leads to undesirable degradation for OOD data. In this work, we introduce Sparse Adaptation for Fine-Tuning (SAFT), a method that prevents fine-tuning from forgetting the general knowledge in the pre-trained model. SAFT only updates a small subset of important parameters whose gradient magnitude is large, while keeping the other parameters frozen. SAFT is straightforward to implement and conceptually simple. Extensive experiments show that with only 0.1% of the model parameters, SAFT can significantly improve the performance of CLIP. It consistently outperforms baseline methods across several benchmarks. On the few-shot learning benchmark of ImageNet and its variants, SAFT gives a gain of 5.15% on average over the conventional fine-tuning method in OOD settings.
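Selecting the small subset of parameters with the largest gradient magnitude, as SAFT does, can be sketched as building a binary update mask from one calibration batch; the sparsity level, selection granularity, and function names are assumptions rather than the paper's exact procedure:

```python
import torch

def sparse_update_mask(model, loss, sparsity=0.001):
    """Compute, from one calibration batch, a binary mask selecting the fraction of
    parameters with the largest gradient magnitude; all other parameters stay frozen."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    flat = torch.cat([g.abs().flatten() for g in grads])
    k = max(1, int(sparsity * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return [(g.abs() >= threshold).float() for g in grads]

# During fine-tuning, gradients of unselected parameters are zeroed after backward():
#   for p, m in zip(params, masks): p.grad.mul_(m)
```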
Linking in Style: Understanding learned features in deep learning models
Maren Wehrheim · Pamela Osuna Vargas · Matthias Kaschube
Convolutional neural networks (CNNs) learn abstract features to perform object classification, but understanding these features remains challenging due to difficult-to-interpret results or high computational costs. We propose an automatic method to systematically visualize and objectively analyze learned features in CNNs. Specifically, we introduce a linking network that maps the penultimate layer of a pre-trained classifier to the latent space of a generative model (StyleGAN-XL), thereby enabling an interpretable, human-friendly visualization of the classifier's representations. Our findings indicate a congruent semantic order in both spaces, enabling a direct linear mapping between them. Training the linking network is computationally inexpensive and decoupled from training both the GAN and the classifier. We introduce an automatic pipeline that utilizes such GAN-based visualizations to quantify learned representations by analyzing activation changes in the classifier in the image domain. This quantification allows us to investigate learned representations in several thousand units simultaneously and to extract and visualize units selective for specific semantic concepts. Further, we illustrate how our method can quantify and interpret the classifier's decision boundary using counterfactual examples. Overall, our method offers systematic and objective perspectives on learned abstract representations in CNNs.
Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
Jeeyung Kim · Ze Wang · Qiang Qiu
Enhancing model interpretability can address spurious correlations by revealing how models draw their predictions. Concept Bottleneck Models (CBMs) can provide a principled way of disclosing and guiding model behaviors through human-understandable concepts, albeit at a high cost of human efforts in data annotation. In this paper, we leverage a synergy of multiple foundation models to construct CBMs with nearly no human effort. We discover undesirable biases in CBMs built on pre-trained models and propose a novel framework designed to exploit pre-trained models while being immune to these biases, thereby reducing vulnerability to spurious correlations. Specifically, our method offers a seamless pipeline that adopts foundation models for assessing potential spurious correlations in datasets, annotating concepts for images, and refining the annotations for improved robustness. We evaluate the proposed method on multiple datasets, and the results demonstrate its effectiveness in reducing model reliance on spurious correlations while preserving its interpretability.
Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning
Zhiyu Wu · Jin shi Cui
Image-level weak-to-strong consistency serves as the predominant paradigm in semi-supervised learning (SSL) due to its simplicity and impressive performance. Nonetheless, this approach confines all perturbations to the image level and suffers from the excessive presence of naive samples, thus necessitating further improvement. In this paper, we introduce feature-level perturbation with varying intensities and forms to expand the augmentation space, establishing the image-feature weak-to-strong consistency paradigm. Furthermore, our paradigm develops a triple-branch structure, which facilitates interactions between both types of perturbations within one branch to boost their synergy. Additionally, we present a confidence-based identification strategy to distinguish between naive and challenging samples, thus introducing additional challenges exclusively for naive samples. Notably, our paradigm can seamlessly integrate with existing SSL methods. We apply the proposed paradigm to several representative algorithms and conduct experiments on multiple benchmarks, including both balanced and imbalanced distributions for labeled samples. The results demonstrate a significant enhancement in the performance of existing SSL algorithms.
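One simple instance of the image-feature weak-to-strong consistency idea is to supervise both a strongly augmented image branch and a feature-perturbed (dropout) branch with the weak-view pseudo-label; a hedged PyTorch sketch, with the perturbation choice and names being illustrative:

```python
import torch
import torch.nn.functional as F

def weak_strong_consistency_loss(encoder, head, weak_img, strong_img, feat_drop=0.5):
    # Pseudo-label from the weak view (no gradient through the labeling pass).
    with torch.no_grad():
        pseudo = head(encoder(weak_img)).argmax(dim=1)
    # Image-level branch: strong augmentation of the input.
    loss_img = F.cross_entropy(head(encoder(strong_img)), pseudo)
    # Feature-level branch: perturb the weak-view features (here with dropout).
    feats = encoder(weak_img)
    loss_feat = F.cross_entropy(head(F.dropout(feats, p=feat_drop)), pseudo)
    return loss_img + loss_feat
```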
Strike a Balance in Continual Panoptic Segmentation
Jinpeng Chen · Runmin Cong · Yuxuan Luo · Horace Ho Shing Ip · Sam Kwong
This study explores the emerging area of continual panoptic segmentation, highlighting three key balances. First, we introduce past-class backtrace distillation to balance the stability of existing knowledge with the adaptability to new information. This technique retraces the features associated with past classes based on the final label assignment results, performing knowledge distillation targeting these specific features from the previous model while allowing other features to flexibly adapt to new information. Additionally, we introduce a class-proportional memory strategy, which aligns the class distribution in the replay sample set with that of the historical training data. This strategy maintains a balanced class representation during replay, enhancing the utility of the limited-capacity replay sample set in recalling prior classes. Moreover, recognizing that replay samples are annotated only for the classes of their original step, we devise balanced anti-misguidance losses, which combat the impact of incomplete annotations without incurring classification bias. Building upon these innovations, we present a new method named Balanced Continual Panoptic Segmentation (BalConpas). Our evaluation on the challenging ADE20K dataset demonstrates its superior performance compared to existing state-of-the-art methods.
IGNORE: Information Gap-based False Negative Loss Rejection for Single Positive Multi-Label Learning
Gyeong Ryeol Song · Noo-ri Kim · Jin-Seop Lee · Jee-Hyong LEE
Single Positive Multi-Label Learning (SPML) is a multi-label classification task in which each image is assigned only one positive label while the other labels are left unannotated. Most approaches for SPML treat unannotated labels as negatives ("Assumed Negative", AN). However, with this assumption, some positive labels are inevitably regarded as negative (false negatives), degrading model performance. Therefore, identifying false negatives is the most critical step under the AN assumption. Previous approaches identified false negative labels using the model outputs for assumed negative labels; however, because those models were trained with noisy negative labels, their outputs were not reliable. It is therefore necessary to effectively utilize the most reliable information available in SPML to identify false negative labels. In this paper, we propose an Information Gap-based false negative Loss Rejection method (IG-LR) for SPML. We generate a masked image in which everything is removed except the discriminative area of the single positive label. It is reasonable to expect that when the masked image contains no information about an object, the model's logit for that object is low. Based on this intuition, we identify a label as a false negative if it shows a significant logit gap between the masked image and the original image. By rejecting false negatives during model training, we prevent the model from being biased toward false negative labels and build more reliable models. We evaluate our method on four datasets: Pascal VOC 2012, MS COCO, NUSWIDE, and CUB. Compared to previous state-of-the-art SPML methods, our method outperforms them on most of the datasets.
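A minimal illustration of the logit-gap test described above, assuming per-image logits for the original and masked inputs and a hypothetical threshold `tau`; this is a sketch, not the paper's exact rule.
```python
# Hypothetical sketch of rejecting suspected false negatives via a masked-vs-original logit gap.
import torch

def keep_negative_loss_mask(logits_orig, logits_masked, positive_idx, tau=2.0):
    """
    logits_orig, logits_masked: (num_classes,) logits for the original / masked image.
    positive_idx: index of the single annotated positive label.
    Returns a boolean mask; True = keep the assumed-negative loss term for that class.
    """
    gap = logits_orig - logits_masked   # large gap suggests the object is present -> likely false negative
    keep = gap <= tau                   # drop the negative loss for labels with a large gap
    keep[positive_idx] = True           # the annotated positive label is always kept
    return keep
```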
Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning
Jiahao Xiao · Ming-Kun Xie · Heng-Bo Fan · Gang Niu · Masashi Sugiyama · Sheng-Jun Huang
Semi-supervised multi-label learning (SSMLL) is a powerful framework for leveraging unlabeled data to reduce the expensive cost of collecting precise multi-label annotations. Unlike semi-supervised learning, one cannot select the most probable label as the pseudo-label in SSMLL due to multiple semantics contained in an instance. To solve this problem, the mainstream method developed an effective thresholding strategy to generate accurate pseudo-labels. Unfortunately, the method neglected the quality of model predictions and its potential impact on pseudo-labeling performance. In this paper, we propose a dual-perspective method to generate high-quality pseudo-labels. To improve the quality of model predictions, we perform dual-decoupling to boost the learning of correlative and discriminative features, while refining the generation and utilization of pseudo-labels. To obtain proper class-wise thresholds, we propose the metric-adaptive thresholding strategy to estimate the thresholds, which maximize the pseudo-label performance for a given metric on labeled data. Experiments on multiple benchmark datasets show the proposed method can achieve state-of-the-art performance and outperform the comparative methods by a significant margin.
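The metric-adaptive thresholding idea can be pictured as a per-class grid search on the labeled set; the sketch below uses F1 as the metric and a fixed candidate grid, both of which are assumptions rather than the paper's exact choices.
```python
# Hypothetical sketch: pick per-class thresholds that maximize a metric on labeled data.
import numpy as np
from sklearn.metrics import f1_score

def class_wise_thresholds(probs, labels, candidates=np.linspace(0.05, 0.95, 19)):
    """probs: (N, C) predicted probabilities on labeled data; labels: (N, C) binary ground truth."""
    num_classes = probs.shape[1]
    thresholds = np.zeros(num_classes)
    for c in range(num_classes):
        scores = [f1_score(labels[:, c], probs[:, c] >= t, zero_division=0) for t in candidates]
        thresholds[c] = candidates[int(np.argmax(scores))]
    return thresholds

# usage: pseudo = (unlabeled_probs >= class_wise_thresholds(val_probs, val_labels)).astype(int)
```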
Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation
Arpit Garg · Cuong Cao Nguyen · RAFAEL FELIX · Thanh-Toan Do · Gustavo Carneiro
Deep learning faces a formidable challenge when handling noisy labels, as models tend to overfit samples affected by label noise. This challenge is further compounded by the presence of instance-dependent noise (IDN), a realistic form of label noise arising from ambiguous sample information. To address IDN, Label Noise Learning (LNL) incorporates a sample selection stage to differentiate clean and noisy-label samples. This stage uses an arbitrary criterion and a pre-defined curriculum that initially selects most samples as noisy and gradually decreases this selection rate during training. Such a curriculum is sub-optimal since it does not consider the actual label noise rate in the training set. This paper addresses this issue with a new noise-rate estimation method that is easily integrated with most state-of-the-art (SOTA) LNL methods to produce a more effective curriculum. Results on synthetic and real-world benchmarks demonstrate that integrating our approach with SOTA LNL methods improves accuracy in most cases.
Learning to Distinguish Samples for Generalized Category Discovery
Fengxiang Yang · Pu Nan · Wenjing Li · Zhiming Luo · Shaozi Li · Niculae Sebe · Zhun Zhong
Generalized Category Discovery (GCD) utilizes labelled data from seen categories to cluster unlabelled samples from both seen and unseen categories. Previous methods have demonstrated that assigning pseudo-labels for representation learning is effective. However, these methods commonly predict pseudo-labels based on pairwise similarities, while the overall relationship among each instance's k-nearest neighbors (kNNs) is largely overlooked, leading to inaccurate pseudo-labeling. To address this issue, we introduce a Neighbor Graph Convolutional Network (NGCN) that learns to predict pairwise similarities between instances using only labelled data. NGCN explicitly leverages the relationships among each instance's kNNs and is generalizable to samples of both seen and unseen classes. This helps produce more accurate positive samples by injecting the predicted similarities into subsequent clustering. Furthermore, we design a Cross-View Consistency Strategy (CVCS) to exclude samples with noisy pseudo-labels generated by clustering. This is achieved by comparing clusters from two different clustering algorithms. The filtered unlabelled data with pseudo-labels and the labelled data are then used to optimize the model through cluster- and instance-level contrastive objectives. The collaboration between NGCN and CVCS ensures the learning of a robust model, resulting in significant improvements in both seen and unseen class accuracies. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both generic and fine-grained GCD benchmarks.
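As a loose, hypothetical sketch of predicting pairwise similarities from kNN neighborhoods (not the paper's NGCN architecture): aggregate each instance's neighbours, mix with its own features through one learnable matrix `weight`, and score pairs by an inner product.
```python
# Hypothetical sketch: kNN-neighbourhood aggregation followed by pairwise similarity scoring.
import torch
import torch.nn.functional as F

def knn_adjacency(feats, k=10):
    normed = F.normalize(feats, dim=1)
    sim = normed @ normed.T
    idx = sim.topk(k + 1, dim=1).indices[:, 1:]   # drop the self-match
    adj = torch.zeros_like(sim)
    adj.scatter_(1, idx, 1.0)
    return adj

def pairwise_similarity(feats, weight, k=10):
    """feats: (N, D); weight: learnable (2*D, D) mixing matrix (e.g. an nn.Parameter)."""
    adj = knn_adjacency(feats, k)
    agg = (adj @ feats) / k                        # mean over each instance's kNNs
    h = F.normalize(torch.relu(torch.cat([feats, agg], dim=1) @ weight), dim=1)
    return torch.sigmoid(h @ h.T)                  # (N, N) predicted similarities
```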
Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data
Junha Song · Tae Soo Kim · Junha Kim · Gunhee Nam · Thijs Kooi · Choo Jaegul
This paper aims to adapt the model to the target environment by leveraging large unlabeled target data and small user feedback readily available in real-world applications. We find that existing semi-supervised domain adaptation (SemiSDA) methods often yield only marginal adaptation improvements when directly utilizing such data. We analyze this phenomenon via a novel concept called Negatively Biased Feedback (NBF), which stems from the observation that user feedback is more likely for data points where the model produces incorrect predictions. To leverage such feedback without this problem, we propose a scalable adaptation approach, Class-space Defending, which can seamlessly combine with existing SemiSDA methods. This approach helps the SemiSDA method adapt the model with a balanced supervised signal by utilizing our defending samples throughout the adaptation process. We demonstrate the problem caused by NBF and the efficacy of our approach across various benchmarks, including image classification, semantic segmentation, and a real-world medical imaging application. Our extensive experiments show that significant performance improvements can be achieved by integrating our approach with multiple state-of-the-art SemiSDA methods.
HVCLIP: High-dimensional Vector in CLIP for Unsupervised Domain Adaptation
Noranart Vesdapunt · Kah Kuen Fu · Yue Wu · Xu Zhang · Pradeep Natarajan
Recent advances in large-scale pre-trained models (such as CLIP) have significantly improved unsupervised domain adaptation (UDA) by leveraging the pre-trained knowledge to bridge the source and target domain gap. Catastrophic forgetting is the main challenge for CLIP in UDA, where traditional fine-tuning of CLIP on a target domain can quickly override CLIP's pre-trained knowledge. To address this issue, we propose to convert CLIP's features into a high-dimensional vector (hypervector) space and utilize the robustness property of hypervectors to mitigate catastrophic forgetting. We first study the feature dimension size in the hypervector space to empirically find the dimension threshold at which feature patterns become redundant enough to avoid excessive training (thus mitigating catastrophic forgetting). To further utilize the robustness of hypervectors, we propose Discrepancy Reduction to reduce the domain shift between source and target domains, and Feature Augmentation to synthesize labeled target domain features from source domain features. We achieve the best results on four public UDA datasets, and we show the generalization of our method to other applications (few-shot learning, continual learning) and the model-agnostic property of our method across vision-language and vision backbones.
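One simple way to picture the hypervector conversion, borrowed from standard hyperdimensional-computing practice rather than taken from the paper, is a fixed random projection followed by sign binarization; `clip_dim` and `hv_dim` are assumed sizes.
```python
# Hypothetical sketch: lift CLIP features into a high-dimensional (hypervector) space.
import torch

torch.manual_seed(0)
clip_dim, hv_dim = 512, 10000                # assumed CLIP feature size and hypervector size
projection = torch.randn(clip_dim, hv_dim)   # fixed random projection, not trained

def to_hypervector(clip_features):
    """clip_features: (batch, clip_dim) -> bipolar hypervectors of shape (batch, hv_dim)."""
    return torch.sign(clip_features @ projection)
```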
DiffClass: Diffusion-Based Class Incremental Learning
Zichong Meng · Jie Zhang · Changdi Yang · Zheng Zhan · Pu Zhao · Yanzhi Wang
Class Incremental Learning (CIL) is challenging due to catastrophic forgetting. On top of that, exemplar-free Class Incremental Learning is even more challenging due to forbidden access to previous task data. Recent exemplar-free CIL methods attempt to mitigate catastrophic forgetting by synthesizing previous task data. However, they fail to overcome catastrophic forgetting due to the inability to deal with the significant domain gap between real and synthetic data. To overcome these issues, we propose a novel exemplar-free CIL method. Our method adopts multi-distribution matching (MDM) diffusion models to unify quality and bridge domain gaps among all domains of training data. Moreover, our approach integrates selective synthetic image augmentation (SSIA) to expand the distribution of the training data, thereby improving the model's plasticity and reinforcing the performance of our method's ultimate component, multi-domain adaptation (MDA). With the proposed integrations, our method reformulates exemplar-free CIL into a multi-domain adaptation problem to implicitly address the domain gap and enhance model stability during incremental training. Extensive experiments on benchmark class incremental datasets and settings demonstrate that our method surpasses previous exemplar-free CIL methods and achieves state-of-the-art performance.
Direct Distillation between Different Domains
Jialiang Tang · Shuo Chen · Gang Niu · Hongyuan Zhu · Joey Tianyi Zhou · Chen Gong · Masashi Sugiyama
Knowledge Distillation (KD) aims to learn a compact student network using knowledge from a large pre-trained teacher network, where both networks are trained on data from the same distribution. However, in practical applications, the student network may be required to perform in a new scenario (i.e., the target domain), which usually exhibits significant differences from the known scenario of the teacher network (i.e., the source domain). The traditional domain adaptation techniques can be integrated with KD in a two-stage process to bridge the domain gap, but the ultimate reliability of two-stage approaches tends to be limited due to the high computational consumption and the additional errors accumulated from both stages. To solve this problem, we propose a new one-stage method dubbed "Direct Distillation between Different Domains" (4Ds). We first design a learnable adapter based on the Fourier transform to separate the domain-invariant knowledge from the domain-specific knowledge. Then, we build a fusion-activation mechanism to transfer the valuable domain-invariant knowledge to the student network, while simultaneously encouraging the adapter within the teacher network to learn the domain-specific knowledge of the target data. As a result, the teacher network can effectively transfer categorical knowledge that aligns with the target domain of the student network. Intensive experiments on various benchmark datasets demonstrate that our proposed 4Ds method successfully produces reliable student networks and outperforms state-of-the-art approaches.
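A rough sketch of a Fourier-domain adapter that splits a feature map into two complementary components with a learnable frequency mask is given below; the soft-mask split is an assumption and not necessarily how 4Ds separates domain-invariant from domain-specific knowledge.
```python
# Hypothetical sketch: split features into two components via a learnable frequency mask.
import torch
import torch.nn as nn

class FourierAdapter(nn.Module):
    def __init__(self, height, width):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(height, width))  # shared across channels

    def forward(self, x):                          # x: (B, C, H, W)
        freq = torch.fft.fft2(x, norm="ortho")
        mask = torch.sigmoid(self.mask_logits)     # soft frequency selection in [0, 1]
        invariant = torch.fft.ifft2(freq * mask, norm="ortho").real
        specific = torch.fft.ifft2(freq * (1 - mask), norm="ortho").real
        return invariant, specific
```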
MemBN: Robust Test-Time Adaptation via Batch Norm with Statistics Memory
Juwon Kang · Nayeong Kim · Jungseul Ok · Suha Kwak
Test-time adaptation (TTA) has emerged as a promising approach to dealing with latent distribution shifts between training and testing data. However, most existing TTA methods often struggle with small input batches, as they heavily rely on batch statistics that become less reliable as the batch size decreases. In this paper, we introduce memory-based batch normalization (MemBN) to enhance the robustness of TTA across a wide range of batch sizes. MemBN leverages statistics memory queues within each batch normalization layer, accumulating the latest test batch statistics. Through dedicated memory management and aggregation algorithms, it enables the estimation of reliable statistics that faithfully represent the data distribution of the test domain at hand, leading to improved performance and robust test-time adaptation. Extensive experiments under a large variety of TTA scenarios demonstrate MemBN's superiority in terms of both accuracy and robustness.
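A toy version of batch normalization backed by a queue of recent test-batch statistics is sketched below; the queue length and the plain averaging rule are assumptions, not MemBN's dedicated memory management and aggregation algorithms.
```python
# Hypothetical sketch: test-time batch norm that aggregates a queue of recent batch statistics.
import collections
import torch
import torch.nn as nn

class QueueBN(nn.Module):
    def __init__(self, num_features, queue_len=16, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.means = collections.deque(maxlen=queue_len)   # memory of recent batch means
        self.vars = collections.deque(maxlen=queue_len)    # memory of recent batch variances
        self.eps = eps

    @torch.no_grad()
    def forward(self, x):                                  # x: (B, C, H, W) at test time
        self.means.append(x.mean(dim=(0, 2, 3)))
        self.vars.append(x.var(dim=(0, 2, 3), unbiased=False))
        mean = torch.stack(tuple(self.means)).mean(dim=0)  # aggregate over the memory
        var = torch.stack(tuple(self.vars)).mean(dim=0)
        x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.eps)
        return x_hat * self.weight[None, :, None, None] + self.bias[None, :, None, None]
```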
PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning
Haiyang Guo · Fei Zhu · Wenzhuo Liu · Xu-Yao Zhang · Cheng-Lin Liu
Existing federated learning methods have effectively dealt with decentralized learning in scenarios involving data privacy and non-IID data. However, in real-world situations, each client dynamically learns new classes, requiring the global model to classify all seen classes. To effectively mitigate catastrophic forgetting and data heterogeneity under low communication costs, we propose a simple and effective method named PILoRA. On the one hand, we adopt prototype learning to learn better feature representations and leverage the heuristic information between prototypes and class features to design a prototype re-weight module to solve the classifier bias caused by data heterogeneity without retraining the classifier. On the other hand, we view incremental learning as the process of learning distinct task vectors and encoding them within different LoRA parameters. Accordingly, we propose Incremental LoRA to mitigate catastrophic forgetting. Experimental results on standard datasets indicate that our method outperforms the state-of-the-art approaches significantly. More importantly, our method exhibits strong robustness and superiority in different settings and degrees of data heterogeneity. Our code will be publicly available.
PromptFusion: Decoupling Stability and Plasticity for Continual Learning
Haoran Chen · Zuxuan Wu · Xintong Han · Menglin Jia · Yu-Gang Jiang
Current research on continual learning mainly focuses on relieving catastrophic forgetting, and most of this success comes at the cost of limiting performance on newly incoming tasks. Such a trade-off is referred to as the stability-plasticity dilemma and is a more general and challenging problem for continual learning. However, the inherent conflict between these two concepts makes it seemingly impossible to devise a satisfactory solution to both of them simultaneously. Therefore, we ask, "is it possible to divide them into two separate problems to conquer them independently?". To this end, we propose a prompt-tuning-based method termed PromptFusion to enable the decoupling of stability and plasticity. Specifically, PromptFusion consists of a carefully designed Stabilizer module that deals with catastrophic forgetting and a Booster module to learn new knowledge concurrently. Furthermore, to address the computational overhead brought by the additional architecture, we propose PromptFusion-Lite, which improves PromptFusion by dynamically determining whether to activate both modules for each input image. Extensive experiments show that both PromptFusion and PromptFusion-Lite achieve promising results on popular continual learning datasets for class-incremental and domain-incremental settings. Especially on Split-Imagenet-R, one of the most challenging datasets for class-incremental learning, our method exceeds the state-of-the-art prompt-based method CODAPrompt by more than 5% in accuracy, with PromptFusion-Lite using 14.8% less computational resources than PromptFusion.
Prompt-based Continual Learning (PCL) has gained considerable attention as a promising continual learning solution because it achieves state-of-the-art performance while preventing privacy violations and memory overhead problems. Nonetheless, existing PCL approaches face significant computational burdens because of two Vision Transformer (ViT) feed-forward stages: one for the query ViT that generates a prompt query to select prompts inside a prompt pool, and the other for a backbone ViT that mixes information between selected prompts and image tokens. To address this, we introduce a one-stage PCL framework that directly uses an intermediate layer's token embedding as the prompt query. This design removes the need for an additional feed-forward stage for the query ViT, resulting in ~50% computational cost reduction for both training and inference with a marginal accuracy drop (<1%). We further introduce a Query-Pool Regularization (QR) loss that regulates the relationship between the prompt query and the prompt pool to improve representation power. The QR loss is only applied during training, so it incurs no computational overhead at inference. With the QR loss, our approach maintains the ~50% computational cost reduction during inference while outperforming prior two-stage PCL methods by ~1.4% on public class-incremental continual learning benchmarks, including CIFAR-100, ImageNet-R, and DomainNet.
Is Retain Set All You Need in Machine Unlearning? Restoring Performance of Unlearned Models with Out-Of-Distribution Images
Jacopo Bonato · Marco Cotogni · Luigi Sabetta
In this paper, we introduce Selective-distillation for Class and Architecture-agnostic unleaRning (SCAR), a novel approximate unlearning method. SCAR efficiently eliminates specific information while preserving the model's test accuracy without using a retain set, which is a key component in state-of-the-art approximate unlearning algorithms. Our approach utilizes a modified Mahalanobis distance to guide the unlearning of the feature vectors of the instances to be forgotten, aligning them to the nearest wrong class distribution. Moreover, we propose a distillation-trick mechanism that distills the knowledge of the original model into the unlearning model with out-of-distribution images, retaining the original model's test performance without using any retain set. Importantly, we propose a self-forget version of SCAR that unlearns without having access to the forget set. We experimentally verify the effectiveness of our method on three public datasets, comparing it with state-of-the-art methods. Our method achieves higher performance than methods that operate without a retain set and is comparable to the best methods that rely on the retain set.
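The Mahalanobis-guided forgetting step can be sketched as follows, assuming precomputed class means and a shared inverse covariance; the exact loss form in SCAR may differ.
```python
# Hypothetical sketch: nudge forget-set features toward the nearest wrong-class distribution.
import torch

def mahalanobis_sq(x, mean, cov_inv):
    d = x - mean
    return (d @ cov_inv * d).sum(dim=1)               # squared Mahalanobis distance per sample

def forget_loss(features, true_labels, class_means, cov_inv):
    """features: (B, D); class_means: (C, D); cov_inv: shared (D, D) inverse covariance."""
    dists = torch.stack([mahalanobis_sq(features, m, cov_inv) for m in class_means], dim=1)
    dists.scatter_(1, true_labels[:, None], float("inf"))   # exclude the class being forgotten
    nearest_wrong = dists.argmin(dim=1)
    # minimizing this pulls the forgotten instances toward the nearest wrong class distribution
    return mahalanobis_sq(features, class_means[nearest_wrong], cov_inv).mean()
```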
Idling Neurons, Appropriately Lenient Workload During Fine-tuning Leads to Better Generalization
Hongjing Niu · Hanting Li · Bin Li · Feng Zhao
Pre-training on large-scale datasets has become a fundamental method for training deep neural networks. Pre-training provides a better set of parameters than random initialization, which reduces the training cost of deep neural networks on the target task. In addition, pre-training provides a large number of feature representations, which may help improve generalization capabilities. However, this potential advantage has not received enough attention and has been buried by rough fine-tuning. Based on some exploratory experiments, this paper rethinks the fine-tuning process and gives a new perspective on understanding fine-tuning. Moreover, this paper proposes several plug-and-play fine-tuning strategies as alternatives to simple fine-tuning. These strategies all preserve pre-trained features better by allowing some neurons to idle, leading to better generalization.
How to Train the Teacher Model for Effective Knowledge Distillation
Shayan Mohajer Hamidi · Xizhen Deng · Renhao Tan · Linfeng Ye · Ahmed Hussein Salamah
Recently, it was shown that the role of the teacher in knowledge distillation (KD) is to provide the student with an estimate of the true Bayes conditional probability density (BCPD). Notably, these findings show that the student's error rate can be upper-bounded by the mean squared error (MSE) between the teacher's output and the BCPD. Consequently, to enhance KD efficacy, the teacher should be trained such that its output is close to the BCPD in the MSE sense. This paper elucidates that training the teacher model with an MSE loss equates to minimizing the MSE between its output and the BCPD, aligning with its core responsibility of providing the student with a BCPD estimate that closely resembles it in MSE terms. In this respect, through a comprehensive set of experiments, we demonstrate that substituting the conventional teacher trained with cross-entropy loss with one trained using MSE loss in state-of-the-art KD methods consistently boosts the student's accuracy, resulting in improvements of up to 2.2%.
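The suggested swap is simple to state in code: train the teacher against one-hot targets with an MSE loss on its softmax output instead of cross-entropy. The optimizer, model, and shapes below are placeholders.
```python
# Hypothetical sketch: one teacher training step with MSE on softmax outputs vs one-hot targets.
import torch
import torch.nn.functional as F

def mse_teacher_step(teacher, images, labels, num_classes, optimizer):
    optimizer.zero_grad()
    probs = F.softmax(teacher(images), dim=1)
    targets = F.one_hot(labels, num_classes).float()
    loss = F.mse_loss(probs, targets)   # pushes the teacher's output toward a BCPD estimate in the MSE sense
    loss.backward()
    optimizer.step()
    return loss.item()
```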
Local and Global Flatness for Federated Domain Generalization
Hao Yan · Yuhong Guo
Federated learning aims to train a unified model using isolated data distributed across multiple clients. However, traditional federated learning settings assume identical data distributions for the training and testing sets, neglecting the demand for the model's cross-domain generalization ability when this assumption does not hold. Federated domain generalization seeks to develop a model that is capable of generalizing to unseen testing domains based on data isolated on multiple clients and sampled from multiple training domains. Challenges within this problem stem from both the lack of access to data from unseen testing domains and the absence of data from multiple training domains in each client. To address this, we propose a simple federated domain generalization method that seeks both local and global flatness. The proposed local flatness constraint prevents the model from converging to sharp minima, while the global flatness constraint encourages movement toward the global optimum. Both flatness constraints rely on adversarial parameter perturbation, with two perturbation methods proposed at the level of weights and singular values. Experimental results demonstrate that our proposed methods achieve state-of-the-art performance on standard federated domain generalization benchmarks.
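A SAM-style sketch of a local-flatness step via adversarial weight perturbation is shown below; the perturbation radius `rho` and the single ascent step are assumptions, and the global-flatness and singular-value variants from the abstract are not covered.
```python
# Hypothetical sketch: one flatness-seeking step via adversarial weight perturbation.
import torch

def flatness_step(model, loss_fn, batch, rho=0.05):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    scale = rho / (torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12)
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g, alpha=scale.item())            # ascend toward a nearby "sharp" point
    perturbed_loss = loss_fn(model(inputs), targets)
    perturbed_loss.backward()                        # gradients at the perturbed point favor flat minima
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(g, alpha=scale.item())            # restore the original weights
    return perturbed_loss
```
The caller would zero gradients before this step and call `optimizer.step()` afterwards.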
Dataset Quantization with Active Learning based Adaptive Sampling
Zhenghao Zhao · Yuzhang Shang · Junyi Wu · Yan Yan
Deep learning has made remarkable progress recently, largely due to the availability of large, well-labeled datasets. However, training on such datasets elevates costs and computational demands. To address this, various techniques like coreset selection, dataset distillation, and dataset quantization have been explored in the literature. Unlike traditional techniques that depend on uniform sample distributions across different classes, our research demonstrates that maintaining performance is feasible even with uneven distributions. We find that for certain classes, the variation in sample quantity has a minimal impact on performance. Inspired by this observation, an intuitive idea is to reduce the number of samples for stable classes and increase the number of samples for sensitive classes to achieve better performance at the same sampling ratio. Then the question arises: how can we adaptively select samples from a dataset to achieve optimal performance? In this paper, we propose a novel active learning based adaptive sampling strategy, Dataset Quantization with Active Learning based Adaptive Sampling (DQAS), to optimize the sample selection. In addition, we introduce a novel pipeline for dataset quantization, utilizing the feature space from the final stage of dataset quantization to generate more precise dataset bins. Our comprehensive evaluations on multiple datasets show that our approach outperforms state-of-the-art dataset compression methods.
DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
Aditya Annavajjala · Alind Khare · Animesh Agrawal · Igor Fedorov · Hugo M Latapie · Myungjin Lee · Alexey Tumanov
CNNs are increasingly deployed across different hardware, dynamic environments, and low-power embedded devices. This has led to the design and training of CNN architectures with the goal of maximizing accuracy subject to such variable deployment constraints. As the number of deployment scenarios grows, there is a need for scalable solutions to design and train specialized CNNs. Once-for-all training has emerged as a scalable approach that jointly co-trains many models (subnets) at once with a constant training cost and finds specialized CNNs later. The scalability is achieved by training the full model and simultaneously reducing it to smaller subnets that share model weights (weight-sharing). However, existing once-for-all training approaches incur huge training costs reaching 1200 GPU hours. We argue this is because they either start the process of shrinking the full model too early or too late. Hence, we propose DεpS, which starts the process of shrinking the full model when it is partially trained (∼50%), leading to improved training cost. The proposed approach also consists of novel heuristics that dynamically adjust subnet learning rates incrementally (ε), leading to better weight-shared knowledge distillation from larger to smaller subnets. As a result, DεpS outperforms state-of-the-art once-for-all training techniques across different datasets including CIFAR10/100, ImageNet-100, and ImageNet-1k on both accuracy and cost. It achieves 1.83% higher ImageNet-1k top-1 accuracy, or the same accuracy with a 1.3x reduction in FLOPs and a 2.5x drop in training cost (GPU-hours).
Auto-DAS: Automated Proxy Discovery for Training-free Distillation-aware Architecture Search
Haosen SUN · Lujun Li · Peijie Dong · Zimian Wei · Shitong Shao
Distillation-aware Architecture Search (DAS) seeks to discover the ideal student architecture that delivers superior performance by distilling knowledge from a given teacher model. Previous DAS methods involve time-consuming training-based search processes. Recently, the training-free DAS method DisWOT proposes KD-based proxies and achieves significant search acceleration. However, we observe that DisWOT suffers from limitations such as the need for manual design and poor generalization to diverse architectures, such as the Vision Transformer (ViT). To address these issues, we present Auto-DAS, an automatic proxy discovery framework using an Evolutionary Algorithm (EA) for training-free DAS. Specifically, we empirically find that proxies conditioned on student instinct statistics and teacher-student interaction statistics can effectively predict distillation accuracy. Then, we represent the proxy with computation graphs and construct the proxy search space using instinct and interaction statistics as inputs. To identify promising proxies, our search space incorporates various types of basic transformations and network distance operators inspired by previous proxy and KD-loss designs. Next, our EA initializes populations, evaluates them, performs crossover and mutation operations, and selects the candidate whose score correlates best with distillation accuracy. We introduce an adaptive-elite selection strategy to enhance search efficiency and strike a balance between exploitation and exploration. Finally, we conduct training-free DAS with the discovered proxy before the optimal student distillation phase. In this way, our auto-discovery framework eliminates the need for manual design and tuning, while also adapting to different search spaces through direct correlation optimization. Extensive experiments demonstrate that Auto-DAS generalizes well to various architectures and search spaces (e.g., ResNet, ViT, NAS-Bench-101, and NAS-Bench-201), achieving state-of-the-art results in both ranking correlation and final searched accuracy. Code is included in the supplementary materials.
On Spectral Properties of Gradient-based Explanation Methods
Amir Mehrpanah · Erik Englesson · Hossein Azizpour
Understanding the behavior of deep networks is crucial to increasing our confidence in their results. Despite an extensive body of work on explaining their predictions, researchers have faced reliability issues, which can be attributed to insufficient formalism. In our research, we adopt novel probabilistic and spectral perspectives to formally analyze explanation methods. Our study reveals a pervasive spectral bias stemming from the use of gradients, and sheds light on some common design choices that have been discovered experimentally, in particular the use of squared gradients and input perturbation. We further characterize how the choice of perturbation hyperparameters in explanation methods, such as SmoothGrad, can lead to inconsistent explanations, and we introduce two remedies based on our proposed formalism: (i) a mechanism to determine a standard perturbation scale, and (ii) an aggregation method which we call SpectralLens. Finally, we substantiate our theoretical results through quantitative evaluations.
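For concreteness, a minimal SmoothGrad implementation is sketched below, making the perturbation scale `sigma` (the hyperparameter the abstract analyzes) explicit; the sample count and the squared-gradient toggle are arbitrary choices here, and SpectralLens itself is not shown.
```python
# Hypothetical sketch: SmoothGrad saliency with an explicit perturbation scale sigma.
import torch

def smoothgrad(model, image, target_class, sigma=0.15, n_samples=32, squared=False):
    """image: (1, C, H, W). Returns the averaged (optionally squared) gradient saliency map."""
    total = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + sigma * torch.randn_like(image)).requires_grad_(True)
        score = model(noisy)[0, target_class]
        grad = torch.autograd.grad(score, noisy)[0]
        total += grad.pow(2) if squared else grad
    return total / n_samples
```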
Existing work in trustworthy machine learning primarily focuses on single-input adversarial perturbations. In many real-world attack scenarios, input-agnostic adversarial attacks, e.g. universal adversarial perturbations (UAPs), are much more feasible. Current certified training methods train models robust to single-input perturbations but achieve suboptimal clean and UAP accuracy, thereby limiting their applicability in practical applications. We propose a novel method, CITRUS, for certified training of networks robust against UAP attackers. We show in an extensive evaluation across different datasets, architectures, and perturbation magnitudes that our method outperforms traditional certified training methods on standard accuracy (up to 10.3%) and achieves SOTA performance on the more practical certified UAP accuracy metric.
Interpretability-Guided Test-Time Adversarial Defense
Akshay Ravindra Kulkarni · Tsui-Wei Weng
In this work, we propose a novel and low-cost test-time defense by devising interpretability-guided neuron importance ranking methods to identify neurons important to the output classes. Our method is a training-free approach that can significantly improve the robustness-accuracy tradeoff while incurring minimal computational overhead. While being among the most efficient test-time defenses (4x faster), our method is also robust to a wide range of black-box, white-box, and adaptive attacks that break previous test-time defenses. We demonstrate the efficacy of our method for CIFAR10, CIFAR100, and ImageNet-1k on the standard RobustBench benchmark (with average gains of 2.6%, 4.9%, and 2.8% respectively). We also show improvements (average 1.5%) over the state-of-the-art test-time defenses even under strong adaptive attacks.
Exploring Guided Sampling of Conditional GANs
Yifei Zhang · Mengfei Xia · Yujun Shen · Jiapeng Zhu · Ceyuan Yang · Kecheng Zheng · Lianghua Huang · Yu Liu · Fan Cheng
Guided sampling serves as a widely used inference technique in diffusion models to trade off sample fidelity and diversity. In this work, we confirm that generative adversarial networks (GANs) can also benefit from guided sampling, without even requiring a pre-trained classifier (i.e., classifier guidance) or an unconditional counterpart (i.e., classifier-free guidance) as in diffusion models. Inspired by the organized latent space in GANs, we manage to estimate the data-condition joint distribution from a well-learned conditional generator simply through vector arithmetic. With such an easy implementation, our approach improves the FID score of a state-of-the-art GAN model pre-trained on ImageNet 64x64 from 8.87 to 6.06, barely increasing the inference time. We then propose a learning-based variant of our framework to better approximate the distribution of the entire dataset, further improving the FID score to 4.37. It is noteworthy that our sampling strategy sufficiently closes the gap between GANs and one-step diffusion models (i.e., with FID 4.02) under comparable model size. We will release the code to facilitate future studies.
Self-Supervised Representation Learning for Adversarial Attack Detection
Yi Li · Plamen Angelov · Neeraj Suri
Supervised learning-based adversarial attack detection methods rely on a large amount of labeled data and suffer significant performance degradation when the trained model is applied to new domains. In this paper, we propose a self-supervised representation learning framework for the adversarial attack detection task to address this drawback. First, we map the pixels of augmented input images into an embedding space. Then, we employ a prototype-wise contrastive estimation loss to cluster prototypes as latent variables. Additionally, drawing inspiration from the concept of memory banks, we introduce a discrimination bank to distinguish and learn representations for each individual instance that shares the same or a similar prototype, establishing a connection between instances and their associated prototypes. Experimental results show that, compared to various benchmark self-supervised vision learning models and supervised adversarial attack detection methods, the proposed model achieves state-of-the-art performance on the adversarial attack detection task across a wide range of images.
Pretrained Deep Neural Networks (DNNs), developed from extensive datasets to integrate multifaceted knowledge, are increasingly recognized as valuable intellectual property (IP). To safeguard these models against IP infringement, strategies for ownership verification and usage authorization have emerged. Unlike most existing IP protection strategies that concentrate on restricting direct access to the model, our study addresses an extended DNN IP issue: applicability authorization, aiming to prevent the misuse of learned knowledge, particularly in unauthorized transfer learning scenarios. We propose Non-Transferable Pruning (NTP), a novel IP protection method that leverages model pruning to control a pretrained DNN's transferability to unauthorized data domains. Selective pruning can deliberately diminish a model's suitability on unauthorized domains, even with full fine-tuning. Specifically, our framework employs the alternating direction method of multipliers (ADMM) for optimizing both the model sparsity and an innovative non-transferable learning loss, augmented with Fisher-space discriminative regularization, to constrain the model's generalizability to the target dataset. We also propose a novel and effective metric to measure model non-transferability: the Area Under the Sample-wise Learning Curve (SLC-AUC). This metric accounts for full fine-tuning across various sample sizes. Experimental results demonstrate that NTP significantly surpasses state-of-the-art non-transferable learning methods, with an average SLC-AUC of -0.54 across diverse pairs of source and target domains, indicating that models trained with NTP are not suited for transfer learning to unauthorized target domains. The efficacy of NTP is validated in both supervised and self-supervised learning contexts, confirming its applicability in real-world scenarios.
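The SLC-AUC metric can be approximated by fine-tuning with increasing numbers of target samples, recording accuracy, and integrating the resulting curve; the log-scaled x-axis and chance-level baseline below are assumptions about its exact definition.
```python
# Hypothetical sketch: area under a sample-wise learning curve.
import numpy as np

def slc_auc(sample_sizes, accuracies, chance_level):
    """sample_sizes: e.g. [10, 100, 1000]; accuracies: fine-tuned accuracy at each size."""
    x = np.log10(np.asarray(sample_sizes, dtype=float))      # assumed log-scaled x-axis
    y = np.asarray(accuracies, dtype=float) - chance_level   # below-chance accuracy yields negative area
    area = ((y[1:] + y[:-1]) / 2.0 * np.diff(x)).sum()       # trapezoidal rule
    return float(area / (x[-1] - x[0]))                      # normalize by the curve length

# e.g. slc_auc([10, 100, 1000], [0.12, 0.15, 0.20], chance_level=0.10)
```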
On the Vulnerability of Skip Connections to Model Inversion Attacks
Jun Hao Koh · Sy-Tuyen Ho · Ngoc-Bao Nguyen · Ngai-Man Cheung
Skip connections are fundamental architecture designs for modern deep neural networks (DNNs) such as CNNs and ViTs. While they help improve model performance significantly, we identify a vulnerability associated with skip connections to Model Inversion (MI) attacks, a type of privacy attack that aims to reconstruct private training data through abusive exploitation of a model. In this paper, as pioneering work toward understanding how DNN architectures affect MI, we study the impact of skip connections on MI. We make the following discoveries: 1) Skip connections reinforce MI attacks and compromise data privacy. 2) Skip connections in the last stage are the most critical to attack. 3) RepVGG, an approach that removes skip connections in the inference-time architecture, could not mitigate the vulnerability to MI attacks. Based on our findings, we propose MI-resilient architecture designs for the first time. Without bells and whistles, we show in extensive experiments that our MI-resilient architectures can outperform state-of-the-art (SOTA) defense methods in MI robustness. Furthermore, our MI-resilient architectures are complementary to existing MI defense methods. Our code, pre-trained models, and additional results are included in the supplementary material.
Clean & Compact: Efficient Data-Free Backdoor Defense with Model Compactness
Huy Phan · Jinqi Xiao · Yang Sui · Tianfang Zhang · Zijie Tang · Cong Shi · Yan Wang · Yingying Chen · BO YUAN
Deep neural networks (DNNs) have been widely deployed in real-world, mission-critical applications, necessitating effective approaches to protect deep learning models against malicious attacks. Motivated by the high stealthiness and potential harm of backdoor attacks, a series of backdoor defense methods for DNNs have been proposed. However, most existing approaches require access to clean training data, hindering their practical use. Additionally, state-of-the-art (SOTA) solutions cannot simultaneously enhance model robustness and compactness in a data-free manner, which is crucial in resource-constrained applications. To address these challenges, in this paper, we propose Clean & Compact (C&C), an efficient data-free backdoor defense mechanism that can bring both purification and compactness to the original infected DNNs. Built upon the intriguing rank-level sensitivity to trigger patterns, C&C co-explores and achieves high model cleanliness and efficiency without the need for training data, making this solution very attractive in many real-world, resource-limited scenarios. Extensive evaluations across different settings consistently demonstrate that our proposed approach outperforms SOTA backdoor defense methods.
Spiking Wavelet Transformer
Yuetong Fang · Ziqing Wang · Lingfeng Zhang · Jiahang Cao · Honglei Chen · Renjing Xu
Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning by mimicking the event-driven processing of the brain. Incorporating Transformers into SNNs has shown promise for accuracy, yet such models struggle to capture high-frequency patterns such as moving edges and pixel-level brightness changes due to their reliance on global self-attention operations. Porting frequency representations into SNNs is challenging yet crucial for event-driven vision. To address this issue, we propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner by leveraging the sparse wavelet transform. The critical component is a Frequency-aware Spiking Token Mixer (FSTM) with three branches: 1) a spiking wavelet learner for spatial-frequency domain learning, 2) a convolution-based learner for spatial feature extraction, and 3) spiking pointwise convolution for cross-channel information aggregation. We also adopt negative spike dynamics to strengthen the frequency representation further. This enables the SWformer to outperform vanilla spiking Transformers in capturing high-frequency visual components, as evidenced by our empirical results. Experiments on both static and neuromorphic datasets demonstrate SWformer's effectiveness in capturing spatial-frequency patterns in a multiplication-free, event-driven fashion, outperforming state-of-the-art SNNs. SWformer achieves an over 50% reduction in energy consumption, a 21.1% reduction in parameter count, and a 2.40% performance improvement on the ImageNet dataset compared to vanilla Spiking Transformers.
PFGS: High Fidelity Point Cloud Rendering via Feature Splatting
Jiaxu Wang · Zhang Ziyi · Junhao He · Renjing Xu
Rendering high-fidelity images from sparse point clouds is still challenging. Existing learning-based approaches suffer from either hole artifacts, missing details, or expensive computations. In this paper, we propose a novel framework to render high-quality images from sparse points. The method first attempts to bridge 3D Gaussian Splatting and point cloud rendering, and it comprises several cascaded modules. We first use a regressor to estimate Gaussian properties in a point-wise manner; the estimated properties are then used to rasterize neural feature descriptors, extracted by a multiscale extractor, into 2D planes. The projected feature volume is gradually decoded toward the final prediction via a multiscale and progressive decoder. The whole pipeline is trained in two stages and is driven by our well-designed progressive and multiscale reconstruction loss. Experiments on different benchmarks show the superiority of our method in terms of rendering quality and the necessity of our main components.