ECCV 2024 Schedule

SUN 29 SEP

8 a.m.

Registration

(ends 6:00 PM)

9 a.m.

Workshop:

Recovering 6D Object Pose

(ends 1:00 PM)

Workshop:

Workshop on Spatial AI

(ends 1:00 PM)

Workshop:

3D Vision and Modeling Challenges in eCommerce

(ends 1:00 PM)

Workshop:

9th Workshop on Computer Vision in Plant Phenotyping and Agriculture (CVPPA)

(ends 1:00 PM)

Workshop:

3rd edition of Computer Vision for Metaverse (CV4Metaverse)

(ends 1:00 PM)

Workshop:

AI for Visual Arts Workshop and Challenges (AI4VA)

(ends 1:00 PM)

Workshop:

Visual object tracking and segmentation challenge VOTS2024 workshop

(ends 1:00 PM)

Workshop:

ACVR2024 - 12th International Workshop on Assistive Computer Vision and Robotics

(ends 1:00 PM)

Workshop:

Workshop on Artificial Social Intelligence

(ends 1:00 PM)

Workshop:

The First Workshop on Expressive Encounters: Co-speech gestures across cultures in the wild

(ends 1:00 PM)

Workshop:

BioImage Computing (BIC)

(ends 1:00 PM)

Workshop:

Self-Supervised Learning - What is next?

(ends 1:00 PM)

Workshop:

Beyond Euclidean: Hyperbolic and Hyperspherical Learning for Computer Vision

(ends 1:00 PM)

Workshop:

The Second Perception Test Challenge

(ends 1:00 PM)

Workshop:

Fairness and ethics towards transparent AI: facing the chalLEnge through model Debiasing (FAILED)

(ends 1:00 PM)

Workshop:

2nd International Workshop on Privacy-Preserving Computer Vision

(ends 1:00 PM)

Workshop:

Critical Evaluation of Generative Models and their Impact on Society

(ends 1:00 PM)

Workshop:

Scalable 3D Scene Generation and 3D Geometric Scene Understanding

(ends 1:00 PM)

Workshop:

Eyes of the Future: Integrating Computer Vision in Smart Eyewear

(ends 1:00 PM)

Tutorial:

Large Multimodal Foundation Models

(ends 1:00 PM)

Tutorial:

Efficient Text-to-Image and Text-to-3D modeling

(ends 1:00 PM)

10:30 a.m.

Break:

Coffee Break

(ends 11:00 AM)

1 p.m.

Break:

Lunch

(ends 2:00 PM)

2 p.m.

Workshop:

Half-century of Structure-from-Motion (50SfM)

(ends 6:00 PM)

Workshop:

Transparent & Reflective objects In the wild Challenges (TRICKY)

(ends 6:00 PM)

Workshop:

The First Workshop on: Computer Vision for Videogames (CV2)

(ends 6:00 PM)

Workshop:

AI4DH: Artificial Intelligence for Digital Humanities

(ends 6:00 PM)

Workshop:

The Third ROAD Workshop & Challenge: Event Detection for Situation Awareness in Autonomous Driving

(ends 6:00 PM)

Workshop:

Autonomous Vehicles meet Multimodal Foundation Models

(ends 6:00 PM)

Workshop:

5th Advances in Image Manipulation (AIM) Workshop and Challenges

(ends 6:00 PM)

Workshop:

Efficient Deep Learning for Foundation Models

(ends 6:00 PM)

Workshop:

T-CAP - Towards a Complete Analysis of People: Fine-grained Understanding for Real-World Applications

(ends 6:00 PM)

Workshop:

Human-inspired Computer Vision

(ends 6:00 PM)

Workshop:

Traditional Computer Vision in the Age of Deep Learning (TradiCV)

(ends 6:00 PM)

Workshop:

Workshop on Unlearning and Model Editing (U&ME'24)

(ends 6:00 PM)

Workshop:

2nd Workshop on Quantum Computer Vision and Machine Learning (QCVML)

(ends 6:00 PM)

Workshop:

AVGenL: Audio-Visual Generation and Learning

(ends 6:00 PM)

Workshop:

2nd OmniLabel Workshop: Enabling Complex Perception Through Vision and Language Foundational Models

(ends 6:00 PM)

Workshop:

The Dark Side of Generative AIs and Beyond

(ends 6:00 PM)

Workshop:

Explainable AI for Computer Vision: Where Are We and Where Are We Going?

(ends 6:00 PM)

Workshop:

OpenSUN3D: 3rd Workshop on Open-Vocabulary 3D Scene Understanding

(ends 6:00 PM)

Workshop:

Workshop on Neuromorphic Vision (NeVi): Advantages and Applications of Event Cameras

(ends 6:00 PM)

Tutorial:

Third Hands-on Egocentric Research Tutorial with Project Aria, from Meta

(ends 6:00 PM)

Tutorial:

Responsibly Building Generative Models

(ends 6:00 PM)

3:30 p.m.

Break:

Coffee Break

(ends 4:00 PM)

MON 30 SEP

8 a.m.

Registration

(ends 6:00 PM)

9 a.m.

Workshop:

Dense Neural SLAM Workshop (NeuSLAM)

(ends 1:00 PM)

Workshop:

Wild3D: 3D Modeling, Reconstruction, and Generation in the Wild

(ends 1:00 PM)

Workshop:

CV For Ecology Workshop (CV4E)

(ends 1:00 PM)

Workshop:

2nd Workshop on Vision-based Industrial Inspection (VISION)

(ends 1:00 PM)

Workshop:

Vision for Art (VISART) VII Workshop

(ends 1:00 PM)

Workshop:

Vision-Centric Autonomous Driving (VCAD) Workshop

(ends 1:00 PM)

Workshop:

Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving: Towards Next-Generation Solutions

(ends 1:00 PM)

Workshop:

Instance-Level Recognition

(ends 1:00 PM)

Workshop:

Computational Aspects of Deep Learning

(ends 1:00 PM)

Workshop:

Uncertainty Quantification for Computer Vision

(ends 1:00 PM)

Workshop:

The 3rd Workshop for Out-of-Distribution Generalization in Computer Vision Foundation Models

(ends 1:00 PM)

Workshop:

2nd Workshop on More Exploration, Less Exploitation (MELEX)

(ends 1:00 PM)

Workshop:

TWYN: Trust What You learN. 1st Workshop on Trustworthiness in Computer Vision

(ends 1:00 PM)

Workshop:

xAI4Biometrics at ECCV 2024 - 4th Workshop on Explainable & Interpretable Artificial Intelligence for Biometrics

(ends 1:00 PM)

Workshop:

Map-free Visual Relocalization

(ends 1:00 PM)

Workshop:

1st Workshop on Neural Fields Beyond Conventional Cameras

(ends 1:00 PM)

Tutorial:

A Bayesian Odyssey in Uncertainty: from Theoretical Foundations to Real-World Applications

(ends 1:00 PM)

Tutorial:

Emerging Trends in Disentanglement and Compositionality

(ends 1:00 PM)

Tutorial:

Recent Advances in Video Content Understanding and Generation

(ends 6:00 PM)

Tutorial:

Time is precious: Self-Supervised Learning Beyond Images

(ends 1:00 PM)

10:30 a.m.

Break:

Coffee Break

(ends 11:00 AM)

1 p.m.

Break:

Lunch

(ends 2:00 PM)

2 p.m.

Workshop:

Geometry in the Large Model Era

(ends 6:00 PM)

Workshop:

AI3DCC: The Second Workshop of AI for 3D Content Creation

(ends 6:00 PM)

Workshop:

FashionAI: Exploring the intersection of Fashion and Artificial Intelligence for reshaping the Industry

(ends 6:00 PM)

Workshop:

ROAM: Robust, Out-of-Distribution And Multi-Modal models for Autonomous Driving

(ends 6:00 PM)

Workshop:

Multi-Agent Autonomous Systems Meet Foundation Models: Challenges and Futures

(ends 6:00 PM)

Workshop:

Large-scale Video Object Segmentation

(ends 6:00 PM)

Workshop:

Foundation Models for 3D Humans

(ends 6:00 PM)

Workshop:

Observing and Understanding Hands in Action

(ends 6:00 PM)

Workshop:

7th Workshop and Competition on Affective Behavior Analysis in-the-wild

(ends 6:00 PM)

Workshop:

Knowledge in Generative Models

(ends 6:00 PM)

Workshop:

Emergent Visual Abilities and Limits of Foundation Models (EVAL-FoMo)

(ends 6:00 PM)

Workshop:

Workshop on Visual Concepts

(ends 6:00 PM)

Workshop:

Sometimes Less is More: The First Dataset Distillation Challenge

(ends 6:00 PM)

Workshop:

Synthetic Data for Computer Vision

(ends 6:00 PM)

Workshop:

Multimodal Agents Workshop

(ends 6:00 PM)

Workshop:

FOundation models Creators meet USers (FOCUS)

(ends 6:00 PM)

Workshop:

Women in Computer Vision

(ends 6:00 PM)

Workshop:

Workshop on Green Foundation Models

(ends 6:00 PM)

Workshop:

GigaVision: When Gigapixel Videography Meets Computer Vision

(ends 6:00 PM)

Tutorial:

Inside Plato's door: a tour in Multi-view Geometry

(ends 6:00 PM)

3:30 p.m.

Break:

Coffee Break

(ends 4:00 PM)

TUE 1 OCT

7 a.m.

Registration

(ends 6:30 PM)

8 a.m.

Opening Ceremony [8:00-9:00]

(ends 9:00 AM)

9 a.m.

Oral 1A: Scene Analysis And Understanding [9:00-10:30]

Orals 9:00-10:20

[9:00] Towards Scene Graph Anticipation

[9:10] OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation

[9:20] PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

[9:30] Bi-directional Contextual Attention for 3D Dense Captioning

[9:40] OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

[9:50] ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting

[10:00] A Fair Ranking and New Model for Panoptic Scene Graph Generation

[10:10] Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

(ends 10:30 AM)

Oral 1B: Autonomous Driving [9:00-10:30]

Orals 9:00-10:20

[9:00] Making Large Language Models Better Planners with Reasoning-Decision Alignment

[9:10] MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping

[9:20] M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation

[9:30] H-V2X: A Large Scale Highway Dataset for BEV Perception

[9:40] Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction

[9:50] DriveLM: Driving with Graph Visual Question Answering

[10:00] RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios

[10:10] Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks

(ends 10:30 AM)

Oral 1C: Low-Level Vision And Imaging [9:00-10:30]

Orals 9:00-10:20

[9:00] Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection

[9:10] Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging

[9:20] SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow

[9:30] Photon Inhibition for Energy-Efficient Single-Photon Imaging

[9:40] Minimalist Vision with Freeform Pixels

[9:50] Flying with Photons: Rendering Novel Views of Propagating Light

[10:00] A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging

[10:10] GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

(ends 10:30 AM)

Demo Session 1A [9:00-12:30]

Demonstrations 9:00-12:30

Controllable Face Synthesis with Semantic Latent Diffusion Models

Dynaphos: A VR demo of biologically-plausible simulated phosphene vision for visual cortical prostheses

EmoVOCA: Speech-Driven Emotional 3D Talking Heads

OPT-IQA: Automated Camera Parameters Tuning Framework with IQA-guided Optimization

Transforming Retail with Shopic's Vision & AI-Powered Smart Cart

(ends 12:30 PM)

10:30 a.m.

Poster Session 1 [10:30-12:30]

Posters 10:30-12:30

Bi-directional Contextual Attention for 3D Dense Captioning

Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting

Towards Scene Graph Anticipation

OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation

PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

H-V2X: A Large Scale Highway Dataset for BEV Perception

RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios

DriveLM: Driving with Graph Visual Question Answering

Making Large Language Models Better Planners with Reasoning-Decision Alignment

M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation

MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping

Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction

A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging

Photon Inhibition for Energy-Efficient Single-Photon Imaging

Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging

Minimalist Vision with Freeform Pixels

SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow

Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection

OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture

HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting

MagicMirror: Fast and High-Quality Avatar Generation with Constrained Search Space

Personalized Video Relighting With an At-Home Light Stage

Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations

Panel-Specific Degradation Representation for Raw Under-Display Camera Image Restoration

HoloADMM: High-Quality Holographic Complex Field Recovery

Flying with Photons: Rendering Novel Views of Propagating Light

Efficient Depth-Guided Urban View Synthesis

Ray-Distance Volume Rendering for Neural Scene Reconstruction

Taming Latent Diffusion Model for Neural Radiance Field Inpainting

Learning Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors

GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer

MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References

UNIKD: UNcertainty-Filtered Incremental Knowledge Distillation for Neural Implicit Representation

Rethinking Directional Parameterization in Neural Implicit Surface Reconstruction

Sur^2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

Differentiable Convex Polyhedra Optimization from Multi-view Images

Combining Generative and Geometry Priors for Wide-Angle Portrait Correction

I2-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM

Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops

MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians

GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting

FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally

PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis

MegaScenes: Scene-Level View Synthesis at Scale

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

View-Consistent 3D Editing with Gaussian Splatting

Compress3D: a Compressed Latent Space for 3D Generation from a Single Image

Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis

3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views

Nuvo: Neural UV Mapping for Unruly 3D Representations

Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors

BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control

DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

TPA3D: Triplane Attention for Fast Text-to-3D Generation

DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement

WordRobe: Text-Guided Generation of Textured 3D Garments

AnyHome: Open-Vocabulary Large-Scale Indoor Scene Generation with First-Person View Exploration

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

SENC: Handling Self-collision in Neural Cloth Simulation

AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation

SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Diffusion Models as Data Mining Tools

ReMatching: Low-Resolution Representations for Scalable Shape Correspondence

PolyRoom: Room-aware Transformer for Floorplan Reconstruction

WindPoly: Polygonal Mesh Reconstruction via Winding Numbers

Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack

Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion

Diffusion Bridges for 3D Point Cloud Denoising

Towards a Density Preserving Objective Function for Learning on Point Sets

Syn-to-Real Domain Adaptation for Point Cloud Completion via Part-based Approach

T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning

Text2LiDAR: Text-guided LiDAR Point Clouds Generation via Equirectangular Transformer

DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields

Computing the Lipschitz constant needed for fast scene recovery from CASSI measurements

Regularizing Dynamic Radiance Fields with Kinematic Fields

GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation

iMatching: Imperative Correspondence Learning

Fundamental Matrix Estimation Using Relative Depths

Track Everything Everywhere Fast and Robustly

Learning to Make Keypoints Sub-Pixel Accurate

Shape-guided Configuration-aware Learning for Endoscopic-image-based Pose Estimation of Flexible Robotic Instruments

FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models

Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation

Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images

GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation

D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction

Event-based Head Pose Estimation: Benchmark and Method

Parameterized Quasi-Physical Simulators for Dexterous Manipulations Transfer

RAW-Adapter: Adapting Pretrained Visual Model to Camera RAW Images

Easing 3D Pattern Reasoning with Side-view Features for Semantic Scene Completion

Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions

GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth

Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry

Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

CountFormer: Multi-View Crowd Counting Transformer

When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset

MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation

4D Contrastive Superflows are Dense 3D Representation Learners

TCC-Det: Temporarily consistent cues for weakly-supervised 3D detection

CARB-Net: Camera-Assisted Radar-Based Network for Vulnerable Road User Detection

SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving

RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception

TrafficNight: An Aerial Multimodal Benchmark For Nighttime Vehicle Surveillance

RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes

Monocular Occupancy Prediction for Scalable Indoor Scenes

nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding

Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks

CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting

Neural Volumetric World Models for Autonomous Driving

Progressive Pretext Task Learning for Human Trajectory Prediction

Risk-Aware Self-Consistent Imitation Learning for Trajectory Planning in Autonomous Driving

Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries

Towards Dual Transparent Liquid Level Estimation in Biomedical Lab: Dataset, Methods and Practice

TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Temporally Consistent Stereo Matching

Retrieval Robust to Object Motion Blur

Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions

CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring

Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework

Diffusion Reward: Learning Rewards via Conditional Video Diffusion

HUMOS: Human Motion Model Conditioned on Body Shape

PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-based Motion Capture

Large Motion Model for Unified Multi-Modal Motion Generation

Realistic Human Motion Generation with Cross-Diffusion Models

Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions

Generating Human Interaction Motions in Scenes with Text Control

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control

MoVideo: Motion-Aware Video Generation with Diffusion Models

FreeInit: Bridging Initialization Gap in Video Diffusion Models

DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

ReNoise: Real Image Inversion Through Iterative Noising

Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting

One-Shot Diffusion Mimicker for Handwritten Text Generation

Investigating Style Similarity in Diffusion Models

DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation

PartCraft: Crafting Creative Objects by Parts

DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators

WAS: Dataset and Methods for Artistic Text Segmentation

GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections

PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm

Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks

Do text-free diffusion models learn discriminative visual representations?

DataDream: Few-shot Guided Dataset Generation

DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation

ZeST: Zero-Shot Material Transfer from a Single Image

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

Learning Equilibrium Transformation for Gamut Expansion and Color Restoration

Timestep-Aware Correction for Quantized Diffusion Models

Inf-DiT: Upsampling any-resolution image with memory-efficient diffusion transformer

Energy-Calibrated VAE with Test Time Free Lunch

Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Prompt-Based Test-Time Real Image Dehazing: A Novel Pipeline

Asymmetric Mask Scheme for Self-Supervised Real Image Denoising

GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity

Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

A New Dataset and Framework for Real-World Blurred Images Super-Resolution

Blind image deblurring with noise-robust kernel estimation

SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution

MambaIR: A Simple Baseline for Image Restoration with State-Space Model

BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering

Towards Robust Full Low-bit Quantization of Super Resolution Networks

Solving the inverse problem of microscopy deconvolution with a residual Beylkin-Coifman-Rokhlin neural network

SAH-SCI: Self-Supervised Adapter for Efficient Hyperspectral Snapshot Compressive Imaging

Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling

DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays

BaSIC: BayesNet Structure Learning for Computational Scalable Neural Image Compression

SNeRV: Spectra-preserving Neural Representation for Video

Multiscale Graph Texture Network

DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration

Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors

Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection

AdversariaLeak: External Information Leakage Attack Using Adversarial Samples on Face Recognition Systems

Continual Learning for Remote Physiological Measurement: Minimize Forgetting and Simplify Inference

NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation

Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning

Large-Scale Multi-Hypotheses Cell Tracking Using Ultrametric Contours Maps

SemTrack: A Large-scale Dataset for Semantic Tracking in the Wild

DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Towards More Practical Group Activity Detection: A New Benchmark and Model

Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

Online Temporal Action Localization with Memory-Augmented Transformer

EgoLifter: Open-world 3D Segmentation for Egocentric Perception

MeshSegmenter: Zero-Shot Mesh Segmentation via Texture Synthesis

Spatial-Temporal Multi-level Association for Video Object Segmentation

Gated Temporal Diffusion for Stochastic Long-term Dense Anticipation

ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment

VideoMamba: State Space Model for Efficient Video Understanding

Text-Conditioned Resampler For Long Form Video Understanding

SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Vamos: Versatile Action Models for Video Understanding

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

Multi-Sentence Grounding for Long-term Instructional Video

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

CPM: Class-conditional Prompting Machine for Audio-visual Segmentation

SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

CityGuessr: City-Level Video Geo-Localization on a Global Scale

WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification

Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

LingoQA: Video Question Answering for Autonomous Driving

Dolphins: Multimodal Language Model for Driving

PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation

LLM as Copilot for Coarse-grained Vision-and-Language Navigation

Visual Grounding for Object-Level Generalization in Reinforcement Learning

m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Recursive Visual Programming

Any2Point: Empowering Any-modality Transformers for Efficient 3D Understanding

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

ViG-Bias: Visually Grounded Bias Discovery and Mitigation

GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator

Adversarial Prompt Tuning for Vision-Language Models

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Synergy of Sight and Semantics: Visual Intention Understanding with CLIP

FlexAttention for Efficient High-Resolution Vision-Language Models

VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning

GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery

Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers

Trackastra: Transformer-based cell tracking for live-cell microscopy

Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking

Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Object Appearance Graphs

E3V-K5: An Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation

Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion

Occluded Gait Recognition with Mixture of Experts: An Action Detection Perspective

Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization

Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras

Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation

X-Pose: Detecting Any Keypoints

Open-Set Recognition in the Age of Vision-Language Models

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

A Fair Ranking and New Model for Panoptic Scene Graph Generation

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance

A Simple Background Augmentation Method for Object Detection with Diffusion Model

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Agent Attention: On the Integration of Softmax and Linear Attention

WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting

Agglomerative Token Clustering

Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation

3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images

Open-Vocabulary RGB-Thermal Semantic Segmentation

PartSTAD: 2D-to-3D Part Segmentation Task Adaptation

Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions

Progressive Proxy Anchor Propagation for Unsupervised Semantic Segmentation

Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation

Evaluating the Adversarial Robustness of Semantic Segmentation: Trying Harder Pays Off

Pseudo-Embedding for Generalized Few-Shot Point Cloud Segmentation

Self-supervised co-salient object detection via feature correspondences at multiple scales

Unsupervised Dense Prediction using Differentiable Normalized Cuts

Robust Zero-Shot Crowd Counting and Localization with Adaptive Resolution SAM

Bayesian Detector Combination for Object Detection with Crowdsourced Annotations

Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection

Bucketed Ranking-based Losses for Efficient Training of Object Detectors

Better Regression Makes Better Test-time Adaptive 3D Object Detection

MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection

IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection

Semi-supervised Segmentation of Histopathology Images with Noise-Aware Topological Consistency

The Devil is in the Statistics: Mitigating and Exploiting Statistics Difference for Generalizable Semi-supervised Medical Image Segmentation

A Rotation-invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images

Multistain Pretraining for Slide Representation Learning in Pathology

Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data

HERGen: Elevating Radiology Report Generation with Longitudinal Data

Defect Spectrum: A Granular Look of Large-scale Defect Datasets with Rich Semantics

Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis

AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection

A Unified Image Compression Method for Human Perception and Multiple Vision Tasks

FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion

Quantization-Friendly Winograd Transformations for Convolutional Neural Networks

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Stripe Observation Guided Inference Cost-free Attention Mechanism

NOVUM: Neural Object Volumes for Robust Object Classification

POA: Pre-training Once for Models of All Sizes

Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit Networks

Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization

Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning

MultiDelete for Multimodal Machine Unlearning

Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing

Multi-Label Cluster Discrimination for Visual Representation Learning

Robustness Preserving Fine-tuning using Neuron Importance

Online Zero-Shot Classification with CLIP

Understanding Multi-compositional learning in Vision and Language models via Category Theory

This Probably Looks Exactly Like That: An Invertible Prototypical Network

Rethinking Unsupervised Outlier Detection via Multiple Thresholding

Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection

Multimodal Label Relevance Ranking via Reinforcement Learning

Confidence Self-Calibration for Multi-Label Class-Incremental Learning

MTaDCS: Moving Trace and Feature Density-based Confidence Sample Selection under Label Noise

Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation

Online Continuous Generalized Category Discovery

Open-set Domain Adaptation via Joint Error based Multi-class Positive and Unlabeled Learning

UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework

Rethinking Few-shot Class-incremental Learning: Learning from Yourself

Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning

Semantic Residual Prompts for Continual Learning

Encapsulating Knowledge in One Prompt

Representation Enhancement-Stabilization: Reducing Bias-Variance of Domain Generalization

Good Teachers Explain: Explanation-Enhanced Knowledge Distillation

PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation

Dataset Distillation by Automatic Training Trajectories

Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction

Graph Neural Network Causal Explanation via Neural Causal Models

Optimization-based Uncertainty Attribution Via Learning Informative Perturbations

Generalizable Symbolic Optimizer Learning

CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction

Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation

Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense

SSL-Cleanse: Trojan Detection and Mitigation in Self-Supervised Learning

How Video Meetings Change Your Expression

MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks

(ends 12:30 PM)

Break:

Coffee Break

(ends 11:00 AM)

noon

Mentorship:

Doctoral Consortium

(ends 2:00 PM)

12:30 p.m.

Lunch:

Lunch

(ends 1:30 PM)

1:30 p.m.

Oral 2A: Generative Models I [1:30-3:30]

Orals 1:30-3:20

[1:30] EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

[1:40] TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation

[1:50] LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

[2:00] FlashTex: Fast Relightable Mesh Texturing with LightControlNet

[2:10] TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

[2:20] LLMGA: Multimodal Large Language Model based Generation Assistant

[2:30] Accelerating Image Generation with Sub-path Linear Approximation Model

[2:40] SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation

[2:50] Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture

[3:00] Zero-Shot Detection of AI-Generated Images

[3:10] Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

(ends 3:30 PM)

Oral 2B: Recognition [1:30-3:30]

Orals 1:30-3:20

[1:30] Efficient Bias Mitigation Without Privileged Information

[1:40] Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

[1:50] MobileNetV4: Universal Models for the Mobile Ecosystem

[2:00] Momentum Auxiliary Network for Supervised Local Learning

[2:10] From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

[2:20] Dataset Enhancement with Instance-Level Augmentations

[2:30] Adaptive Parametric Activation

[2:40] Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

[2:50] Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

[3:00] CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

[3:10] On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines

(ends 3:30 PM)

Oral 2C: Multi-View And Visual Odometry [1:30-3:30]

Orals 1:30-3:20

[1:30] Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition

[1:40] COMO: Compact Mapping and Odometry

[1:50] Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss

[2:00] ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation

[2:10] SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

[2:20] Six-Point Method for Multi-Camera Systems with Reduced Solution Space

[2:30] Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

[2:40] Grounding Image Matching in 3D with MASt3R

[2:50] ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images

[3:00] Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection

[3:10] Camera Calibration using a Collimator System

(ends 3:30 PM)

2:30 p.m.

Demo Session 1B [2:30-6:00]

Demonstrations 2:30-6:00

AR Deployment and Scene Modelling on Your Phone

A Tool for Collecting Spatio-temporally Sparse Point Annotations for Video Object Segmentation

Leveraging Computer Vision on the Ski Slopes

OpenCity: Open-Vocabulary Attribution of 3D Buildings in City-Scale Photogrammetric Meshes

Visual Place Recognition using 3D City Models

(ends 6:00 PM)

3:30 p.m.

Keynote:

Synthesia: From computer vision research to real-world AI avatars

Lourdes Agapito · Vittorio Ferrari

(ends 4:30 PM)

4:30 p.m.

Break:

Coffee Break

(ends 5:00 PM)

Poster Session 2 [4:30-6:30]

Posters 4:30-6:30

Zero-Shot Detection of AI-Generated Images

MobileNetV4: Universal Models for the Mobile Ecosystem

Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

Adaptive Parametric Activation

CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

Dataset Enhancement with Instance-Level Augmentations

Efficient Bias Mitigation Without Privileged Information

On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines

Momentum Auxiliary Network for Supervised Local Learning

From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images

ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation

COMO: Compact Mapping and Odometry

Camera Calibration using a Collimator System

Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection

Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition

SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss

Six-Point Method for Multi-Camera Systems with Reduced Solution Space

Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

Grounding Image Matching in 3D with MASt3R

EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

Accelerating Image Generation with Sub-path Linear Approximation Model

SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

LLMGA: Multimodal Large Language Model based Generation Assistant

FlashTex: Fast Relightable Mesh Texturing with LightControlNet

Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture

TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

3D Gaussian Parametric Head Model

Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations

COMPOSE: Comprehensive Portrait Shadow Editing

GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval

Optimizing Illuminant Estimation in Dual-Exposure HDR Imaging

Holodepth: Programmable Depth-Varying Projection via Computer-Generated Holography

BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream

VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors

G3R: Gradient Guided Generalizable Reconstruction

Efficient NeRF Optimization - Not All Samples Remain Equally Hard

BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling

SlotLifter: Slot-guided Feature Lifting for Learning Object-Centric Radiance Fields

RS-NeRF: Neural Radiance Fields from Rolling Shutter Images

Geometry Fidelity for Spherical Images

CPT-VR: Improving Surface Rendering via Closest Point Transform with View-Reflection Appearance

MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis

GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time

Neural graphics texture compression supporting random access

GS2Mesh: Surface Reconstruction from Gaussian Splatting via Novel Stereo Views

A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis

Click-Gaussian: Interactive Segmentation to Any 3D Gaussians

McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction

latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction

Non-parametric Sensor Noise Modeling and Synthesis

UpFusion: Novel View Diffusion from Unposed Sparse View Observations

MVDD: Multi-View Depth Diffusion Models

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Hypernetworks for Generalizable BRDF Representation

High-Fidelity 3D Textured Shapes Generation by Sparse Encoding and Adversarial Decoding

Structured-NeRF: Hierarchical Scene Graph with Neural Representation

3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing

Free-Editor: Zero-shot Text-driven 3D Scene Editing

Texture-GS: Disentangle the Geometry and Texture for 3D Gaussian Splatting Editing

VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation

ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

SceneTeller: Language-to-3D Scene Generation

Text to Layer-wise 3D Clothed Human Generation

ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model

D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

Temporal Residual Jacobians for Rig-free Motion Transfer

PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation

GroundUp: Rapid Sketch-Based 3D City Massing

DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching

FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation

PointNeRF++: A multi-scale, point-based Neural Radiance Field

Continuous SO(3) Equivariant Convolution for 3D Point Cloud Analysis

UMERegRobust – Universal Manifold Embedding Compatible Features for Robust Point Cloud Registration

FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation

Learning to Adapt SAM for Segmenting Cross-domain Point Clouds

Osmosis: RGBD Diffusion Prior for Underwater Image Restoration

Differentiable Product Quantization for Memory Efficient Camera Relocalization

RING-NeRF: Rethinking Inductive Biases for Versatile and Efficient Neural Fields

Light-in-Flight for a World-in-Motion

Binomial Self-compensation for Motion Error in Dynamic 3D Scanning

Non-Line-of-Sight Estimation of Fast Human Motion with Slow Scanning Imagers

Synchronization of Projective Transformations

Semicalibrated Relative Pose from an Affine Correspondence and Monodepth

GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring

LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System

SRPose: Two-view Relative Pose Estimation with Sparse Keypoints

Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences

U-COPE: Taking a Further Step to Universal 9D Category-level Object Pose Estimation

EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Cut out the Middleman: Revisiting Pose-based Gait Recognition

Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

3D Hand Sequence Recovery from Real Blurry Images and Event Stream

Dense Hand-Object (HO) GraspNet with Full Grasping Taxonomy and Dynamics

Learning Cross-hand Policies of High-DOF Reaching and Grasping

Free-Viewpoint Video of Outdoor Sports Using a Drone

Unsupervised Exposure Correction

Improving Domain Generalization in Self-Supervised Monocular Depth Estimation via Stabilized Adversarial Training

Deep Cost Ray Fusion for Sparse Depth Video Completion

PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor

UniCal: Unified Neural Sensor Calibration

Multi-modal Crowd Counting via a Broker Modality

OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain

SparseRadNet: Sparse Perception Neural Network on Subsampled Radar Data

UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

DeTra: A Unified Model for Object Detection and Trajectory Forecasting

RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception

Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting

PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

InsMapper: Exploring Inner-instance Information for Vectorized HD Mapping

PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors

Unified Local-Cloud Decision-Making via Reinforcement Learning

Generative End-to-End Autonomous Driving

MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow

Decomposition Betters Tracking Everything Everywhere

Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching

Efficient Learning of Event-based Dense Representation using Hierarchical Memories with Adaptive Update

Towards Real-world Event-guided Low-light Video Enhancement and Deblurring

Understanding Physical Dynamics with Counterfactual World Modeling

Prompting Future Driven Diffusion Model for Hand Motion Prediction

Nymeria: A Massive Collection of Egocentric Multi-modal Human Motion in the Wild

Motion Mamba: Efficient and Long Sequence Motion Generation

TLControl: Trajectory and Language Control for Human Motion Synthesis

ParCo: Part-Coordinating Text-to-Motion Synthesis

BAMM: Bidirectional Autoregressive Motion Model

Pose Guided Fine-Grained Sign Language Video Generation

DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

Animate Your Motion: Turning Still Images into Dynamic Videos

V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation

DragVideo: Interactive Drag-style Video Editing

StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing

Lazy Diffusion Transformer for Interactive Image Editing

WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians

Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis

Commonly Interesting Images

InstructGIE: Towards Generalizable Image Editing

The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization

CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models

Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Customized Generation Reimagined: Fidelity and Editability Harmonized

ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement

ViPer: Visual Personalization of Generative Models via Individual Preference Learning

MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation

Towards Reliable Advertising Image Generation Using Human Feedback

IMMA: Immunizing text-to-image Models against Malicious Adaptation

PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control

AddMe: Zero-shot Group-photo Synthesis by Inserting People into Scenes

UniProcessor: A Text-induced Unified Low-level Image Processor

Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models

EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models

Assessing Sample Quality via the Latent Space of Generative Models

Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Efficient Training with Denoised Neural Weights

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing

DSMix: Distortion-Induced Saliency Map Based Pre-training for No-Reference Image Quality Assessment

DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior

Restoring Images in Adverse Weather Conditions via Histogram Transformer

You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation

Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution

Efficient Cascaded Multiscale Adaptive Network for Image Restoration

Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Taming Lookup Tables for Efficient Image Retouching

Quanta Video Restoration

Two-Stage Video Shadow Detection via Temporal-Spatial Adaption

Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework

Identity-Consistent Diffusion Network for Grading Knee Osteoarthritis Progression in Radiographic Imaging

NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration

Neural Metamorphosis

Online Video Quality Enhancement with Spatial-Temporal Look-up Tables

EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural Networks

LaWa: Using Latent Space for In-Generation Image Watermarking

PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments

Delving into Adversarial Robustness on Document Tampering Localization

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Forbes: Face Obfuscation Rendering via Backpropagation Refinement Scheme

Prediction Exposes Your Face: Black-box Model Inversion via Prediction Alignment

Generalizable Facial Expression Recognition

Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding

MinD-3D: Reconstruct High-quality 3D objects in Human Brain

Pathformer3D: A 3D Scanpath Transformer for 360° Images

Eliminating Warping Shakes for Unsupervised Online Video Stitching

OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework

Semantically Guided Representation Learning For Action Anticipation

SIGMA: Sinkhorn-Guided Masked Video Modeling

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

RICA^2: Rubric-Informed, Calibrated Assessment of Actions

VideoStudio: Generating Consistent-Content and Multi-Scene Videos

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

EA-VTR: Event-Aware Video-Text Retrieval

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

FunQA: Towards Surprising Video Comprehension

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Efficient Pre-training for Localized Instruction Generation of Procedural Videos

Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Visual Alignment Pre-training for Sign Language Translation

Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Spectral Subsurface Scattering for Material Classification

MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

MeshVPR: Citywide Visual Place Recognition Using 3D Meshes

Frontier-enhanced Topological Memory with Improved Exploration Awareness for Embodied Visual Navigation

Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

Controllable Navigation Instruction Generation with Chain of Thought Prompting

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Quality Assured: Rethinking Annotation Strategies in Imaging AI

BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

DEAL: Disentangle and Localize Concept-level Explanations for VLMs

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Instruction Tuning-free Visual Token Complement for Multimodal LLMs

IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models

LookupViT: Compressing visual information to a limited number of tokens

SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models

Integration of Global and Local Representations for Fine-grained Cross-modal Alignment

Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation

MyVLM: Personalizing VLMs for User-Specific Queries

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

View Selection for 3D Captioning via Diffusion Ranking

GRiT: A Generative Region-to-text Transformer for Object Understanding

FreestyleRet: Retrieving Images from Style-Diversified Queries

LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation

OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks

TAG: Text Prompt Augmentation for Zero-Shot Out-of-Distribution Detection

Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation

Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents

Region-centric Image-Language Pretraining for Open-Vocabulary Detection

Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

PSALM: Pixelwise Segmentation with Large Multi-modal Model

Grid-Attention: Enhancing Computational Efficiency of Large Vision Models without Fine-Tuning

OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation

On the Viability of Monocular Depth Pre-training for Semantic Segmentation

Rethinking and Improving Visual Prompt Selection for In-Context Learning Segmentation Framework

Open-Vocabulary Camouflaged Object Segmentation

From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation

3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation

Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation

Mitigating Background Shift in Class-Incremental Semantic Segmentation

LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation

Point-supervised Panoptic Segmentation via Estimating Pseudo Labels from Learnable Distance

Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation

Zero-shot Object Counting with Good Exemplars

SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection

Enhancing Source-Free Domain Adaptive Object Detection with Low-confidence Pseudo Label Distillation

MonoTTA: Fully Test-Time Adaptation for Monocular 3D Object Detection

AugDETR: Improving Multi-scale Learning for Detection Transformer

Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter

DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion

PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation

ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image

Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification

GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation

R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection

Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation

Continuous Memory Representation for Anomaly Detection

Learning Anomalies with Normality Prior for Unsupervised Video Anomaly Detection

Superpixel-informed Implicit Neural Representation for Multi-Dimensional Data

Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector

Fairness-aware Vision Transformer via Debiased Self-Attention

AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer

LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors

Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time

Modality Translation for Object Detection Adaptation without forgetting prior knowledge

Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition

Scaling Backwards: Minimal Synthetic Pre-training?

EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification

Training-Free Model Merging for Multi-target Domain Adaptation

CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

Semantic-guided Robustness Tuning for Few-Shot Transfer Across Extreme Domain Shift

Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts

Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models

FlowCon: Out-of-Distribution Detection using Flow-based Contrastive Learning

PixOOD: Pixel-Level Out-of-Distribution Detection

Distributionally Robust Loss for Long-Tailed Multi-Label Image Classification

Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data

GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition

Generalized Coverage for More Robust Low-Budget Active Learning

Robust Nearest Neighbors for Source-Free Domain Adaptation under Class Distribution Shift

Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery

CroMo-Mixup: Augmenting Cross-Model Representations for Continual Self-Supervised Learning

Disentangling Masked Autoencoders for Unsupervised Domain Generalization

Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

Information Bottleneck Based Data Correction in Continual Learning

Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning

Markov Knowledge Distillation: Make Nasty Teachers trained by Self-undermining Knowledge Distillation Fully Distillable

FedRA: A Random Allocation Strategy for Federated Tuning to Unleash the Power of Heterogeneous Clients

SkyMask: Attack-agnostic Robust Federated Learning with Fine-grained Learnable Masks

SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference

Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap

Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset

Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective

Catastrophic Overfitting: A Potential Blessing in Disguise

Cocktail Universal Adversarial Attack on Deep Neural Networks

Unveiling Privacy Risks in Stochastic Neural Networks Training: Effective Image Reconstruction from Gradients

Rethinking Data Bias: Dataset Copyright Protection via Embedding Class-wise Hidden Bias

CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing

(ends 6:30 PM)

6:30 p.m.

Reception:

Welcome Reception

(ends 7:30 PM)

WED 2 OCT

8 a.m.

Registration

(ends 6:30 PM)

9 a.m.

Demo Session 2A [9:00-12:30]

Demonstrations 9:00-12:30

AI3D Sculpt - Create 3D by rough sculpting 3D to 3D in 3D

Better Call SAL: Segment Anything in Lidar

PROCEDO: A real-time assistant for everyday procedures

Real-time Multi-Person Whole-Body Human Mesh Recovery with Multi-HMR

SLAM with Stereo Event Cameras

(ends 12:30 PM)

Oral 3A: Datasets And Benchmarking [9:00-10:30]

Orals 9:00-10:20

[9:00] PetFace: A Large-Scale Dataset and Benchmark for Animal Identification

[9:10] UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

[9:20] Towards Model-Agnostic Dataset Condensation by Heterogeneous Models

[9:30] Parrot Captions Teach CLIP to Spot Text

[9:40] Towards Open-ended Visual Quality Comparison

[9:50] VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking

[10:00] Insect Identification in the Wild: The AMI Dataset

[10:10] MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

(ends 10:30 AM)

Oral 3B: Medical And Biological Imaging [9:00-10:30]

Orals 9:00-10:20

[9:00] PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

[9:10] Self-Supervised Video Desmoking for Laparoscopic Surgery

[9:20] CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos

[9:30] Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction

[9:40] Adaptive Correspondence Scoring for Unsupervised Medical Image Registration

[9:50] Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View

[10:00] SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images

[10:10] Knowledge-enhanced Visual-Language Pretraining for Computational Pathology

(ends 10:30 AM)

Oral 3C: Point Clouds [9:00-10:30]

Orals 9:00-10:20

[9:00] HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation

[9:10] PointLLM: Empowering Large Language Models to Understand Point Clouds

[9:20] RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation

[9:30] DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

[9:40] KeypointDETR: An End-to-End 3D Keypoint Detector

[9:50] Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather

[10:00] RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation

[10:10] Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration

(ends 10:30 AM)

10:30 a.m.

Break:

Coffee Break

(ends 11:00 AM)

Poster Session 3 [10:30-12:30]

Posters 10:30-12:30

Parrot Captions Teach CLIP to Spot Text

Towards Model-Agnostic Dataset Condensation by Heterogeneous Models

VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking

Insect Identification in the Wild: The AMI Dataset

Towards Open-ended Visual Quality Comparison

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

Adaptive Correspondence Scoring for Unsupervised Medical Image Registration

Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View

Knowledge-enhanced Visual-Language Pretraining for Computational Pathology

SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images

CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

PointLLM: Empowering Large Language Models to Understand Point Clouds

HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation

Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather

RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation

RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation

KeypointDETR: An End-to-End 3D Keypoint Detector

All You Need is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation

TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting

Stable Video Portraits

iHuman: Instant Animatable Digital Humans From Monocular Videos

POCA: Post-training Quantization with Temporal Alignment for Codec Avatars

Towards Image Ambient Lighting Normalization

LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models

Efficient Snapshot Spectral Imaging: Calibration-Free Parallel Structure with Aperture Diffraction Fusion

Physically Plausible Color Correction for Neural Radiance Fields

DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images

Volumetric Rendering with Baked Quadrature Fields

Depth-guided NeRF Training via Earth Mover’s Distance

RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF

Deblurring 3D Gaussian Splatting

Distractor-Free Novel View Synthesis via Exploiting Memorization Effect in Optimization

TriNeRFLet: A Wavelet Based Triplane NeRF Representation

LaRa: Efficient Large-Baseline Radiance Fields

RANRAC: Robust Neural Scene Representations via Random Ray Consensus

SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization

Learning Representations from Foundation Models for Domain Generalized Stereo Matching

CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field

SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction

On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy

Revising Densification in Gaussian Splatting

MesonGS: Post-training Compression of 3D Gaussians via Efficient Attribute Transformation

Topology-Preserving Downsampling of Binary Images

Zero-Shot Multi-Object Scene Completion

PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance

VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction

Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping

COSMU: Complete 3D human shape from monocular unconstrained images

MeshFeat: Multi-Resolution Features for Neural Fields on Meshes

Real-time 3D-aware Portrait Editing from a Single Image

An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes

RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Scene-Conditional 3D Object Stylization and Composition

DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Chains of Diffusion Models

NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation

Learning Neural Deformation Representation for 4D Dynamic Shape Generation

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Towards High-Quality 3D Motion Transfer with Realistic Apparel Animation

GIVT: Generative Infinite-Vocabulary Transformers

Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians

LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

ZigMa: A DiT-style Zigzag Mamba Diffusion Model

Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems

Neural Surface Detection for Unsigned Distance Fields

VF-NeRF: Viewshed Fields for Rigid NeRF Registration

Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration

Transferable 3D Adversarial Shape Completion using Diffusion Models

Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training

Progressive Classifier and Feature Extractor Adaptation for Unsupervised Domain Adaptation on Point Clouds

Domain Generalization of 3D Object Detection by Density-Resampling

Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds

Physics-informed Knowledge Transfer for Underwater Monocular Depth Estimation

Improving 2D Feature Representations by 3D-Aware Fine-Tuning

SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images

3D Congealing: 3D-Aware Image Alignment in the Wild

Reprojection Errors as Prompts for Efficient Scene Coordinate Regression

Revisiting Calibration of Wide-Angle Radially Symmetric Cameras

RGBD GS-ICP SLAM

FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos

GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence

Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation

Rotated Orthographic Projection for Self-Supervised 3D Human Pose Estimation

Diffusion Model is a Good Pose Estimator from 3D RF-Vision

Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding

Coarse-to-Fine Implicit Representation Learning for 3D Hand-Object Reconstruction from a Single RGB-D Image

3D Reconstruction of Objects in Hands without Real World 3D Supervision

Weakly-Supervised 3D Hand Reconstruction with Knowledge Prior and Uncertainty Guidance

MANIKIN: Biomechanically Accurate Neural Inverse Kinematics for Human Motion Estimation

Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection

GraspXL: Generating Grasping Motions for Diverse Objects at Scale

HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos

Object-Aware NIR-to-Visible Translation

SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models

Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion

Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation

Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth

DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection

DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception

LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection

RecurrentBEV: A Long-term Temporal Fusion Framework for Multi-view 3D Detection

JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention

MMVR: Millimeter-wave Multi-View Radar Dataset and Benchmark for Indoor Perception

UAV First-Person Viewers Are Radiance Field Learners

Caltech Aerial RGB-Thermal Dataset in the Wild

V2X-Real: a Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception

CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Revisit Human-Scene Interaction via Space Occupancy

Enhancing Vectorized Map Perception with Historical Rasterized Maps

RoadPainter: Points Are Ideal Navigators for Topology transformER

VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic

Self-Supervised Video Desmoking for Laparoscopic Surgery

BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events

LiDAR-Event Stereo Fusion with Hallucinations

Temporal-Mapping Photography for Event Cameras

Motion Aware Event Representation-driven Image Deblurring

Event-Based Motion Magnification

Bidirectional Progressive Transformer for Interaction Intention Anticipation

Reinforcement Learning via Auxiliary Task Distillation

COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation

EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Human Motion Generation

MotionChain: Conversational Motion Controllers via Multimodal Prompts

M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

SMooDi: Stylized Motion Diffusion Model

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

SAVE: Protagonist Diversification with Structure Agnostic Video Editing

Kinetic Typography Diffusion Model

DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency

StableDrag: Stable Dragging for Point-based Image Editing

Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing

Curved Diffusion: A Generative Model With Optical Geometry Control

Tuning-Free Image Customization with Image and Text Guidance

StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling

DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment

TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling

Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution

MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion

PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models

Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

Distilling Diffusion Models into Conditional GANs

Responsible Visual Editing

HiEI: A Universal Framework for Generating High-quality Emerging Images from Natural Images

MagicEraser: Erasing Any Objects via Semantics-Aware Control

GenQ: Quantization in Low Data Regimes with Generative Synthetic Data

DiffiT: Diffusion Vision Transformers for Image Generation

DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation

∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions

Unmasking Bias in Diffusion Model Training

Compensation Sampling for Improved Convergence in Diffusion Models

Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks

Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint

Dual-Rain: Video Rain Removal using Assertive and Gentle Teachers

A Comparative Study of Image Restoration Networks for General Backbone Network Design

OAPT: Offset-Aware Partition Transformer for Double JPEG Artifacts Removal

Domain-adaptive Video Deblurring via Test-time Blurring

Kernel Diffusion: An Alternate Approach to Blind Deconvolution

Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models

Kalman-Inspired Feature Propagation for Video Face Super-Resolution

RealViformer: Investigating Attention for Real-World Video Super-Resolution

Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence

Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems

Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction

Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction

Wavelet Convolutions for Large Receptive Fields

Long-term Temporal Context Gathering for Neural Video Compression

Implicit Neural Models to Extract Heart Rate from Video

A Watermark-Conditioned Diffusion Model for IP Protection

Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures

Image Manipulation Detection With Implicit Neural Representation and Limited Supervision

DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks

Learning Natural Consistency Representation for Face Forgery Video Detection

ARoFace: Alignment Robustness to Improve Low-quality Face Recognition

AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

PetFace: A Large-Scale Dataset and Benchmark for Animal Identification

Enhancing Cross-Subject fMRI-to-Video Decoding with Global-Local Functional Alignment

Occlusion-Aware Seamless Segmentation

Keypoint Promptable Re-Identification

CoTracker: It is Better to Track Together

Free Lunch for Gait Recognition: A Novel Relation Descriptor

S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition

SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition

Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition

Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition

Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment

Look Around and Learn: Self-Training Object Detection by Exploration

Interaction-centric Spatio-Temporal Context Reasoning for Multi-Person Video HOI Recognition

Self-Supervised Video Copy Localization with Regional Token Representation

General and Task-Oriented Video Segmentation

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Referring Atomic Video Action Recognition

Elysium: Exploring Object-level Perception in Videos through Semantic Integration Using MLLMs

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

Learning Video Context as Interleaved Multimodal Sequences

Multi-Modal Video Dialog State Tracking in the Wild

Towards Multimodal Sentiment Analysis Debiasing via Bias Purification

Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Rethinking Normalization Layers for Domain Generalizable Person Re-identification

Dual-stage Hyperspectral Image Classification Model with Spectral Supertoken

Learning Representations of Satellite Images From Metadata Supervision

Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring

Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition

AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale

QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

Navigation Instruction Generation with BEV Perception and Large Language Models

V-IRL: Grounding Virtual Intelligence in Real Life

M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Unifying 3D Vision-Language Understanding via Promptable Queries

UMBRAE: Unified Multimodal Brain Decoding

BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

CoReS: Orchestrating the Dance of Reasoning and Segmentation

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Grounding Language Models for Visual Entity Recognition

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding

EDformer: Transformer-Based Event Denoising Across Varied Noise Levels

Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

The Hard Positive Truth about Vision-Language Compositionality

HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs

LLMCO4MR: LLMs-aided Neural Combinatorial Optimization for Ancient Manuscript Restoration from Fragments with Case Studies on Dunhuang

Language-Image Pre-training with Long Captions

IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers

CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation

Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective

Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval

Cascade Prompt Learning for Visual-Language Model Adaptation

Gaze Target Detection Based on Head-Local-Global Coordination

Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model

ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

Towards Open-Ended Visual Recognition with Large Language Models

AFreeCA: Annotation-Free Counting for All

OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models

MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation

N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

Prioritized Semantic Learning for Zero-shot Instance Navigation

PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

ProMerge: Prompt and Merge for Unsupervised Instance Segmentation

Part2Object: Hierarchical Unsupervised 3D Instance Segmentation

Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation

Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond

UniFS: Universal Few-shot Instance Perception with Point Representations

Crowd-SAM: SAM as a smart annotator for object detection in crowded scenes

Adaptive Multi-task Learning for Few-shot Object Detection

FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection

Distilling Knowledge from Large-Scale Image Models for Object Detection

Revisiting Domain-Adaptive Object Detection in Adverse Weather by the Generation and Composition of High-Quality Pseudo-Labels

Operational Open-Set Recognition and PostMax Refinement

InfMAE: A Foundation Model in The Infrared Modality

AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking

Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-frequency Fusion and Uncertainty Correction

Effective Lymph Nodes Detection in CT Scans Using Location Debiased Query Selection and Contrastive Query Representation in Transformer

Snuffy: Efficient Whole Slide Image Classifier

Unified Medical Image Pre-training in Language-Guided Common Semantic Space

Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging

TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data

TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection

VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt

Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection

Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision

SAIR: Learning Semantic-aware Implicit Representation

Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

Learning with Unmasked Tokens Drives Stronger Vision Learners

Emerging Property of Masked Token for Effective Pre-training

Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

Tight and Efficient Upper Bound on Spectral Norm of Convolutional Layers

FYI: Flip Your Images for Dataset Distillation

Data-to-Model Distillation: Data-Efficient Learning Framework

Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection

Active Generation for Image Classification

Contrastive Learning with Synthetic Positives

Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

Robust Calibration of Large Vision-Language Adapters

Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models

FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning

Benchmarking Spurious Bias in Few-Shot Image Classifiers

An Information Theoretical View for Out-Of-Distribution Detection

ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

Adapting to Shifting Correlations with Unlabeled Data Calibration

Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels

On Pretraining Data Diversity for Self-Supervised Learning

De-Confusing Pseudo-Labels in Source-Free Domain Adaptation

Improving Unsupervised Domain Adaptation: A Pseudo-Candidate Set Approach

Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation

Source-Free Domain-Invariant Performance Prediction

Learning to Complement and to Defer to Multiple Users

Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation

Personalized Federated Domain-Incremental Learning based on Adaptive Knowledge Matching

Revisiting Supervision for Continual Representation Learning

Deep Companion Learning: Enhancing Generalization Through Historical Consistency

Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy

Harmonizing Knowledge Transfer in Neural Network with Unified Distillation

Feature Diversification and Adaptation for Federated Domain Generalization

PFedEdit: Personalized Federated Learning via Automated Model Editing

Enhanced Sparsification via Stimulative Training

Dependency-aware Differentiable Neural Architecture Search

Layer-Wise Relevance Propagation with Conservation Property for ResNet

Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning

Training A Secure Model against Data-Free Model Extraction

CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks

Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection

Leveraging Imperfect Restoration for Data Availability Attack

Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs

Augmented Neural Fine-tuning for Efficient Backdoor Purification

(ends 12:30 PM)

12:30 p.m.

Lunch:

Lunch

(ends 1:30 PM)

1:30 p.m.

Oral 4A: Neural 3D Rendering [1:30-3:30]

Orals 1:30-3:20

[1:30] Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

[1:40] Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering

[1:50] Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration

[2:00] FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information

[2:10] RaFE: Generative Radiance Fields Restoration

[2:20] Watch Your Steps: Local Image and Scene Editing by Text Instructions

[2:30] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

[2:40] RPBG: Towards Robust Neural Point-based Graphics in the Wild

[2:50] Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields

[3:00] Learning 3D-aware GANs from Unposed Images with Template Feature Field

[3:10] MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition

(ends 3:30 PM)

Oral 4B: Video Generation / Editing / Prediction [1:30-3:30]

Orals 1:30-3:20

[1:30] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

[1:40] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

[1:50] Efficient Neural Video Representation with Temporally Coherent Modulation

[2:00] Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation

[2:10] Video Editing via Factorized Diffusion Distillation

[2:20] ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

[2:30] Audio-Synchronized Visual Animation

[2:40] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

[2:50] MotionDirector: Motion Customization of Text-to-Video Diffusion Models

[3:00] ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model

[3:10] Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction

(ends 3:30 PM)

Oral 4C: Humans: Biometrics, Pose And Motion [1:30-3:30]

Orals 1:30-3:20

[1:30] AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild

[1:40] Sapiens: Foundation for Human Vision Models

[1:50] POET: Prompt Offset Tuning for Continual Human Action Adaptation

[2:00] Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation

[2:10] SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

[2:20] UGG: Unified Generative Grasping

[2:30] NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model

[2:40] Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models

[2:50] LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment

[3:00] Controllable Human-Object Interaction Synthesis

[3:10] NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction

(ends 3:30 PM)

2:30 p.m.

Demo Session 2B [2:30-6:00]

Demonstrations 2:30-6:00

Automating Parasite Egg Detection: Artificial Intelligence based Kubic FLOTAC microscope (KFM)

Controllable Neural Reconstruction for Autonomous Driving

Live Demo of Matching and Dense 3D Reconstruction with MASt3R

ON-STAGE 3D: Link-based Investigation into Spatial Iconographic Heritage

Spiky DVS Piano

(ends 6:00 PM)

3:30 p.m.

Keynote:

Fair, transparent, and accountable AI: What is legally required, what is ethically desired, and what is technically feasible?

Sandra Wachter

(ends 4:30 PM)

4:30 p.m.

Poster Session 4 [4:30-6:30]

Posters 4:30-6:30

TimeLens-XL: Real-time Event-based Video Frame Interpolation with Large Motion

MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition

RaFE: Generative Radiance Fields Restoration

Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration

FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information

Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields

RPBG: Towards Robust Neural Point-based Graphics in the Wild

MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

Learning 3D-aware GANs from Unposed Images with Template Feature Field

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

Watch Your Steps: Local Image and Scene Editing by Text Instructions

Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering

Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction

ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Video Editing via Factorized Diffusion Distillation

Efficient Neural Video Representation with Temporally Coherent Modulation

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction

UGG: Unified Generative Grasping

LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment

Controllable Human-Object Interaction Synthesis

Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models

Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation

POET: Prompt Offset Tuning for Continual Human Action Adaptation

NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model

AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild

Sapiens: Foundation for Human Vision Models

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

Modeling and Driving Human Body Soundfields through Acoustic Primitives

Let the Avatar Talk using Texts without Paired Training Data

CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images

Relightable Neural Actor with Intrinsic Decomposition and Pose Control

3R-INN: How to be climate friendly while consuming/delivering videos?

Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement

Intrinsic Single-Image HDR Reconstruction

Domain Reduction Strategy for Non-Line-of-Sight Imaging

Synthesizing Time-varying BRDFs via Latent Space

Parameterization-driven Neural Surface Reconstruction for Object-oriented Editing in Neural Rendering

Instant Uncertainty Calibration of NeRFs Using a Meta-Calibrator

GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views

Content-Aware Radiance Fields: Aligning Model Complexity with Scene Intricacy Through Learned Bitwidth Quantization

Collaborative Control for Geometry-Conditioned PBR Image Generation

KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter

Weight Conditioning for Smooth Optimization of Neural Networks

URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields

MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo

TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks

FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion

DoubleTake: Geometry Guided Depth Estimation

Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal

SAGS: Structure-Aware 3D Gaussian Splatting

Compact 3D Scene Representation via Self-Organizing Gaussian Grids

HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression

GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction

Concise Plane Arrangements for Low-Poly Surface and Volume Modelling

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis

STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians

Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable Repainting

GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning

Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation

FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis

Retargeting Visual Data with Deformation Fields

LatentEditor: Text Driven Local Editing of 3D Scenes

StyleCity: Large-Scale 3D Urban Scenes Stylization

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering

Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation

DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

InterFusion: Text-Driven Generation of 3D Human-Object Interaction

Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models

AWOL: Analysis WithOut synthesis using Language

Improving Virtual Try-On with Garment-focused Diffusion Models

GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects

Generating 3D House Wireframes with Semantics

LayoutFlow: Flow Matching for Layout Generation

Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching

Scalar Function Topology Divergence: Comparing Topology of 3D Objects

DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction

Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement

FLAT: Flux-aware Imperceptible Adversarial Attacks on 3D Point Clouds

Frugal 3D Point Cloud Model Training via Progressive Near Point Filtering and Fused Aggregation

SemReg: Semantics Constrained Point Cloud Registration

GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding

Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning

RangeLDM: Fast Realistic LiDAR Point Cloud Generation

Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs

Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization

Adaptive Annealing for Robust Averaging

Resolving Scale Ambiguity in Multi-view 3D Reconstruction using Dual-Pixel Sensors

Consistent 3D Line Mapping

Robust Incremental Structure-from-Motion with Hybrid Features

Gravity-aligned Rotation Averaging with Circular Regression

GeoCalib: Learning Single-image Calibration with Geometric Optimization

Real-time Holistic Robot Pose Estimation with Unknown States

Learning Neural Volumetric Pose Features for Camera Localization

LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation

SCAPE: A Simple and Strong Category-Agnostic Pose Estimator

Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues

Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses

MLPHand: Real Time Multi-View 3D Hand Reconstruction via MLP Modeling

WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation

RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency

An Economic Framework for 6-DoF Grasp Detection

SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation

OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations

ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion

Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion

SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning

Reinforcement Learning Meets Visual Odometry

Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Camera-LiDAR Cross-modality Gait Recognition

TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Autonomous Driving

3D Single-object Tracking in Point Clouds with High Temporal Variation

LISO: Lidar-only Self-Supervised 3D Object Detection

MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection

IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception

MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty

Reliability in Semantic Segmentation: Can We Use Synthetic Data?

DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

Fully Sparse 3D Occupancy Prediction

EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding

Continuity Preserving Online CenterLine Graph Learning

FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving

Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)

Solving Motion Planning Tasks with a Scalable Generative Model

Enhanced Motion Forecasting with Visual Relation Reasoning

OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

Event-Aided Time-To-Collision Estimation for Autonomous Driving

Event-based Mosaicing Bundle Adjustment

Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations

AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Learning-based Axial Video Motion Magnification

Motion Keyframe Interpolation for Any Human Skeleton using Point Cloud-based Human Motion Data Homogenisation

Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs

Scalable Group Choreography via Variational Phase Manifold Learning

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation

Drag Anything: Motion Control for Anything using Entity Representation

Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores

Audio-Synchronized Visual Animation

E.T. the Exceptional Trajectory: Text-to-camera-trajectory generation with character awareness

MotionDirector: Motion Customization of Text-to-Video Diffusion Models

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

Object-Centric Diffusion for Efficient Video Editing

GroupDiff: Diffusion-based Group Portrait Editing

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

Towards compact reversible image representations for neural style transfer

InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser

SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing

When and How do negative prompts take effect?

SPIRE: Semantic Prompt-Driven Image Restoration

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models

Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

LogoSticker: Inserting Logos into Diffusion Models for Customized Generation

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models

Implicit Concept Removal of Diffusion Models

NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image

Global Counterfactual Directions

Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance

Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization

AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation

Beta-Tuned Timestep Diffusion Model

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation

InstructIR: High-Quality Image Restoration Following Human Instructions

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models

OneRestore: A Universal Restoration Framework for Composite Degradation

UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt

Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution

When Fast Fourier Transform Meets Transformer for Image Restoration

Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation

SuperGaussian: Repurposing Video Models for 3D Super Resolution

Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image Denoisers

Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients

Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems

Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network

Spike-Temporal Latent Representation for Energy-Efficient Event-to-Video Reconstruction

Exploring Vulnerabilities in Spiking Neural Networks: Direct Adversarial Attacks on Raw Event Data

A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks

Skeleton Recall Loss for Connectivity Conserving and Resource Efficient Segmentation of Thin Tubular Structures

Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection

Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing

Real Appearance Modeling for More General Deepfake Detection

SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder

Norface: Improving Facial Expression Analysis by Identity Normalization

Open-Set Biometrics: Beyond Good Closed-Set Models

Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals

PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion

Enhancing Tracking Robustness with Auxiliary Adversarial Defense Networks

SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking

Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition

VSViG: Real-time Video-based Seizure Detection via Skeleton-based Spatiotemporal ViG

Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation

Elucidating the Hierarchical Nature of Behavior with Masked Autoencoders

FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition

Bayesian Evidential Deep Learning for Online Action Detection

Event Camera Data Dense Pre-training

Unsupervised Moving Object Segmentation with Atmospheric Turbulence

Beyond MOT: Semantic Multi-Object Tracking

MRSP: Learn Multi-Representations of Single Primitive for Compositional Zero-Shot Learning

Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition

Open Vocabulary Multi-Label Video Classification

R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Leveraging temporal contextualization for video action recognition

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

Label-anticipated Event Disentanglement for Audio-Visual Video Parsing

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Uncertainty-aware sign language video retrieval with probability distribution modeling

NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition

Domain Shifting: A Generalized Solution for Heterogeneous Cross-Modality Person Re-Identification

HyTAS: A Hyperspectral Image Transformer Architecture Search Benchmark and Analysis

VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition

Embodied Understanding of Driving Scenarios

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Finding Visual Task Vectors

ControlLLM: Augment Language Models with Tools by Searching on Graphs

ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities

Uni3DL: A Unified Model for 3D Vision-Language Understanding

CrossScore: A Multi-View Approach to Image Evaluation and Scoring

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models

MoAI: Mixture of All Intelligence for Large Language and Vision Models

Training A Small Emotional Vision Language Model for Visual Art Comprehension

Quantized Prompt for Efficient Generalization of Vision-Language Models

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

MultiGen: Zero-shot Image Generation from Multi-modal Prompts

Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors

VeCLIP: Improving CLIP Training via Visual-enriched Captions

ControlCap: Controllable Region-level Captioning

Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models

Look Hear: Gaze Prediction for Speech-directed Human Attention

Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection

LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models

Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy

Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection

Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation

Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection

Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation

SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding

LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

EAFormer: Scene Text Segmentation with Edge-Aware Transformers

CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

Textual Query-Driven Mask Transformer for Domain Generalized Segmentation

Attention Decomposition for Cross-Domain Semantic Segmentation

SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

MC-PanDA: Mask Confidence for Panoptic Domain Adaptation

OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing

Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation

Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation

Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation

ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation

On-the-fly Category Discovery for LiDAR Semantic Segmentation

CONDA: Condensed Deep Association Learning for Co-Salient Object Detection

General Geometry-aware Weakly Supervised 3D Object Detection

CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection

MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks

Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights

Rethinking Features-Fused-Pyramid-Neck for Object Detection

3D Small Object Detection with Dynamic Spatial Pruning

Watching it in Dark: A Target-aware Representation Learning Framework for High-Level Vision Tasks in Low Illumination

Gradient-Aware for Class-Imbalanced Semi-supervised Medical Image Segmentation

Test-Time Stain Adaptation with Diffusion Models for Histopathology Image Classification

WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering

ChEX: Interactive Localization and Region Description in Chest X-rays

A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization

Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection

Random Walk on Pixel Manifolds for Anomaly Segmentation of Complex Driving Scenes

FedVAD: Enhancing Federated Video Anomaly Detection with GPT-Driven Semantic Distillation

Efficient Training of Spiking Neural Networks with Multi-Parallel Implicit Stream Architecture

DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation

SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization

SeiT++: Masked Token Modeling Improves Storage-efficient Training

AMD: Automatic Multi-step Distillation of Large-scale Vision Models

Stitched ViTs are Flexible Vision Backbones

MetaAug: Meta-Data Augmentation for Post-Training Quantization

Straightforward Layer-wise Pruning for More Efficient Visual Adaptation

On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition

Robust Multimodal Learning via Representation Decoupling

SUMix: Mixup with Semantic and Uncertain Information

Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning

Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning

Linking in Style: Understanding learned features in deep learning models

Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort

Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning

Strike a Balance in Continual Panoptic Segmentation

IGNORE: Information Gap-based False Negative Loss Rejection for Single Positive Multi-Label Learning

Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning

Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation

Learning to Distinguish Samples for Generalized Category Discovery

Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data

HVCLIP: High-dimensional Vector in CLIP for Unsupervised Domain Adaptation

DiffClass: Diffusion-Based Class Incremental Learning

Direct Distillation between Different Domains

MemBN: Robust Test-Time Adaptation via Batch Norm with Statistics Memory

PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning

PromptFusion: Decoupling Stability and Plasticity for Continual Learning

One-stage Prompt-based Continual Learning

Is Retain Set All You Need in Machine Unlearning? Restoring Performance of Unlearned Models with Out-Of-Distribution Images

Idling Neurons, Appropriately Lenient Workload During Fine-tuning Leads to Better Generalization

How to Train the Teacher Model for Effective Knowledge Distillation

Local and Global Flatness for Federated Domain Generalization

Dataset Quantization with Active Learning based Adaptive Sampling

DεpS: Delayed ε-Shrinking for Faster Once-For-All Training

Auto-DAS: Automated Proxy Discovery for Training-free Distillation-aware Architecture Search

On Spectral Properties of Gradient-based Explanation Methods

Cross-Input Certified Training for Universal Perturbations

Interpretability-Guided Test-Time Adversarial Defense

Exploring Guided Sampling of Conditional GANs

Self-Supervised Representation Learning for Adversarial Attack Detection

Non-transferable Pruning

On the Vulnerability of Skip Connections to Model Inversion Attacks

Clean & Compact: Efficient Data-Free Backdoor Defense with Model Compactness

Spiking Wavelet Transformer

PFGS: High Fidelity Point Cloud Rendering via Feature Splatting

(ends 6:30 PM)

Break:

Coffee Break

(ends 5:00 PM)

THU 3 OCT

8 a.m.

Registration

(ends 6:30 PM)

9 a.m.

Demo Session 3A [9:00-12:30]

Demonstrations 9:00-12:30

COMO: Compact Mapping and Odometry

H-Unique: 3D Hand Reconstruction and Automated Mapping of Anatomical Detail for Forensic Identification

Multi-Setup Depth Perception through Virtual Image Hallucination

ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image

Showcase: Contrasting Deepfakes Embeddings

(ends 12:30 PM)

Oral 5A: Segmentation [9:00-10:30]

Orals 9:00-10:20

[9:00] WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models

[9:10] AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation

[9:20] CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model

[9:30] Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

[9:40] Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels

[9:50] ActionVOS: Actions as Prompts for Video Object Segmentation

[10:00] Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities

[10:10] Diffusion Models for Open-Vocabulary Segmentation

(ends 10:30 AM)

Oral 5B: Vision Applications [9:00-10:30]

Orals 9:00-10:20

[9:00] Robust Fitting on a Gate Quantum Computer

[9:10] Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views

[9:20] Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

[9:30] MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery

[9:40] Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception

[9:50] Faceptor: A Generalist Model for Face Perception

[10:00] A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability

[10:10] COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation

(ends 10:30 AM)

Oral 5C: Representation Learning [9:00-10:30]

Orals 9:00-10:20

[9:00] PiTe: Pixel-Temporal Alignment for Large Video-Language Model

[9:10] Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization

[9:20] Emergent Visual-Semantic Hierarchies in Image-Text Representations

[9:30] Learning Multimodal Latent Generative Models with Energy-Based Prior

[9:40] Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

[9:50] SINDER: Repairing the Singular Defects of DINOv2

[10:00] Denoising Vision Transformers

[10:10] Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking

(ends 10:30 AM)

10:30 a.m.

Poster Session 5 [10:30-12:30]

Posters 10:30-12:30

Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering

Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities

Diffusion Models for Open-Vocabulary Segmentation

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model

Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels

ActionVOS: Actions as Prompts for Video Object Segmentation

WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models

A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability

COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation

Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views

MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery

Faceptor: A Generalist Model for Face Perception

Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking

Learning Multimodal Latent Generative Models with Energy-Based Prior

Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization

SINDER: Repairing the Singular Defects of DINOv2

Emergent Visual-Semantic Hierarchies in Image-Text Representations

PiTe: Pixel-Temporal Alignment for Large Video-Language Model

Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

Denoising Vision Transformers

Audio-driven Talking Face Generation with Stabilized Synchronization Loss

ScanTalk: 3D Talking Heads from Unregistered Scans

Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

Fast Registration of Photorealistic Avatars for VR Facial Animation

MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation

Learning to Robustly Reconstruct Dynamic Scenes from Low-light Spike Streams

Wavelength-Embedding-guided Filter-Array Transformer for Spectral Demosaicing

Learned HDR Image Compression for Perceptually Optimal Storage and Display

Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging

Leveraging Thermal Modality to Enhance Reconstruction in Low-Light Conditions

The Sky's the Limit: Relightable Outdoor Scenes via a Sky-pixel Constrained Illumination Prior and Outside-In Visibility

A Probability-guided Sampler for Neural Implicit Surface Rendering

REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices

Dynamic Neural Radiance Field From Defocused Monocular Video

VersatileGaussian: Real-time Neural Rendering for Versatile Tasks using Gaussian Splatting

DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes

NeRF-XL: NeRF at Any Scale with Multi-GPU

G2fR: Frequency Regularization in Grid-based Feature Encoding Neural Radiance Fields

InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction

MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections

Disentangled Generation and Aggregation for Robust Radiance Fields

CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians

SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians

Surface Reconstruction for 3D Gaussian Splatting via Local Structural Hints

Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting

An Adaptive Screen-Space Meshing Approach for Normal Integration

Fast View Synthesis of Casual Videos with Soup-of-Planes

4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation

GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Viewpoint textual inversion: discovering scene representations and 3D view control in 2D diffusion models

ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

External Knowledge Enhanced 3D Scene Generation from Sketch

EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

3DEgo: 3D Editing on the Go!

Learning Pseudo 3D Guidance for View-consistent Texturing with 2D Diffusion

JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

Diverse Text-to-3D Synthesis with Augmented Text Embedding

SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers

CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches

Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment

DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose

Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling

LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation

Learned Neural Physics Simulation for Articulated 3D Human Pose Reconstruction

Data Collection-free Masked Video Modeling

Vista3D: unravel the 3d darkside of a single image

Diff-Reg: Diffusion Model in Doubly Stochastic Matrix Space for Registration Problem

NICP: Neural ICP for 3D Human Registration at Scale

TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds

EINet: Point Cloud Completion via Extrapolation and Interpolation

DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction

Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning

CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection

Formula-Supervised Visual-Geometric Pre-training

Canonical Shape Projection is All You Need for 3D Few-shot Class Incremental Learning

Raising the Ceiling: Conflict-Free Local Feature Matching with Dynamic View Switching

DGD: Dynamic 3D Gaussians Distillation

SHIC: Shape-Image Correspondences with no Keypoint Supervision

LineFit: A Geometric Approach for Fitting Line Segments in Images

Global Structure-from-Motion Revisited

Robust Fitting on a Gate Quantum Computer

The Nerfect Match: Exploring NeRF Features for Visual Localization

A Cephalometric Landmark Regression Method based on Dual-encoder for High-resolution X-ray Image

FoundPose: Unseen Object Pose Estimation with Foundation Features

PoseSOR: Human Pose Can Guide Our Attention

A Graph-Based Approach for Category-Agnostic Pose Estimation

3DSA: Multi-View 3D Human Pose Estimation With 3D Space Attention Mechanisms

HPE-Li: WiFi-enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation

HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation

WHAC: World-grounded Humans and Cameras

EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset

3D Human Pose Estimation via Non-Causal Retentive Networks

Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation

Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs

R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding

Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation

FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

Möbius Transform for Mitigating Perspective Distortions in Representation Learning

UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation

DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences

HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras

SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras

Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance

Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection

LiDAR-based All-weather 3D Object Detection via Prompting and Distilling 4D Radar

SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather

Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception

SkyScenes: A Synthetic Dataset for Aerial Scene Understanding

DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model

UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction

VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

Stream Query Denoising for Vectorized HD-Map Construction

Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention

Early Anticipation of Driving Maneuvers

Adaptive Human Trajectory Prediction via Latent Corridors

Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model

Probabilistic Weather Forecasting with Deterministic Guidance-based Diffusion Model

Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation

Temporal Event Stereo via Joint Learning with Stereoscopic Flow

FARSE-CNN: Fully Asynchronous, Recurrent and Sparse Event-Based CNN

Event-Adapted Video Super-Resolution

Diffusion Models as Optimizers for Efficient Planning in Offline RL

Scene-aware Human Motion Forecasting via Mutual Distance Prediction

CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

Bridging the Gap Between Human Motion and Action Semantics via Kinematics Phrases

CoMo: Controllable Motion Generation through Language Guided Pose Code Editing

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Co-speech Gesture Video Generation with 3D Human Meshes

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

MEVG: Multi-event Video Generation with Text-to-Video Models

HARIVO: Harnessing Text-to-Image Models for Video Generation

WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing

RegionDrag: Fast Region-Based Image Editing with Diffusion Models

TurboEdit: Real-time text-based disentangled real image editing

Factorized Diffusion: Perceptual Illusions by Noise Decomposition

DiffusionPen: Towards Controlling the Style of Handwritten Text Generation

ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs

Scaling Up Personalized Image Aesthetic Assessment via Task Vector Customization

FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation

AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation

BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion

Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

MONTRAGE: Monitoring Training for Attribution of Generative Diffusion Models

ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation

Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now

The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations

Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution

DomainFusion: Generalizing To Unseen Domains with Latent Diffusion Models

AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation

Memory-Efficient Fine-Tuning for Quantized Diffusion Model

SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation

Diffusion for Natural Image Matting

Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

TTT-MIM: Test-Time Training with Masked Image Modeling for Denoising Distribution Shifts

Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration

Confidence-Based Iterative Generation for Real-World Image Super-Resolution

Efficient Frequency-Domain Image Deraining with Contrastive Regularization

Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding

SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging

Rethinking Image Super Resolution from Training Data Perspectives

Accelerating Image Super-Resolution Networks with Pixel-Level Classification

Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks

Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model

Uncertainty-Driven Spectral Compressive Imaging with Spatial-Frequency Transformer

Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing

Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers

RadEdit: stress-testing biomedical vision models via diffusion image editing

Rate-Distortion-Cognition Controllable Versatile Neural Image Compression

Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design

Fast Encoding and Decoding for Implicit Video Representation

Implicit Steganography Beyond the Constraints of Modality

Certifiably Robust Image Watermark

DSA: Discriminative Scatter Analysis for Early Smoke Segmentation

AdaIFL: Adaptive Image Forgery Localization via a Dynamic and Importance-aware Transformer Network

DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Face Reconstruction Transfer Attack as Out-of-Distribution Generalization

Toward Tiny and High-quality Facial Makeup with Data Amplify Learning

Facial Affective Behavior Analysis with Instruction Tuning

VideoClusterNet: Self-Supervised and Adaptive Face Clustering for Videos

When Do We Not Need Larger Vision Models?

Open Panoramic Segmentation

PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking

Self-Supervised Any-Point Tracking by Contrastive Random Walks

WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing

Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition

EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

Trajectory-aligned Space-time Tokens for Few-shot Action Recognition

ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos

Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning

OMR: Occlusion-Aware Memory-Based Refinement for Video Lane Detection

Improving Video Segmentation via Dynamic Anchor Queries

VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement

Merlin: Empowering Multimodal LLMs with Foresight Minds

STSP: Spatial-Temporal Subspace Projection for Video Class-incremental Learning

UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Contextual Correspondence Matters: Bidirectional Graph Matching for Video Summarization

Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment

AMEGO: Active Memory from long EGOcentric videos

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning

Delving Deep into Engagement Prediction of Short Videos

LITA: Language Instructed Temporal-Localization Assistant

CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Siamese Vision Transformers are Scalable Audio-visual Learners

EvSign: Sign Language Recognition and Translation with Streaming Events

WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Masked Angle-Aware Autoencoder for Remote Sensing Images

Revisit Anything: Visual Place Recognition via Image Segment Retrieval

Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL

Reinforcement Learning Friendly Vision-Language Model for Minecraft

DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control

See and Think: Embodied Agent in Virtual Environment

PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Multi-Task Domain Adaptation for Language Grounding with 3D Objects

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

Object-Oriented Anchoring and Modal Alignment in Multimodal Learning

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models

Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

ReGround: Improving Textual and Spatial Grounding at No Cost

Platypus: A Generalized Specialist Model for Reading Text in Various Forms

Long-CLIP: Unlocking the Long-Text Capability of CLIP

Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

Tokenize Anything via Prompting

FuseTeacher: Modality-fused Encoders are Strong Vision Supervisors

De-confounded Gaze Estimation

GalLop: Learning global and local prompts for vision-language models

OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection

CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Can OOD Object Detectors Learn from Foundation Models?

VEON: Vocabulary-Enhanced Occupancy Prediction

Efficient Vision Transformers with Partial Attention

SAFARI: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

ReMamber: Referring Image Segmentation with Mamba Twister

Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties

Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets

Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation

View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation

PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels

Continual Learning and Unknown Object Discovery in 3D Scenes via Self-Distillation

Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation

Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier

Bayesian Self-Training for Semi-Supervised 3D Segmentation

Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation

CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection

Interactive 3D Object Detection with Prompts

SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection

Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection

Benchmarking Object Detectors with COCO: A New Path Forward

Frequency-Spatial Entanglement Learning for Camouflaged Object Detection

GRA: Detecting Oriented Objects through Group-wise Rotating and Attention

DQ-DETR: DETR with Dynamic Query for Tiny Object Detection

AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval

Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation

Unleashing the Power of Prompt-driven Nucleus Instance Segmentation

cDP-MIL: Robust Multiple Instance Learning via Cascaded Dirichlet Process

Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification

Learning with Counterfactual Explanations for Radiology Report Generation

Improving Medical Multi-modal Contrastive Learning with Expert Annotations

Few-shot Defect Image Generation based on Consistency Modeling

Placing Objects in Context via Inpainting for Out-of-distribution Segmentation

Learning Diffusion Models for Multi-View Anomaly Detection

Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection

Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models

Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent

Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training

SNP: Structured Neuron-level Pruning to Preserve Attention Scores

Tiny Models are the Computational Saver for Large Models

Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Trainable Highly-expressive Activation Functions

HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion

To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of Point Cloud Transfer Learning

SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning

Linearly Controllable GAN: Unsupervised Feature Categorization and Decomposition for Image Generation and Manipulation

Diagnosing and Re-learning for Balanced Multimodal Learning

Visual Prompting via Partial Optimal Transport

Pseudo-Labelling Should Be Aware of Disguising Channel Activations

Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

Unsupervised Representation Learning by Balanced Self Attention Matching

Optimal Transport of Diverse Unsupervised Tasks for Robust Learning from Noisy Few-Shot Data

Gradient-based Out-of-Distribution Detection

SLIM: Spuriousness Mitigation with Minimal Human Annotations

Modeling Label Correlations with Latent Context for Multi-Label Recognition

Rebalancing Using Estimated Class Distribution for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch

Foster Adaptivity and Balance in Learning with Noisy Labels

Self-Guided Generation of Minority Samples Using Diffusion Models

Self-Cooperation Knowledge Distillation for Novel Class Discovery

Non-Exemplar Domain Incremental Learning via Cross-Domain Concept Integration

Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams

Few-shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt

Exemplar-free Continual Representation Learning via Learnable Drift Compensation

Open-World Dynamic Prompt and Continual Visual Representation Learning

Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks

Simple Unsupervised Knowledge Distillation With Space Similarity

AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition

Dataset Growth

Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation

MO-EMT-NAS: Multi-Objective Continuous Transfer of Architectural Knowledge Between Tasks from Different Datasets

BAFFLE: A Baseline of Backpropagation-Free Federated Learning

On the Evaluation Consistency of Attribution-based Explanations

Debiasing surgeon: fantastic weights and how to find them

Auto-GAS: Automated Proxy Discovery for Training-free Generative Architecture Search

Improving Adversarial Transferability via Model Alignment

Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation

Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures

CipherDM: Secure Three-Party Inference for Diffusion Model Sampling

UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

StereoGlue: Joint Feature Matching and Robust Estimation

ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency

(ends 12:30 PM)

Break:

Coffee Break

(ends 11:00 AM)

12:30 p.m.

Lunch:

Lunch

(ends 1:30 PM)

1:30 p.m.

Oral 6A: Generative Models II [1:30-3:30]

Orals 1:30-3:20

[1:30] Controlling the World by Sleight of Hand

[1:40] Pyramid Diffusion for Fine 3D Large Scene Generation

[1:50] FMBoost: Boosting Latent Diffusion with Flow Matching

[2:00] ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

[2:10] Exact Diffusion Inversion via Bidirectional Integration Approximation

[2:20] Tackling Structural Hallucination in Image Translation with Local Diffusion

[2:30] Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

[2:40] Adversarial Diffusion Distillation

[2:50] Arc2Face: A Foundation Model for ID-Consistent Human Faces

[3:00] Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning

[3:10] OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model

(ends 3:30 PM)

Oral 6B: Video Understanding [1:30-3:30]

Orals 1:30-3:20

[1:30] E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

[1:40] Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

[1:50] Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

[2:00] MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment

[2:10] C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

[2:20] LongVLM: Efficient Long Video Understanding via Large Language Models

[2:30] Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

[2:40] Towards Neuro-Symbolic Video Understanding

[2:50] Classification Matters: Improving Video Action Detection with Class-Specific Attention

[3:00] DEVIAS: Learning Disentangled Video Representations of Action and Scene

[3:10] Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets

(ends 3:30 PM)

Oral 6C: Vision And Other Modalities [1:30-3:30]

Orals 1:30-3:20

[1:30] GiT: Towards Generalist Vision Transformer through Universal Language Interface

[1:40] Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

[1:50] Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

[2:00] MMBENCH: Is Your Multi-Modal Model an All-around Player?

[2:10] Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

[2:20] Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

[2:30] A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars

[2:40] HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

[2:50] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

[3:00] uCAP: An Unsupervised Prompting Method for Vision-Language Models

[3:10] BRAVE: Broadening the visual encoding of vision-language models

(ends 3:30 PM)

2:30 p.m.

Demo Session 3B [2:30-6:00]

Demonstrations 2:30-6:00

Automatic Data Curation for Self-Supervised Learning of Visual Features

Fruit Ninja with an Event Camera

Open-Vocabulary Interactive 3D Scenes with Spot

R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

ViPer: Visual Personalization of Generative Models via Individual Preference Learning

(ends 6:00 PM)

3:30 p.m.

Keynote:

Is distribution shift still an AI problem?

Sanmi Koyejo

(ends 4:30 PM)

4:30 p.m.

Poster Session 6 [4:30-6:30]

Posters 4:30-6:30

Exact Diffusion Inversion via Bidirectional Integration Approximation

ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

Tackling Structural Hallucination in Image Translation with Local Diffusion

Adversarial Diffusion Distillation

Pyramid Diffusion for Fine 3D Large Scene Generation

Controlling the World by Sleight of Hand

Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning

OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model

MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment

C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

Towards Neuro-Symbolic Video Understanding

DEVIAS: Learning Disentangled Video Representations of Action and Scene

Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets

E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

LongVLM: Efficient Long Video Understanding via Large Language Models

Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars

Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

BRAVE: Broadening the visual encoding of vision-language models

MMBENCH: Is Your Multi-Modal Model an All-around Player?

uCAP: An Unsupervised Prompting Method for Vision-Language Models

HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°

Tri²-plane: Thinking Head Avatar via Feature Pyramid

AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

AnimateMe: 4D Facial Expressions via Diffusion Models

Real-data-driven 2000 FPS Color Video from Mosaicked Chromatic Spikes

Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography

Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats

Self-Supervised Underwater Caustics Removal and Descattering via Deep Monocular SLAM

Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis

Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending

UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation

City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web

Few-shot NeRF by Adaptive Rendering Loss Regularization

BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting

Generalizable Human Gaussians for Sparse View Synthesis

Invertible Neural Warp for NeRF

PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects

Improving Neural Surface Reconstruction with Feature Priors from Multi-View Images

SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization

Gaussian in the wild: 3D Gaussian Splatting for Unconstrained Image Collections

3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting

HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes

GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering

EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS

End-to-End Rate-Distortion Optimized 3D Gaussian Representation

DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting

Human Hair Reconstruction with Strand-Aligned 3D Gaussians

Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting

Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views

SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

Sketch2Vox: Learning 3D Reconstruction from a Single Monocular Sketch Image

Lagrangian Hashing for Compressed Neural Field Representations

GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing

Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation

TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling

Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation

LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis

Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation

Synthesizing Environment-Specific People in Photographs

Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

Shapefusion: 3D localized human diffusion models

Fast Sprite Decomposition from Animated Graphics

Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution

WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation

Dolfin: Diffusion Layout Transformers without Autoencoder

MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes

RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion

Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds

FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation

T-CorresNet: Template Guided 3D Point Cloud Completion with Correspondence Pooling Query Generation Strategy

SEED: A Simple and Effective 3D DETR in Point Clouds

ProtoComp: Diverse Point Cloud Completion with Controllable Prototype

CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation

Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes

Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains

Multi-modal Relation Distillation for Unified 3D Representation Learning

NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Single-Photon 3D Imaging with Equi-Depth Photon Histograms

Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment

SelfGeo: Self-supervised and Geodesic-consistent Estimation of Keypoints on Deformable Shapes

Leveraging scale- and orientation-covariant features for planar motion estimation

Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision

TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly

SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction

VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

Human Pose Recognition via Occlusion-Preserving Abstract Images

RT-Pose: A 4D Radar-Tensor based 3D Human Pose Estimation and Localization Benchmark

6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry

HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

On the Utility of 3D Hand Poses for Action Recognition

Multi-Person Pose Forecasting with Individual Interaction Perceptron and Prior Learning

ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation

Revisit Self-supervision with Local Structure-from-Motion

AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation

High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior

Weakly-supervised Camera Localization by Ground-to-satellite Image Registration

Benchmarking the Robustness of Cross-view Geo-localization Models

Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance

Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training

LEROjD: Lidar Extended Radar-Only Object Detection

Towards Stable 3D Object Detection

ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

EgoPet: Egomotion and Interaction Data from an Animal's Perspective

WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

Beyond the Data Imbalance: Employing the Heterogeneous Datasets for Vehicle Maneuver Prediction

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

ADMap: Anti-disturbance Framework for Vectorized HD Map Construction

Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction

CarFormer: Self-Driving with Learned Object-Centric Representations

DySeT: a Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction

NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving

Visual Relationship Transformation

Local All-Pair Correspondence for Point Tracking

Un-EVIMO: Unsupervised Event-based Independent Motion Segmentation

Edge-Guided Fusion and Motion Augmentation for Event-Image Stereo

Physical-Based Event Camera Simulator

REDIR: Refocus-free Event-based De-occlusion Image Reconstruction

Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising

Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation

DragAPart: Learning a Part-Level Motion Prior for Articulated Objects

Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction

HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions

Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model

Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

Self-Supervised Audio-Visual Soundscape Stylization

TC4D: Trajectory-Conditioned Text-to-4D Generation

LivePhoto: Real Image Animation with Text-guided Motion Control

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Photorealistic Video Generation with Diffusion Models

High-Fidelity and Transferable NeRF Editing by Frequency Decomposition

Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation

Editable Image Elements for Controllable Synthesis

Implicit Style-Content Separation using B-LoRA

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

EraseDraw: Learning to Insert Objects by Erasing Them from Images

Text2Place: Affordance-aware Text Guided Human Placement

ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation

Label-free Neural Semantic Image Synthesis

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Context Diffusion: In-Context Aware Image Generation

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Stable Preference: Redefining training paradigm of human preference model for Text-to-Image Synthesis

SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models

Large-scale Reinforcement Learning for Diffusion Models

Latent Guard: a Safety Framework for Text-to-image Generation

Arc2Face: A Foundation Model for ID-Consistent Human Faces

GAMMA-FACE: GAussian Mixture Models Amend Diffusion Models for Bias Mitigation in Face Images

Closed-Loop Unsupervised Representation Disentanglement with $\beta$-VAE Distillation and Diffusion Probabilistic Feedback

Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion

Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis

FMBoost: Boosting Latent Diffusion with Flow Matching

AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation

Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

L-DiffER: Single Image Reflection Removal with Language-based Diffusion Model

LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement

Depth-Aware Blind Image Decomposition for Real-World Adverse Weather Recovery

Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution

AdaDiffSR: Adaptive Region-aware Dynamic acceleration Diffusion Model for Real-World Image Super-Resolution

Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration

Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow

DualDn: Dual-domain Denoising via Differentiable ISP

Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

Image Compression for Machine and Human Vision With Spatial-Frequency Adaptation

Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery

Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

Imaging with Confidence: Uncertainty Quantification for High-dimensional Undersampled MR Images

Energy-induced Explicit quantification for Multi-modality MRI fusion

WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model

Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields

Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification

Enhancing Tampered Text Detection through Frequency Feature Fusion and Decomposition

T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing

Personalized Privacy Protection Mask Against Unauthorized Facial Recognition

GRAPE: Generalizable and Robust Multi-view Facial Capture

Seeing Faces in Things: A Model and Dataset for Pareidolia

Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation

An Optimal Control View of LoRA and Binary Controller Design for Vision Transformers

OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers

DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

Upper-body Hierarchical Graph for Skeleton Based Emotion Recognition in Assistive Driving

SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders

Context-Aware Action Recognition: Introducing a Comprehensive Dataset for Behavior Contrast

Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition

Semi-Supervised Teacher-Reference-Student Architecture for Action Quality Assessment

Classification Matters: Improving Video Action Detection with Class-Specific Attention

HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization

Appearance-based Refinement for Object-Centric Motion Segmentation

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Fine-grained Dynamic Network for Generic Event Boundary Detection

Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model

Self-supervised visual learning from interactions with objects

Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning

Sequential Representation Learning via Static-Dynamic Conditional Disentanglement

Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval

Video Question Answering with Procedural Programs

ViLA: Efficient Video-Language Alignment for Video Question Answering

ST-LLM: Large Language Models Are Effective Temporal Learners

RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Nonverbal Interaction Detection

PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Human-in-the-Loop Visual Re-ID for Population Size Estimation

PreLAR: World Model Pre-training with Learnable Action Representation

Learning to Build by Building Your Own Instructions

Situated Instruction Following

Where am I? Scene Retrieval with Language

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

SegPoint: Segment Any Point Cloud via Large Language Model

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering

LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images

BLINK: Multimodal Large Language Models Can See but Not Perceive

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Teach CLIP to Develop a Number Sense for Ordinal Regression

Common Sense Reasoning for Deep Fake Detection

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Evaluating Text-to-Visual Generation with Image-to-Text Generation

DOCCI: Descriptions of Connected and Contrasting Images

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

Conceptual Codebook Learning for Vision-Language Models

Do Generalised Classifiers really work on Human Drawn Sketches?

3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery

Discovering Unwritten Visual Classifiers with Large Language Models

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction

OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

Rotary Position Embedding for Vision Transformer

Multi-branch Collaborative Learning Network for 3D Visual Grounding

SILC: Improving Vision Language Pretraining with Self-Distillation

LiteSAM is Actually what you Need for segment Everything

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings

SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

Click Prompt Learning with Optimal Transport for Interactive Segmentation

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Segment and Recognize Anything at Any Granularity

SOS: Segment Object System for Open-World Instance Segmentation With Object Priors

Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images

Phase Concentration and Shortcut Suppression for Weakly Supervised Semantic Segmentation

AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation

Weighting Pseudo-Labels via High-Activation Feature Index Similarity and Object Detection for Semi-Supervised Segmentation

SAM-guided Graph Cut for 3D Instance Segmentation

Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation

Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection

Shifted Autoencoders for Point Annotation Restoration in Object Counting

Learning Camouflaged Object Detection from Noisy Pseudo Label

Just a Hint: Point-Supervised Camouflaged Object Detection

Rectify the Regression Bias in Long-Tailed Object Detection

PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Visible and Clear: Finding Tiny Objects in Difference Map

IRGen: Generative Modeling for Image Retrieval

I-MedSAM: Implicit Medical Image Segmentation with Segment Anything

Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation

Norma: A Noise Robust Memory-Augmented Framework for Whole Slide Image Classification

GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes

BugNIST - a Large Volumetric Dataset for Detection under Domain Shift

AD3: Introducing a score for Anomaly Detection Dataset Difficulty assessment using VIADUCT dataset

GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

Unsupervised, Online and On-The-Fly Anomaly Detection For Non-Stationary Image Distributions

Cross-Domain Learning for Video Anomaly Detection with Limited Supervision

Attention Beats Linear for Fast Implicit Neural Representation Generation

OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks

ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders

AttnZero: Efficient Attention Discovery for Vision Transformers

Isomorphic Pruning for Vision Models

DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs

Robustness Tokens: Towards Adversarial Robustness of Transformers

Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration

Neural Spectral Decomposition for Dataset Distillation

Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models

Adaptive Multi-head Contrastive Learning

Unsqueeze [CLS] Bottleneck to Learn Rich Representations

Improving Zero-Shot Generalization for CLIP with Variational Adapter

Learning to Obstruct Few-Shot Image Classification over Restricted Classes

Improving Hyperbolic Representations via Gromov-Wasserstein Regularization

HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions

Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density

SCOD: From Heuristics to Theory

LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration

SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

Labeled Data Selection for Category Discovery

PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery

Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation

CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning

Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling

MagMax: Leveraging Model Merging for Seamless Continual Learning

Pick-a-back: Selective Device-to-Device Knowledge Transfer in Federated Continual Learning

Learning to Unlearn for Robust Machine Unlearning

UNIC: Universal Classification Models via Multi-teacher Distillation

Distributed Active Client Selection With Noisy Clients Using Model Association Scores

Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching

FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning

Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge

Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting

A high-quality robust diffusion framework for corrupted dataset

Similarity of Neural Architectures using Adversarial Attack Transferability

Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data

Resilience of Entropy Model in Distributed Neural Networks

WBP: Training-time Backdoor Attacks through Hardware-based Weight Bit Poisoning

Instant 3D Human Avatar Generation using Image Diffusion Models

(ends 6:30 PM)

Break:

Coffee Break

(ends 5:00 PM)

7:30 p.m.

Reception:

Conference Dinner Party (Ticketed Event)

(ends 11:59 PM)

FRI 4 OCT

8 a.m.

Registration

(ends 12:30 PM)

8:30 a.m.

Oral 7A: Learning Architectures, Transfer, Continual And Long-Tail [8:30-10:30]

Orals 8:30-10:10

[8:30] On the Topology Awareness and Generalization Performance of Graph Neural Networks

[8:40] Improving Knowledge Distillation via Regularizing Feature Direction and Norm

[8:50] Spline-based Transformers

[9:00] Anytime Continual Learning for Open Vocabulary Classification

[9:10] Weighted Ensemble Models Are Strong Continual Learners

[9:20] COD: Learning Conditional Invariant Representation for Domain Adaptation Regression

[9:30] Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning

[9:40] Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild

[9:50] Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data

[10:00] HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

(ends 10:30 AM)

Oral 7B: Adversarial Learning And Privacy [8:30-10:30]

Orals 8:30-10:00

[8:30] Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks

[8:40] Adversarial Robustification via Text-to-Image Diffusion Models

[8:50] Flatness-aware Sequential Learning Generates Resilient Backdoors

[9:00] A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks

[9:10] Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks

[9:20] R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model

[9:30] Privacy-Preserving Adaptive Re-Identification without Image Transfer

[9:40] Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

[9:50] Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models

(ends 10:30 AM)

Oral 7C: Optimization And Theory [8:30-10:30]

Orals 8:30-10:10

[8:30] A Direct Approach to Viewing Graph Solvability

[8:40] Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees

[8:50] Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering

[9:00] A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures

[9:10] Physics-Based Interaction with 3D Objects via Video Generation

[9:20] Shape from Heat Conduction

[9:30] Rasterized Edge Gradients: Handling Discontinuities Differentially

[9:40] ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

[9:50] Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

[10:00] Model Stock: All we need is just a few fine-tuned models

(ends 10:30 AM)

10:30 a.m.

Poster Session 7 [10:30-12:30]

Posters 10:30-12:30

Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models

Flatness-aware Sequential Learning Generates Resilient Backdoors

Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks

Adversarial Robustification via Text-to-Image Diffusion Models

Privacy-Preserving Adaptive Re-Identification without Image Transfer

R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks

Spline-based Transformers

Anytime Continual Learning for Open Vocabulary Classification

Weighted Ensemble Models Are Strong Continual Learners

COD: Learning Conditional Invariant Representation for Domain Adaptation Regression

On the Topology Awareness and Generalization Performance of Graph Neural Networks

Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning

Model Stock: All we need is just a few fine-tuned models

A Direct Approach to Viewing Graph Solvability

ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures

Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering

Shape from Heat Conduction

Rasterized Edge Gradients: Handling Discontinuities Differentially

Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis

Loc3Diff: Local Diffusion for 3D Human Head Synthesis and Editing

PAV: Personalized Head Avatar from Unstructured Video Collection

Expressive Whole-Body 3D Gaussian Avatar

High-Quality Mesh Blendshape Generation from Face Videos via Neural Inverse Rendering

Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement

Image Demoireing in RAW and sRGB Domains

Multiscale Sliced Wasserstein Distances as Perceptual Color Difference Measures

Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy

Single-Mask Inpainting for Voxel-based Neural Radiance Fields

IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination

DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly

NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis

CaesarNeRF: Calibrated Semantic Representation for Few-Shot Generalizable Neural Rendering

2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction

Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction

Deep Polarization Cues for Single-shot Shape and Subsurface Scattering Estimation

High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs

Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction

MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views

Dual-Camera Smooth Zoom on Mobile Phones

6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model

SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing

Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians

CompGS: Smaller and Faster Gaussian Splatting with Vector Quantization

Segmentation-guided Layer-wise Image Vectorization with Gradient Fills

EpipolarGAN: Omnidirectional Image Synthesis with Explicit Camera Control

SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation

GenRC: Generative 3D Room Completion from Sparse Image Collections

Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval

Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees

DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing

Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting

GVGEN: Text-to-3D Generation with Volumetric Representation

VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation

DreamReward: Aligning Human Preference in Text-to-3D Generation

SemanticHuman-HD: High Resolution Semantic disentangled 3D Human Generation

Disentangled Clothed Avatar Generation from Text Descriptions

StructLDM: Structured Latent Diffusion for 3D Human Generation

High-Fidelity Modeling of Generalizable Wrinkle Deformation

ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild

Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos

Physics-Based Interaction with 3D Objects via Video Generation

Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder

Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors

Self-supervised Shape Completion via Involution and Implicit Correspondences

Self-Training Room Layout via Geometry-aware Ray-casting

DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting

GaussReg: Fast 3D Registration with Gaussian Splatting

AEDNet: Adaptive Embedding and Multiview-Aware Disentanglement for Point Cloud Completion

PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration

DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding

ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention

SFPNet: Sparse Focal Point Network for Semantic Segmentation on General LiDAR Point Clouds

MAD-DR: Map Compression for Visual Localization with Matchness Aware Descriptor Dimension Reduction

Tensorial template matching for fast cross-correlation with rotations and its application for tomography

Flowed Time of Flight Radiance Fields

Zero-Shot Image Feature Consensus with Deep Functional Maps

RSL-BA: Rolling Shutter Line Bundle Adjustment

How Far Can a 1-Pixel Camera Go? Solving Vision Tasks using Photoreceptors and Computationally Designed Visual Morphology

Hyperion – A fast, versatile symbolic Gaussian Belief Propagation framework for Continuous-Time SLAM

Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps

iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning

PACE: Pose Annotations in Cluttered Environments

Global-to-Pixel Regression for Human Mesh Recovery

3D Hand Pose Estimation in Everyday Egocentric Images

Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale

Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images

Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation

CliffPhys: Camera-based Respiratory Measurement using Clifford Neural Networks

Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation

Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation

Deep Patch Visual SLAM

ConGeo: Robust Cross-view Geo-localization across Ground View Variations

GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers

SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection

Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression

Image-to-Lidar Relational Distillation for Autonomous Driving Data

Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene

milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing

Hetecooper: Feature Collaboration Graph for Heterogeneous Collaborative Perception

LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping

Probabilistic Image-Driven Traffic Modeling via Remote Sensing

Occupancy as Set of Points

Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation

Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction

Online Vectorized HD Map Construction using Geometry

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving

PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving

Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation

Learning to Drive via Asymmetric Self-Play

Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

I Can't Believe It's Not Scene Flow!

Motion and Structure from Event-based Normal Flow

Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection

Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation

UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation

IAM-VFI: Interpolate Any Motion for Video Frame Interpolation with motion complexity map

Human Motion Forecasting in Dynamic Domain Shifts: A Homeostatic Continual Test-time Adaptation Framework

DIM: Dyadic Interaction Modeling for Social Behavior Generation

Length-Aware Motion Synthesis via Latent Diffusion

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis

Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Explorative Inbetweening of Time and Space

TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models

WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

Pix2Gif: Motion-Guided Diffusion for GIF Generation

Factorizing Text-to-Video Generation by Explicit Image Conditioning

DNI: Dilutional Noise Initialization for Diffusion Video Editing

DATENeRF: Depth-Aware Text-based Editing of NeRFs

FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models

Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

Using My Artistic Style? You Must Obtain My Authorization

Learned Image Enhancement via Color Naming

Region-Native Visual Tokenization

Improving image synthesis with diffusion-negative sampling

ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images

SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

Visual Text Generation in the Wild

ReCON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories

Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation

TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models

Navigating Text-to-Image Generative Bias across Indic Languages

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion

LCM-Lookahead for Encoder-based Text-to-Image Personalization

Robust-Wide: Robust Watermarking against Instruction-driven Image Editing

COIN-Matting: Confounder Intervention for Image Matting

Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images

ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion

Data Augmentation via Latent Diffusion for Saliency Prediction

Score Distillation Sampling with Learned Manifold Corrective

Thinking Outside the BBox: Unconstrained Generative Object Compositing

Learning Quantized Adaptive Conditions for Diffusion Models

FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Lossy Image Compression with Foundation Diffusion Models

AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion

QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images

MetaWeather: Few-Shot Weather-Degraded Image Restoration

Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization

Spatially-Variant Degradation Model for Dataset-free Super-resolution

Towards Architecture-Agnostic Untrained Networks Priors for Image Reconstruction with Frequency Regularization

Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

Contourlet Residual for Prompt Learning Enhanced Infrared Image Super-Resolution

Image-adaptive 3D Lookup Tables for Real-time Image Enhancement with Bilateral Grids

Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context

denoiSplit: a method for joint microscopy image splitting and unsupervised denoising

Region-Aware Sequence-to-Sequence Learning for Hyperspectral Denoising

CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems

Plug-and-Play Learned Proximal Trajectory for 3D Sparse-View X-Ray Computed Tomography

Unsupervised Multi-modal Medical Image Registration via Invertible Translation

Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations

ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization

Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model

Finding a needle in a haystack: A Black-Box Approach to Invisible Watermark Detection

CriSp: Leveraging Tread Depth Maps for Enhanced Crime-Scene Shoeprint Matching

Noise-assisted Prompt Learning for Image Forgery Detection and Localization

TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing

Towards Certifiably Robust Face Recognition

Oulu Remote-photoplethysmography Physical Domain Attacks Database (ORPDAD)

Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement

Affine steerers for structured keypoint description

A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation

You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception

TAPTR: Tracking Any Point with Transformers as Detection

SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow

Towards Physical World Backdoor Attacks against Skeleton Action Recognition

MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

Two-Stage Active Learning for Efficient Temporal Action Segmentation

MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos

PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation

VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-temporal Side Network

PALM: Predicting Actions through Language Models

ZeroI2V: Zero-Cost Adaptation of Pre-Trained Transformers from Image to Video

Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data

VideoMamba: Spatio-Temporal Selective State Space Model

Text-Guided Video Masked Autoencoder

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

VISA: Reasoning Video Object Segmentation via Large Language Model

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark

Audio-visual Generalized Zero-shot Learning the Easy Way

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

SignGen: End-to-End Sign Language Video Generation with Latent Diffusion

TrajPrompt: Aligning Color Trajectory with Vision-Language Representations

Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification

OmniSat: Self-Supervised Modality Fusion for Earth Observation

Statewide Visual Geolocalization in the Wild

Pre-trained Visual Dynamics Representations for Efficient Policy Learning

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts

An Explainable Vision Question Answer Model via Diffusion Chain-of-Thought

Fully Authentic Visual Question Answering Dataset from Online Communities

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning

BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs

TrojVLM: Backdoor Attack Against Vision Language Models

Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks

Attention Prompting on Image for Large Vision-Language Models

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval

Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits

Prompting Language-Informed Distribution for Compositional Zero-Shot Learning

Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following

FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance

Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild

Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection

T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation

OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection

O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension

GTMS: A Gradient-driven Tree-guided Mask-free Referring Image Segmentation Method

MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders

ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

Think before Placement: Common Sense Enhanced Transformer for Object Placement

Eliminating Feature Ambiguity for Few-Shot Segmentation

Diffusion-Guided Weakly Supervised Semantic Segmentation

Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs

Better Call SAL: Towards Learning to Segment Anything in Lidar

MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation

Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation

Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models

ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition

Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts

EcoMatcher: Efficient Clustering Oriented Matcher for Detector-free Image Matching

Class-Agnostic Object Counting with Text-to-Image Diffusion Model

Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector

Co-Student: Collaborating Strong and Weak Students for Sparsely Annotated Object Detection

Plain-Det: A Plain Multi-Dataset Object Detector

Multi-scale Cross Distillation for Object Detection in Aerial Images

PDT: Uav Target Detection Dataset for Pests and Diseases Tree

Region-Adaptive Transform with Segmentation Prior for Image Compression

FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification

CC-SAM: Enhancing SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation

Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model

DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification

Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network

An Incremental Unified Framework for Small Defect Inspection

Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection

GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features

MoEAD: A Parameter-efficient Model for Multi-class Anomaly Detection

PQ-SAM: Post-training Quantization for Segment Anything Model

BKDSNN: Enhancing the Performance of Learning-based Spiking Neural Networks Training with Blurred Knowledge Distillation

ELSE: Efficient Deep Neural Network Inference through Line-based Sparsity Exploration

FairViT: Fair Vision Transformer via Adaptive Masking

LPViT: Low-Power Semi-structured Pruning for Vision Transformers

PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference

CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs

Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Characterizing Model Robustness via Natural Input Gradients

Dropout Mixture Low-Rank Adaptation for Visual Parameters-Efficient Fine-Tuning

FreeAugment: Data Augmentation Search Across All Degrees of Freedom

Towards Multi-modal Transformers in Federated Learning

Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception

GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

Soft Prompt Generation for Domain Generalization

SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Deep Online Probability Aggregation Clustering

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

An accurate detection is not all you need to combat label noise in web-noisy datasets

Flexible Distribution Alignment: Towards Long-tailed Semi-supervised Learning with Proper Calibration

ExMatch: Self-guided Exploitation for Semi-Supervised Learning with Scarce Labeled Samples

Dynamic Data Selection for Efficient SSL via Coarse-to-Fine Refinement

SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery

Dynamic Retraining-Updating Mean Teacher for Source-Free Object Detection

Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization

Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence

On the Approximation Risk of Few-Shot Class-Incremental Learning

STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay

RCS-Prompt: Learning Prompt to Rearrange Class Space for Prompt-based Continual Learning

CLEO: Continual Learning of Evolving Ontologies

Learning Representation for Multitask Learning through Self-Supervised Auxiliary Learning

Improving Knowledge Distillation via Regularizing Feature Direction and Norm

MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution

Federated Learning with Local Openset Noisy Labels

Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents

FedHARM: Harmonizing Model Architectural Diversity in Federated Learning

Causal Subgraphs and Information Bottlenecks: Redefining OOD Robustness in Graph Neural Networks

Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks

Shedding More Light on Robust Classifiers under the lens of Energy-based Models

Inter-Class Topology Alignment for Efficient Black-Box Substitute Attacks

AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models

FedHide: Federated Learning by Hiding in the Neighbors

SIMBA: Split Inference - Mechanisms, Benchmarks and Attacks

Data Poisoning Quantization Backdoor Attack

Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics

Generalizing to Unseen Domains via Text-guided Augmentation

Event Trojan: Asynchronous Event-based Backdoor Attacks

(ends 12:30 PM)

Break:

Coffee Break

(ends 11:00 AM)