Timezone: America/Los_Angeles
SUN 29 SEP
midnight
5 workshops and 1 tutorial (end 4:00 AM)
5 a.m.
1 workshop and 1 tutorial (end 9:00 AM)

MON 30 SEP
midnight
6 workshops (end 4:00 AM)
5 a.m.
9 workshops (end 9:00 AM)

TUE 1 OCT
midnight
Orals 12:00-1:20
[12:00] Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection
[12:10] Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging
[12:20] SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow
[12:30] Photon Inhibition for Energy-Efficient Single-Photon Imaging
[12:40] Minimalist Vision with Freeform Pixels
[12:50] Flying with Photons: Rendering Novel Views of Propagating Light
[1:00] A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging
[1:10] GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
(ends 1:30 AM)
Orals 12:00-1:20
[12:00] Towards Scene Graph Anticipation
[12:10] OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation
[12:20] PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
[12:30] Bi-directional Contextual Attention for 3D Dense Captioning
[12:40] OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
[12:50] ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting
[1:00] A Fair Ranking and New Model for Panoptic Scene Graph Generation
[1:10] Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
(ends 1:30 AM)
Orals 12:00-1:20
[12:00] Making Large Language Models Better Planners with Reasoning-Decision Alignment
[12:10] MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping
[12:20] M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation
[12:30] H-V2X: A Large Scale Highway Dataset for BEV Perception
[12:40] Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction
[12:50] DriveLM: Driving with Graph Visual Question Answering
[1:00] RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios
[1:10] Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks
(ends 1:30 AM)
1:30 a.m.
Posters 1:30-3:30
(ends 3:30 AM)
4:30 a.m.
Orals 4:30-6:20
[4:30] EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
[4:40] TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
[4:50] LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
[5:00] FlashTex: Fast Relightable Mesh Texturing with LightControlNet
[5:10] TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
[5:20] LLMGA: Multimodal Large Language Model based Generation Assistant
[5:30] Accelerating Image Generation with Sub-path Linear Approximation Model
[5:40] SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation
[5:50] Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture
[6:00] Zero-Shot Detection of AI-Generated Images
[6:10] Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
(ends 6:30 AM)
Orals 4:30-6:20
[4:30] Efficient Bias Mitigation Without Privileged Information
[4:40] Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation
[4:50] MobileNetV4: Universal Models for the Mobile Ecosystem
[5:00] Momentum Auxiliary Network for Supervised Local Learning
[5:10] From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition
[5:20] Dataset Enhancement with Instance-Level Augmentations
[5:30] Adaptive Parametric Activation
[5:40] Relation DETR: Exploring Explicit Position Relation Prior for Object Detection
[5:50] Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation
[6:00] CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
[6:10] On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines
(ends 6:30 AM)
Orals 4:30-6:20
[4:30] Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition
[4:40] COMO: Compact Mapping and Odometry
[4:50] Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss
[5:00] ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation
[5:10] SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments
[5:20] Six-Point Method for Multi-Camera Systems with Reduced Solution Space
[5:30] Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
[5:40] Grounding Image Matching in 3D with MASt3R
[5:50] ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images
[6:00] Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection
[6:10] Camera Calibration using a Collimator System
(ends 6:30 AM)
6:30 a.m.
Keynote:
Lourdes Agapito · Vittorio Ferrari
(ends 7:30 AM)
7:30 a.m.
Posters 7:30-9:30
(ends 9:30 AM)

WED 2 OCT
midnight
Orals 12:00-1:20
[12:00] PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
[12:10] Self-Supervised Video Desmoking for Laparoscopic Surgery
[12:20] CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos
[12:30] Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction
[12:40] Adaptive Correspondence Scoring for Unsupervised Medical Image Registration
[12:50] Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View
[1:00] SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images
[1:10] Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
(ends 1:30 AM)
Orals 12:00-1:20
[12:00] HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation
[12:10] PointLLM: Empowering Large Language Models to Understand Point Clouds
[12:20] RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation
[12:30] DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment
[12:40] KeypointDETR: An End-to-End 3D Keypoint Detector
[12:50] Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather
[1:00] RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation
[1:10] Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration
(ends 1:30 AM)
Orals 12:00-1:20
[12:00] PetFace: A Large-Scale Dataset and Benchmark for Animal Identification
[12:10] UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
[12:20] Towards Model-Agnostic Dataset Condensation by Heterogeneous Models
[12:30] Parrot Captions Teach CLIP to Spot Text
[12:40] Towards Open-ended Visual Quality Comparison
[12:50] VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking
[1:00] Insect Identification in the Wild: The AMI Dataset
[1:10] MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description
(ends 1:30 AM)
1:30 a.m.
Posters 1:30-3:30
(ends 3:30 AM)
4:30 a.m.
Orals 4:30-6:20
[4:30] AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
[4:40] Sapiens: Foundation for Human Vision Models
[4:50] POET: Prompt Offset Tuning for Continual Human Action Adaptation
[5:00] Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation
[5:10] SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
[5:20] UGG: Unified Generative Grasping
[5:30] NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
[5:40] Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
[5:50] LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment
[6:00] Controllable Human-Object Interaction Synthesis
[6:10] NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction
(ends 6:30 AM)
Orals 4:30-6:20
[4:30] Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
[4:40] Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
[4:50] Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration
[5:00] FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information
[5:10] RaFE: Generative Radiance Fields Restoration
[5:20] Watch Your Steps: Local Image and Scene Editing by Text Instructions
[5:30] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
[5:40] RPBG: Towards Robust Neural Point-based Graphics in the Wild
[5:50] Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
[6:00] Learning 3D-aware GANs from Unposed Images with Template Feature Field
[6:10] MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition
(ends 6:30 AM)
Orals 4:30-6:20
[4:30] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
[4:40] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
[4:50] Efficient Neural Video Representation with Temporally Coherent Modulation
[5:00] Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
[5:10] Video Editing via Factorized Diffusion Distillation
[5:20] ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
[5:30] Audio-Synchronized Visual Animation
[5:40] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
[5:50] MotionDirector: Motion Customization of Text-to-Video Diffusion Models
[6:00] ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
[6:10] Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction
(ends 6:30 AM)
7:30 a.m.
Posters 7:30-9:30
(ends 9:30 AM)
8 a.m.
2 demonstrations (end 12:00 PM)

THU 3 OCT
midnight
Orals 12:00-1:20
[12:00] WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
[12:10] AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation
[12:20] CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model
[12:30] Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
[12:40] Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels
[12:50] ActionVOS: Actions as Prompts for Video Object Segmentation
[1:00] Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
[1:10] Diffusion Models for Open-Vocabulary Segmentation
(ends 1:30 AM)
Orals 12:00-1:20
[12:00] PiTe: Pixel-Temporal Alignment for Large Video-Language Model
[12:10] Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization
[12:20] Emergent Visual-Semantic Hierarchies in Image-Text Representations
[12:30] Learning Multimodal Latent Generative Models with Energy-Based Prior
[12:40] Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
[12:50] SINDER: Repairing the Singular Defects of DINOv2
[1:00] Denoising Vision Transformers
[1:10] Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking
(ends 1:30 AM)
Orals 12:00-1:20
[12:00] Robust Fitting on a Gate Quantum Computer
[12:10] Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views
[12:20] Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance
[12:30] MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
[12:40] Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception
[12:50] Faceptor: A Generalist Model for Face Perception
[1:00] A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability
[1:10] COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation
(ends 1:30 AM)
1:30 a.m.
Posters 1:30-3:30
(ends 3:30 AM)
4:30 a.m.
Orals 4:30-6:20
[4:30] Controlling the World by Sleight of Hand
[4:40] Pyramid Diffusion for Fine 3D Large Scene Generation
[4:50] FMBoost: Boosting Latent Diffusion with Flow Matching
[5:00] ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction
[5:10] Exact Diffusion Inversion via Bidirectional Integration Approximation
[5:20] Tackling Structural Hallucination in Image Translation with Local Diffusion
[5:30] Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems
[5:40] Adversarial Diffusion Distillation
[5:50] Arc2Face: A Foundation Model for ID-Consistent Human Faces
[6:00] Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning
[6:10] OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model
(ends 6:30 AM)
Orals 4:30-6:20
[4:30] GiT: Towards Generalist Vision Transformer through Universal Language Interface
[4:40] Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
[4:50] Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
[5:00] MMBENCH: Is Your Multi-Modal Model an All-around Player?
[5:10] Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
[5:20] Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation
[5:30] A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars
[5:40] HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts
[5:50] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
[6:00] uCAP: An Unsupervised Prompting Method for Vision-Language Models
[6:10] BRAVE: Broadening the visual encoding of vision-language models
(ends 6:30 AM)
Orals 4:30-6:20
[4:30] E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation
[4:40] Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos
[4:50] Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
[5:00] MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment
[5:10] C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition
[5:20] LongVLM: Efficient Long Video Understanding via Large Language Models
[5:30] Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
[5:40] Towards Neuro-Symbolic Video Understanding
[5:50] Classification Matters: Improving Video Action Detection with Class-Specific Attention
[6:00] DEVIAS: Learning Disentangled Video Representations of Action and Scene
[6:10] Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
(ends 6:30 AM)
6:30 a.m.
Keynote:
Sanmi Koyejo
(ends 7:30 AM)
7:30 a.m.
Posters 7:30-9:30
(ends 9:30 AM)
11:30 p.m.
Orals 11:30-1:10
[11:30] A Direct Approach to Viewing Graph Solvability
[11:40] Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees
[11:50] Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering
[12:00] A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
[12:10] Physics-Based Interaction with 3D Objects via Video Generation
[12:20] Shape from Heat Conduction
[12:30] Rasterized Edge Gradients: Handling Discontinuities Differentially
[12:40] ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
[12:50] Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
[1:00] Model Stock: All we need is just a few fine-tuned models
(ends 1:30 AM)
Orals 11:30-1:00
[11:30] Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks
[11:40] Adversarial Robustification via Text-to-Image Diffusion Models
[11:50] Flatness-aware Sequential Learning Generates Resilient Backdoors
[12:00] A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks
[12:10] Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks
[12:20] R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
[12:30] Privacy-Preserving Adaptive Re-Identification without Image Transfer
[12:40] Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
[12:50] Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
(ends 1:30 AM)
Orals 11:30-1:10
[11:30] On the Topology Awareness and Generalization Performance of Graph Neural Networks
[11:40] Improving Knowledge Distillation via Regularizing Feature Direction and Norm
[11:50] Spline-based Transformers
[12:00] Anytime Continual Learning for Open Vocabulary Classification
[12:10] Weighted Ensemble Models Are Strong Continual Learners
[12:20] COD: Learning Conditional Invariant Representation for Domain Adaptation Regression
[12:30] Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning
[12:40] Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild
[12:50] Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
[1:00] HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
(ends 1:30 AM)

FRI 4 OCT
1:30 a.m.
Posters 1:30-3:30
(ends 3:30 AM)