ECCV 2024 Papers

4D Contrastive Superflows are Dense 3D Representation Learners

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation

Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance

Modeling and Driving Human Body Soundfields through Acoustic Primitives

Motion Mamba: Efficient and Long Sequence Motion Generation

Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation

SAGS: Structure-Aware 3D Gaussian Splatting

MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views

Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs

Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models

Disentangling Masked Autoencoders for Unsupervised Domain Generalization

SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition

MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

BRAVE: Broadening the visual encoding of vision-language models

Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation

SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction

CPT-VR: Improving Surface Rendering via Closest Point Transform with View-Reflection Appearance

OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations

MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation

High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs

AFreeCA: Annotation-Free Counting for All

Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap

LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

Motion and Structure from Event-based Normal Flow

Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion

DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching

When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset

HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception

Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation

GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition

LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Merlin: Empowering Multimodal LLMs with Foresight Minds

E.T. the Exceptional Trajectory: Text-to-camera-trajectory generation with character awareness

Nuvo: Neural UV Mapping for Unruly 3D Representations

Towards Neuro-Symbolic Video Understanding

SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs

Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Diffusion Bridges for 3D Point Cloud Denoising

AttnZero: Efficient Attention Discovery for Vision Transformers

Auto-GAS: Automated Proxy Discovery for Training-free Generative Architecture Search

Auto-DAS: Automated Proxy Discovery for Training-free Distillation-aware Architecture Search

Spectral Subsurface Scattering for Material Classification

HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

CarFormer: Self-Driving with Learned Object-Centric Representations

Text-Guided Video Masked Autoencoder

PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting

Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

EvSign: Sign Language Recognition and Translation with Streaming Events

MetaAug: Meta-Data Augmentation for Post-Training Quantization

QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

UNIKD: UNcertainty-Filtered Incremental Knowledge Distillation for Neural Implicit Representation

PartSTAD: 2D-to-3D Part Segmentation Task Adaptation

FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

Cross-Input Certified Training for Universal Perturbations

Rethinking and Improving Visual Prompt Selection for In-Context Learning Segmentation Framework

LiDAR-Event Stereo Fusion with Hallucinations

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation

Revisiting Supervision for Continual Representation Learning

Dolphins: Multimodal Language Model for Driving

MMBENCH: Is Your Multi-Modal Model an All-around Player?

HUMOS: Human Motion Model Conditioned on Body Shape

ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs

Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds

Unsupervised Exposure Correction

SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

External Knowledge Enhanced 3D Scene Generation from Sketch

GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

Frequency-Spatial Entanglement Learning for Camouflaged Object Detection

3D Congealing: 3D-Aware Image Alignment in the Wild

Adversarial Robustification via Text-to-Image Diffusion Models

CoMo: Controllable Motion Generation through Language Guided Pose Code Editing

MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Semi-Supervised Teacher-Reference-Student Architecture for Action Quality Assessment

VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Occluded Gait Recognition with Mixture of Experts: An Action Detection Perspective

Benchmarking the Robustness of Cross-view Geo-localization Models

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Model Stock: All we need is just a few fine-tuned models

Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis

Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision

Formula-Supervised Visual-Geometric Pre-training

MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment

DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding

Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection

SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow

TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing

Robust Fitting on a Gate Quantum Computer

Defect Spectrum: A Granular Look of Large-scale Defect Datasets with Rich Semantics

Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement

Large-scale Reinforcement Learning for Diffusion Models

RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation

3D Single-object Tracking in Point Clouds with High Temporal Variation

Self-supervised Shape Completion via Involution and Implicit Correspondences

Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization

Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems

Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion

iHuman: Instant Animatable Digital Humans From Monocular Videos

LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers

HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression

Energy-induced Explicit quantification for Multi-modality MRI fusion

Characterizing Model Robustness via Natural Input Gradients

ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement

GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation

FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis

SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

Resolving Scale Ambiguity in Multi-view 3D Reconstruction using Dual-Pixel Sensors

FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

BugNIST - a Large Volumetric Dataset for Detection under Domain Shift

Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training

ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities

See and Think: Embodied Agent in Virtual Environment

Scalar Function Topology Divergence: Comparing Topology of 3D Objects

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

Towards Robust Full Low-bit Quantization of Super Resolution Networks

When Do We Not Need Larger Vision Models?

Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets

GVGEN: Text-to-3D Generation with Volumetric Representation

Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields

UNIC: Universal Classification Models via Multi-teacher Distillation

MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References

ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild

LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

PointNeRF++: A multi-scale, point-based Neural Radiance Field

Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Differentiable Convex Polyhedra Optimization from Multi-view Images

WHAC: World-grounded Humans and Cameras

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

V-IRL: Grounding Virtual Intelligence in Real Life

SENC: Handling Self-collision in Neural Cloth Simulation

TrojVLM: Backdoor Attack Against Vision Language Models

Dataset Growth

m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos

ReMamber: Referring Image Segmentation with Mamba Twister

Plain-Det: A Plain Multi-Dataset Object Detector

Pix2Gif: Motion-Guided Diffusion for GIF Generation

OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models

Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization

Plug-and-Play Learned Proximal Trajectory for 3D Sparse-View X-Ray Computed Tomography

LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation

Beta-Tuned Timestep Diffusion Model

Bayesian Evidential Deep Learning for Online Action Detection

Local All-Pair Correspondence for Point Tracking

Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations

SEED: A Simple and Effective 3D DETR in Point Clouds

Intrinsic Single-Image HDR Reconstruction

DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution

Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification

LaRa: Efficient Large-Baseline Radiance Fields

XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution

MobileNetV4: Universal Models for the Mobile Ecosystem

Efficient Snapshot Spectral Imaging: Calibration-Free Parallel Structure with Aperture Diffraction Fusion

AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation

MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection

Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather

DiffiT: Diffusion Vision Transformers for Image Generation

Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

Prioritized Semantic Learning for Zero-shot Instance Navigation

Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats

Can OOD Object Detectors Learn from Foundation Models?

2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction

RadEdit: stress-testing biomedical vision models via diffusion image editing

Towards Real-world Event-guided Low-light Video Enhancement and Deblurring

Referring Atomic Video Action Recognition

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks

SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving

MyVLM: Personalizing VLMs for User-Specific Queries

SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

AMEGO: Active Memory from long EGOcentric videos

Camera-LiDAR Cross-modality Gait Recognition

Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction

Adaptive Correspondence Scoring for Unsupervised Medical Image Registration

VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

An Adaptive Screen-Space Meshing Approach for Normal Integration

Collaborative Control for Geometry-Conditioned PBR Image Generation

Open-set Domain Adaptation via Joint Error based Multi-class Positive and Unlabeled Learning

Quantization-Friendly Winograd Transformations for Convolutional Neural Networks

Look Around and Learn: Self-Training Object Detection by Exploration

Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model

Regularizing Dynamic Radiance Fields with Kinematic Fields

SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Risk-Aware Self-Consistent Imitation Learning for Trajectory Planning in Autonomous Driving

Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss

DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators

Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks

Large Motion Model for Unified Multi-Modal Motion Generation

Memory-Efficient Fine-Tuning for Quantized Diffusion Model

WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians

Label-anticipated Event Disentanglement for Audio-Visual Video Parsing

Unified Local-Cloud Decision-Making via Reinforcement Learning

Think before Placement: Common Sense Enhanced Transformer for Object Placement

The Hard Positive Truth about Vision-Language Compositionality

Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing

GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing

Concise Plane Arrangements for Low-Poly Surface and Volume Modelling

Prompting Language-Informed Distribution for Compositional Zero-Shot Learning

3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting

Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation

AEDNet: Adaptive Embedding and Multiview-Aware Disentanglement for Point Cloud Completion

Wavelength-Embedding-guided Filter-Array Transformer for Spectral Demosaicing

GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views

Efficient Bias Mitigation Without Privileged Information

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

Towards Open-Ended Visual Recognition with Large Language Models

Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model

IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception

On the Utility of 3D Hand Poses for Action Recognition

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

IRGen: Generative Modeling for Image Retrieval

A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow

VISA: Reasoning Video Object Segmentation via Large Language Model

Learning Representations of Satellite Images From Metadata Supervision

Adaptive Parametric Activation

Scaling Backwards: Minimal Synthetic Pre-training?

Learned Neural Physics Simulation for Articulated 3D Human Pose Reconstruction

Towards Multi-modal Transformers in Federated Learning

Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics

InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser

ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning

FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information

General and Task-Oriented Video Segmentation

Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation

Benchmarking Object Detectors with COCO: A New Path Forward

Diffusion Model is a Good Pose Estimator from 3D RF-Vision

UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues

Grounding Language Models for Visual Entity Recognition

Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy

Learning 3D-aware GANs from Unposed Images with Template Feature Field

Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

DεpS: Delayed ε-Shrinking for Faster Once-For-All Training

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

Human Hair Reconstruction with Strand-Aligned 3D Gaussians

SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders

Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection

Global-to-Pixel Regression for Human Mesh Recovery

CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation

Rethinking Image Super Resolution from Training Data Perspectives

MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

Interactive 3D Object Detection with Prompts

Learning to Robustly Reconstruct Dynamic Scenes from Low-light Spike Streams

Neural Volumetric World Models for Autonomous Driving

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation

ControlLLM: Augment Language Models with Tools by Searching on Graphs

Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration

Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation

Uni3DL: A Unified Model for 3D Vision-Language Understanding

G3R: Gradient Guided Generalizable Reconstruction

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning

HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization

Invertible Neural Warp for NeRF

AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale

Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

Language-Image Pre-training with Long Captions

SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference

CoReS: Orchestrating the Dance of Reasoning and Segmentation

MambaIR: A Simple Baseline for Image Restoration with State-Space Model

EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models

I Can't Believe It's Not Scene Flow!

Compress3D: a Compressed Latent Space for 3D Generation from a Single Image

Bi-directional Contextual Attention for 3D Dense Captioning

Scalable Group Choreography via Variational Phase Manifold Learning

Quality Assured: Rethinking Annotation Strategies in Imaging AI

Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels

TPA3D: Triplane Attention for Fast Text-to-3D Generation

Augmented Neural Fine-tuning for Efficient Backdoor Purification

Human Pose Recognition via Occlusion-Preserving Abstract Images

AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling

Retrieval Robust to Object Motion Blur

Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction

Occlusion-Aware Seamless Segmentation

TTT-MIM: Test-Time Training with Masked Image Modeling for Denoising Distribution Shifts

Diffusion Models for Open-Vocabulary Segmentation

Rethinking Unsupervised Outlier Detection via Multiple Thresholding

OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection

Stream Query Denoising for Vectorized HD-Map Construction

Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering

Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

SkyMask: Attack-agnostic Robust Federated Learning with Fine-grained Learnable Masks

PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting

WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation

Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Quantized Prompt for Efficient Generalization of Vision-Language Models

Modality Translation for Object Detection Adaptation without forgetting prior knowledge

How Video Meetings Change Your Expression

Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Learning to Obstruct Few-Shot Image Classification over Restricted Classes

Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation

L-DiffER: Single Image Reflection Removal with Language-based Diffusion Model

DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation

Distilling Diffusion Models into Conditional GANs

UMBRAE: Unified Multimodal Brain Decoding

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

Multiscale Graph Texture Network

LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping

Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing

Blind image deblurring with noise-robust kernel estimation

Free-Viewpoint Video of Outdoor Sports Using a Drone

RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency

Binomial Self-compensation for Motion Error in Dynamic 3D Scanning

Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation

Momentum Auxiliary Network for Supervised Local Learning

HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion

Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation

Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains

PQ-SAM: Post-training Quantization for Segment Anything Model

COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation

Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation

TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

Improving Zero-Shot Generalization for CLIP with Variational Adapter

LaWa: Using Latent Space for In-Generation Image Watermarking

Topology-Preserving Downsampling of Binary Images

Cocktail Universal Adversarial Attack on Deep Neural Networks

Hypernetworks for Generalizable BRDF Representation

ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders

Classification Matters: Improving Video Action Detection with Class-Specific Attention

Improving Medical Multi-modal Contrastive Learning with Expert Annotations

Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization

AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

Leveraging temporal contextualization for video action recognition

AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale

ZigMa: A DiT-style Zigzag Mamba Diffusion Model

Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time

Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries

MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

Data Collection-free Masked Video Modeling

Resilience of Entropy Model in Distributed Neural Networks

Implicit Concept Removal of Diffusion Models

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Restoring Images in Adverse Weather Conditions via Histogram Transformer

PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis

G2fR: Frequency Regularization in Grid-based Feature Encoding Neural Radiance Fields

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Generating 3D House Wireframes with Semantics

SegPoint: Segment Any Point Cloud via Large Language Model

Navigation Instruction Generation with BEV Perception and Large Language Models

The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally

Eliminating Feature Ambiguity for Few-Shot Segmentation

Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation

GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator

BLINK: Multimodal Large Language Models Can See but Not Perceive

PreLAR: World Model Pre-training with Learnable Action Representation

Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions

FreestyleRet: Retrieving Images from Style-Diversified Queries

Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

ReGround: Improving Textual and Spatial Grounding at No Cost

CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos

Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting

Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation

Image Demoireing in RAW and sRGB Domains

Reliability in Semantic Segmentation: Can We Use Synthetic Data?

Prompting Future Driven Diffusion Model for Hand Motion Prediction

Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning

3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views

Lazy Diffusion Transformer for Interactive Image Editing

Robust Calibration of Large Vision-Language Adapters

Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation

Improving Domain Generalization in Self-Supervised Monocular Depth Estimation via Stabilized Adversarial Training

AugDETR: Improving Multi-scale Learning for Detection Transformer

Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

SIGMA: Sinkhorn-Guided Masked Video Modeling

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams

Understanding Physical Dynamics with Counterfactual World Modeling

SemTrack: A Large-scale Dataset for Semantic Tracking in the Wild

VideoMamba: Spatio-Temporal Selective State Space Model

Text to Layer-wise 3D Clothed Human Generation

Fully Sparse 3D Occupancy Prediction

CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field

High-Fidelity 3D Textured Shapes Generation by Sparse Encoding and Adversarial Decoding

PointLLM: Empowering Large Language Models to Understand Point Clouds

Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation

Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis

AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation

Spatially-Variant Degradation Model for Dataset-free Super-resolution

Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence

SUMix: Mixup with Semantic and Uncertain Information

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

EAFormer: Scene Text Segmentation with Edge-Aware Transformers

DySeT: a Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction

LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation

Upper-body Hierarchical Graph for Skeleton Based Emotion Recognition in Assistive Driving

Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction

Zero-Shot Detection of AI-Generated Images

Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training

Exploring Guided Sampling of Conditional GANs

TCC-Det: Temporarily consistent cues for weakly-supervised 3D detection

Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis

OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation

Kalman-Inspired Feature Propagation for Video Face Super-Resolution

Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

VideoMamba: State Space Model for Efficient Video Understanding

Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds

Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models

Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection

FreeAugment: Data Augmentation Search Across All Degrees of Freedom

I2-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM

FlashTex: Fast Relightable Mesh Texturing with LightControlNet

GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence

ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance

SOS: Segment Object System for Open-World Instance Segmentation With Object Priors

Lagrangian Hashing for Compressed Neural Field Representations

Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis

Gaze Target Detection Based on Head-Local-Global Coordination

3DSA: Multi-View 3D Human Pose Estimation With 3D Space Attention Mechanisms

An Economic Framework for 6-DoF Grasp Detection

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery

Multi-Label Cluster Discrimination for Visual Representation Learning

Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation

DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion

CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks

Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Progressive Classifier and Feature Extractor Adaptation for Unsupervised Domain Adaptation on Point Clouds

RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation

StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions

Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences

Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective

Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

MagicEraser: Erasing Any Objects via Semantics-Aware Control

Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation

SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images

NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model

Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

3D Small Object Detection with Dynamic Spatial Pruning

Semantically Guided Representation Learning For Action Anticipation

MemBN: Robust Test-Time Adaptation via Batch Norm with Statistics Memory

ScanTalk: 3D Talking Heads from Unregistered Scans

FreeInit: Bridging Initialization Gap in Video Diffusion Models

Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching

Controllable Navigation Instruction Generation with Chain of Thought Prompting

TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning

LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

SuperGaussian: Repurposing Video Models for 3D Super Resolution

Towards Model-Agnostic Dataset Condensation by Heterogeneous Models

Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance

CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction

D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction

Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution

Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation

UniFS: Universal Few-shot Instance Perception with Point Representations

Linearly Controllable GAN: Unsupervised Feature Categorization and Decomposition for Image Generation and Manipulation

Physics-Based Interaction with 3D Objects via Video Generation

Taming Latent Diffusion Model for Neural Radiance Field Inpainting

Shedding More Light on Robust Classifiers under the lens of Energy-based Models

CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians

Unleashing the Power of Prompt-driven Nucleus Instance Segmentation

FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions

3DEgo: 3D Editing on the Go!

Domain-adaptive Video Deblurring via Test-time Blurring

NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving

Progressive Pretext Task Learning for Human Trajectory Prediction

Hyperion – A fast, versatile symbolic Gaussian Belief Propagation framework for Continuous-Time SLAM

Isomorphic Pruning for Vision Models

Reprojection Errors as Prompts for Efficient Scene Coordinate Regression

GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation

DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing

VideoClusterNet: Self-Supervised and Adaptive Face Clustering for Videos

Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack

Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Cross-Domain Learning for Video Anomaly Detection with Limited Supervision

Unsupervised Multi-modal Medical Image Registration via Invertible Translation

CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

View Selection for 3D Captioning via Diffusion Ranking

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting

Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent

WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream

DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment

SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis

PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-based Motion Capture

PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection

Improving Unsupervised Domain Adaptation: A Pseudo-Candidate Set Approach

Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction

BaSIC: BayesNet Structure Learning for Computational Scalable Neural Image Compression

Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition

SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing

ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation

OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks

Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data

Click Prompt Learning with Optimal Transport for Interactive Segmentation

T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

3D Human Pose Estimation via Non-Causal Retentive Networks

6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry

Enhancing Tampered Text Detection through Frequency Feature Fusion and Decomposition

DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction

Learning Diffusion Models for Multi-View Anomaly Detection

Masked Angle-Aware Autoencoder for Remote Sensing Images

Multi-modal Relation Distillation for Unified 3D Representation Learning

LongVLM: Efficient Long Video Understanding via Large Language Models

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection

Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model

Light-in-Flight for a World-in-Motion

Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels

Learning with Unmasked Tokens Drives Stronger Vision Learners

Efficient Training of Spiking Neural Networks with Multi-Parallel Implicit Stream Architecture

Deep Patch Visual SLAM

LiteSAM is Actually what you Need for segment Everything

GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections

Visual Prompting via Partial Optimal Transport

AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

Pathformer3D: A 3D Scanpath Transformer for 360° Images

Visual Grounding for Object-Level Generalization in Reinforcement Learning

TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection

SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection

Asymmetric Mask Scheme for Self-Supervised Real Image Denoising

FlexAttention for Efficient High-Resolution Vision-Language Models

EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation

EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Human Motion Generation

Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation

PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving

Temporal Event Stereo via Joint Learning with Stereoscopic Flow

H-V2X: A Large Scale Highway Dataset for BEV Perception

ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation

QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images

Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection

E3V-K5: An Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation

InstructIR: High-Quality Image Restoration Following Human Instructions

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

LayoutFlow: Flow Matching for Layout Generation

Making Large Language Models Better Planners with Reasoning-Decision Alignment

Continual Learning for Remote Physiological Measurement: Minimize Forgetting and Simplify Inference

PACE: Pose Annotations in Cluttered Environments

InfMAE: A Foundation Model in The Infrared Modality

Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs

STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians

Robust Incremental Structure-from-Motion with Hybrid Features

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

UniCal: Unified Neural Sensor Calibration

Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter

ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions

Trajectory-aligned Space-time Tokens for Few-shot Action Recognition

Synchronization of Projective Transformations

U-COPE: Taking a Further Step to Universal 9D Category-level Object Pose Estimation

Insect Identification in the Wild: The AMI Dataset

Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers

CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring

This Probably Looks Exactly Like That: An Invertible Prototypical Network

GenRC: Generative 3D Room Completion from Sparse Image Collections

Towards Open-ended Visual Quality Comparison

EgoPet: Egomotion and Interaction Data from an Animal's Perspective

Neural graphics texture compression supporting random access

Contrastive Learning with Synthetic Positives

GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features

DIM: Dyadic Interaction Modeling for Social Behavior Generation

ControlCap: Controllable Region-level Captioning

MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

Watch Your Steps: Local Image and Scene Editing by Text Instructions

Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation

LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians

Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning

A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties

Fast View Synthesis of Casual Videos with Soup-of-Planes

Confidence Self-Calibration for Multi-Label Class-Incremental Learning

Video Question Answering with Procedural Programs

DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification

Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting

SlotLifter: Slot-guided Feature Lifting for Learning Object-Centric Radiance Fields

Representation Enhancement-Stabilization: Reducing Bias-Variance of Domain Generalization

LLMGA: Multimodal Large Language Model based Generation Assistant

Shape from Heat Conduction

Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence

HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Towards High-Quality 3D Motion Transfer with Realistic Apparel Animation

AnyHome: Open-Vocabulary Large-Scale Indoor Scene Generation with First-Person View Exploration

Better Call SAL: Towards Learning to Segment Anything in Lidar

DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

iMatching: Imperative Correspondence Learning

Appearance-based Refinement for Object-Centric Motion Segmentation

Open Panoramic Segmentation

Open Vocabulary Multi-Label Video Classification

Shape-guided Configuration-aware Learning for Endoscopic-image-based Pose Estimation of Flexible Robotic Instruments

MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Efficient Pre-training for Localized Instruction Generation of Procedural Videos

MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution

DEAL: Disentangle and Localize Concept-level Explanations for VLMs

RoadPainter: Points Are Ideal Navigators for Topology transformER

Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models

Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following

IMMA: Immunizing text-to-image Models against Malicious Adaptation

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow

GeoCalib: Learning Single-image Calibration with Geometric Optimization

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

ReMatching: Low-Resolution Representations for Scalable Shape Correspondence

Semicalibrated Relative Pose from an Affine Correspondence and Monodepth

Global Structure-from-Motion Revisited

Gravity-aligned Rotation Averaging with Circular Regression

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

Quanta Video Restoration

Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

A Probability-guided Sampler for Neural Implicit Surface Rendering

CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model

ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image

FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition

POCA: Post-training Quantization with Temporal Alignment for Codec Avatars

HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture

A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

Audio-Synchronized Visual Animation

Expressive Whole-Body 3D Gaussian Avatar

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects

PAV: Personalized Head Avatar from Unstructured Video Collection

Strike a Balance in Continual Panoptic Segmentation

MultiDelete for Multimodal Machine Unlearning

Stitched ViTs are Flexible Vision Backbones

Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation

TrajPrompt: Aligning Color Trajectory with Vision-Language Representations

Stable Preference: Redefining training paradigm of human preference model for Text-to-Image Synthesis

CountFormer: Multi-View Crowd Counting Transformer

SemReg: Semantics Constrained Point Cloud Registration

You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation

MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection

Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views

RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception

R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition

Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors

ActionVOS: Actions as Prompts for Video Object Segmentation

DomainFusion: Generalizing To Unseen Domains with Latent Diffusion Models

One-stage Prompt-based Continual Learning

Unsqueeze [CLS] Bottleneck to Learn Rich Representations

Robust Multimodal Learning via Representation Decoupling

Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment

WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing

Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality

Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks

A Direct Approach to Viewing Graph Solvability

Effective Lymph Nodes Detection in CT Scans Using Location Debiased Query Selection and Contrastive Query Representation in Transformer

Look Hear: Gaze Prediction for Speech-directed Human Attention

Raising the Ceiling: Conflict-Free Local Feature Matching with Dynamic View Switching

Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

Parrot Captions Teach CLIP to Spot Text

Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning

Solving Motion Planning Tasks with a Scalable Generative Model

Rotary Position Embedding for Vision Transformer

Rebalancing Using Estimated Class Distribution for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

ReNoise: Real Image Inversion Through Iterative Noising

Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment

Leveraging Thermal Modality to Enhance Reconstruction in Low-Light Conditions

PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery

Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

Recursive Visual Programming

Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks

Learning to Adapt SAM for Segmenting Cross-domain Point Clouds

Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Human-in-the-Loop Visual Re-ID for Population Size Estimation

Finding Visual Task Vectors

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Tensorial template matching for fast cross-correlation with rotations and its application for tomography

Event Camera Data Dense Pre-training

Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

EgoLifter: Open-world 3D Segmentation for Egocentric Perception

MoVideo: Motion-Aware Video Generation with Diffusion Models

ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion

Where am I? Scene Retrieval with Language

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

RangeLDM: Fast Realistic LiDAR Point Cloud Generation

Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

Physically Plausible Color Correction for Neural Radiance Fields

Unifying 3D Vision-Language Understanding via Promptable Queries

LLM as Copilot for Coarse-grained Vision-and-Language Navigation

Revisiting Calibration of Wide-Angle Radially Symmetric Cameras

Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control

MAD-DR: Map Compression for Visual Localization with Matchness Aware Descriptor Dimension Reduction

A New Dataset and Framework for Real-World Blurred Images Super-Resolution

Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction

Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing

Uncertainty-aware sign language video retrieval with probability distribution modeling

NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction

SAFARI: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

Adversarial Prompt Tuning for Vision-Language Models

BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering

A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks

CC-SAM: Enhancing SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation

Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning

Operational Open-Set Recognition and PostMax Refinement

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Text2Place: Affordance-aware Text Guided Human Placement

REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices

Self-Training Room Layout via Geometry-aware Ray-casting

TAPTR: Tracking Any Point with Transformers as Detection

Adaptive Multi-task Learning for Few-shot Object Detection

Closed-Loop Unsupervised Representation Disentanglement with $\beta$-VAE Distillation and Diffusion Probabilistic Feedback

ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model

Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration

CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection

Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents

Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery

Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition

D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

TC4D: Trajectory-Conditioned Text-to-4D Generation

RAW-Adapter: Adapting Pretrained Visual Model to Camera RAW Images

Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding

Dataset Enhancement with Instance-Level Augmentations

AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models

Personalized Federated Domain-Incremental Learning based on Adaptive Knowledge Matching

ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images

Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery

SLIM: Spuriousness Mitigation with Minimal Human Annotations

Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset

X-Pose: Detecting Any Keypoints

MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition

∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions

OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing

UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction

Motion Aware Event Representation-driven Image Deblurring

Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Object Appearance Graphs

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models

GroupDiff: Diffusion-based Group Portrait Editing

Privacy-Preserving Adaptive Re-Identification without Image Transfer

Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression

UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt

TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation

MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views

Towards More Practical Group Activity Detection: A New Benchmark and Model

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Zero-Shot Image Feature Consensus with Deep Functional Maps

Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views

City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web

Co-Student: Collaborating Strong and Weak Students for Sparsely Annotated Object Detection

SeiT++: Masked Token Modeling Improves Storage-efficient Training

Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling

ProMerge: Prompt and Merge for Unsupervised Instance Segmentation

Open-Vocabulary Camouflaged Object Segmentation

CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images

PetFace: A Large-Scale Dataset and Benchmark for Animal Identification

A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging

InterFusion: Text-Driven Generation of 3D Human-Object Interaction

GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval

Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition

Learning Anomalies with Normality Prior for Unsupervised Video Anomaly Detection

Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

DNI: Dilutional Noise Initialization for Diffusion Video Editing

Fully Authentic Visual Question Answering Dataset from Online Communities

Towards Physical World Backdoor Attacks against Skeleton Action Recognition

Active Generation for Image Classification

Panel-Specific Degradation Representation for Raw Under-Display Camera Image Restoration

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Diffusion-Guided Weakly Supervised Semantic Segmentation

DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration

Real-time Holistic Robot Pose Estimation with Unknown States

Online Vectorized HD Map Construction using Geometry

Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation

Click-Gaussian: Interactive Segmentation to Any 3D Gaussians

Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data

Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion

Improving Virtual Try-On with Garment-focused Diffusion Models

MANIKIN: Biomechanically Accurate Neural Inverse Kinematics for Human Motion Estimation

Disentangled Generation and Aggregation for Robust Radiance Fields

MoAI: Mixture of All Intelligence for Large Language and Vision Models

SMooDi: Stylized Motion Diffusion Model

Online Temporal Action Localization with Memory-Augmented Transformer

JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Learning Video Context as Interleaved Multimodal Sequences

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

FLAT: Flux-aware Imperceptible Adversarial Attacks on 3D Point Clouds

Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit Networks

Multi-branch Collaborative Learning Network for 3D Visual Grounding

Progressive Proxy Anchor Propagation for Unsupervised Semantic Segmentation

Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

Revisit Human-Scene Interaction via Space Occupancy

Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model

Mitigating Background Shift in Class-Incremental Semantic Segmentation

Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

BKDSNN: Enhancing the Performance of Learning-based Spiking Neural Networks Training with Blurred Knowledge Distillation

Object-Oriented Anchoring and Modal Alignment in Multimodal Learning

SPIRE: Semantic Prompt-Driven Image Restoration

Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

Towards Stable 3D Object Detection

FYI: Flip Your Images for Dataset Distillation

On-the-fly Category Discovery for LiDAR Semantic Segmentation

Dual-Camera Smooth Zoom on Mobile Phones

Attention Decomposition for Cross-Domain Semantic Segmentation

CONDA: Condensed Deep Association Learning for Co-Salient Object Detection

PolyRoom: Room-aware Transformer for Floorplan Reconstruction

BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution

AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

Improving Video Segmentation via Dynamic Anchor Queries

Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

Diffusion Models as Optimizers for Efficient Planning in Offline RL

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

Coarse-to-Fine Implicit Representation Learning for 3D Hand-Object Reconstruction from a Single RGB-D Image

Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective

Flatness-aware Sequential Learning Generates Resilient Backdoors

PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning

Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning

An Incremental Unified Framework for Small Defect Inspection

Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection

MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation

PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training

Real-time 3D-aware Portrait Editing from a Single Image

Dolfin: Diffusion Layout Transformers without Autoencoder

Image Compression for Machine and Human Vision With Spatial-Frequency Adaptation

Platypus: A Generalized Specialist Model for Reading Text in Various Forms

DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks

Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation

Emergent Visual-Semantic Hierarchies in Image-Text Representations

DriveLM: Driving with Graph Visual Question Answering

Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation

LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors

Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection

Real Appearance Modeling for More General Deepfake Detection

6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

Event Trojan: Asynchronous Event-based Backdoor Attacks

V2X-Real: a Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception

VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing

GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding

Any2Point: Empowering Any-modality Transformers for Efficient 3D Understanding

HARIVO: Harnessing Text-to-Image Models for Video Generation

Deep Online Probability Aggregation Clustering

WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification

Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

Length-Aware Motion Synthesis via Latent Diffusion

Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification

Free Lunch for Gait Recognition: A Novel Relation Descriptor

OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers

An Optimal Control View of LoRA and Binary Controller Design for Vision Transformers

Disentangled Clothed Avatar Generation from Text Descriptions

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Exemplar-free Continual Representation Learning via Learnable Drift Compensation

Improving image synthesis with diffusion-negative sampling

AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

FedVAD: Enhancing Federated Video Anomaly Detection with GPT-Driven Semantic Distillation

SignGen: End-to-End Sign Language Video Generation with Latent Diffusion

Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

Idling Neurons, Appropriately Lenient Workload During Fine-tuning Leads to Better Generalization

Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction

S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis

The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

Accelerating Image Generation with Sub-path Linear Approximation Model

Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations

SFPNet: Sparse Focal Point Network for Semantic Segmentation on General LiDAR Point Clouds

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation

Camera Calibration using a Collimator System

GRA: Detecting Oriented Objects through Group-wise Rotating and Attention

Track Everything Everywhere Fast and Robustly

AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion

Label-free Neural Semantic Image Synthesis

Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation

Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment

FARSE-CNN: Fully Asynchronous, Recurrent and Sparse Event-Based CNN

ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images

Event-Aided Time-To-Collision Estimation for Autonomous Driving

MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

The Devil is in the Statistics: Mitigating and Exploiting Statistics Difference for Generalizable Semi-supervised Medical Image Segmentation

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

VEON: Vocabulary-Enhanced Occupancy Prediction

Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models

HiEI: A Universal Framework for Generating High-quality Emerging Images from Natural Images

Nonverbal Interaction Detection

The Sky's the Limit: Relightable Outdoor Scenes via a Sky-pixel Constrained Illumination Prior and Outside-In Visibility

DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights

I-MedSAM: Implicit Medical Image Segmentation with Segment Anything

Neural Spectral Decomposition for Dataset Distillation

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Region-Adaptive Transform with Segmentation Prior for Image Compression

SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

Cascade Prompt Learning for Visual-Language Model Adaptation

Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion

cDP-MIL: Robust Multiple Instance Learning via Cascaded Dirichlet Process

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition

Delving Deep into Engagement Prediction of Short Videos

CLEO: Continual Learning of Evolving Ontologies

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion

Leveraging scale- and orientation-covariant features for planar motion estimation

MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion

MultiGen: Zero-shot Image Generation from Multi-modal Prompts

Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning

VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors

SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting

Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction

Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network

AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition

HERGen: Elevating Radiology Report Generation with Longitudinal Data

Labeled Data Selection for Category Discovery

Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation

Dependency-aware Differentiable Neural Architecture Search

CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer

SNeRV: Spectra-preserving Neural Representation for Video

COMO: Compact Mapping and Odometry

SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder

EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

An Information Theoretical View for Out-Of-Distribution Detection

HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras

Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth

WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models

SILC: Improving Vision Language Pretraining with Self-Distillation

Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction

DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes

Transferable 3D Adversarial Shape Completion using Diffusion Models

Gradient-Aware for Class-Imbalanced Semi-supervised Medical Image Segmentation

Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising

Event-Adapted Video Super-Resolution

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation

LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection

Beyond the Data Imbalance: Employing the Heterogeneous Datasets for Vehicle Maneuver Prediction

On Pretraining Data Diversity for Self-Supervised Learning

Bayesian Self-Training for Semi-Supervised 3D Segmentation

Tri^{2}-plane: Thinking Head Avatar via Feature Pyramid

Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling

Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal

ParCo: Part-Coordinating Text-to-Motion Synthesis

Learning to Complement and to Defer to Multiple Users

Tiny Models are the Computational Saver for Large Models

Multi-Sentence Grounding for Long-term Instructional Video

AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

Unveiling Privacy Risks in Stochastic Neural Networks Training: Effective Image Reconstruction from Gradients

Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

Rate-Distortion-Cognition Controllable Versatile Neural Image Compression

Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image Denoisers

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency

PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Vista3D: unravel the 3d darkside of a single image

Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models

Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors

Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation

StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion

DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control

Segment and Recognize Anything at Any Granularity

ST-LLM: Large Language Models Are Effective Temporal Learners

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars

Exact Diffusion Inversion via Bidirectional Integration Approximation

Textual Query-Driven Mask Transformer for Domain Generalized Segmentation

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Object-Centric Diffusion for Efficient Video Editing

Single-Mask Inpainting for Voxel-based Neural Radiance Fields

Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval

SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking

Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Agglomerative Token Clustering

CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection

NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition

GIVT: Generative Infinite-Vocabulary Transformers

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density

Multi-Modal Video Dialog State Tracking in the Wild

Factorized Diffusion: Perceptual Illusions by Noise Decomposition

Combining Generative and Geometry Priors for Wide-Angle Portrait Correction

To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

StereoGlue: Joint Feature Matching and Robust Estimation

Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction

Foster Adaptivity and Balance in Learning with Noisy Labels

Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection

Robust Zero-Shot Crowd Counting and Localization with Adaptive Resolution SAM

AWOL: Analysis WithOut synthesis using Language

OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Temporal Residual Jacobians for Rig-free Motion Transfer

Object-Aware NIR-to-Visible Translation

Taming Lookup Tables for Efficient Image Retouching

DualDn: Dual-domain Denoising via Differentiable ISP

From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector

NICP: Neural ICP for 3D Human Registration at Scale

Syn-to-Real Domain Adaptation for Point Cloud Completion via Part-based Approach

PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines

LiDAR-based All-weather 3D Object Detection via Prompting and Distilling 4D Radar

FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation

Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras

StableDrag: Stable Dragging for Point-based Image Editing

Phase Concentration and Shortcut Suppression for Weakly Supervised Semantic Segmentation

Scaling Up Personalized Image Aesthetic Assessment via Task Vector Customization

Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy

Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context

Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching

Monocular Occupancy Prediction for Scalable Indoor Scenes

Neural Surface Detection for Unsigned Distance Fields

Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation

Random Walk on Pixel Manifolds for Anomaly Segmentation of Complex Driving Scenes

Event-Based Motion Magnification

AdaIFL: Adaptive Image Forgery Localization via a Dynamic and Importance-aware Transformer Network

Improving Neural Surface Reconstruction with Feature Priors from Multi-View Images

Towards Multimodal Sentiment Analysis Debiasing via Bias Purification

ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos

MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty

PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts

Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

Event-based Head Pose Estimation: Benchmark and Method

UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction

PSALM: Pixelwise Segmentation with Large Multi-modal Model

Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging

Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning

Robustness Tokens: Towards Adversarial Robustness of Transformers

DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images

DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models

PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments

Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models

EINet: Point Cloud Completion via Extrapolation and Interpolation

Bridging the Gap Between Human Motion and Action Semantics via Kinematics Phrases

Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation

ReCON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories

AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval

TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models

DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays

StyleCity: Large-Scale 3D Urban Scenes Stylization

Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images

ViG-Bias: Visually Grounded Bias Discovery and Mitigation

Learning Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors

DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior

Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable Repainting

Relightable Neural Actor with Intrinsic Decomposition and Pose Control

Assessing Sample Quality via the Latent Space of Generative Models

Enhancing Vectorized Map Perception with Historical Rasterized Maps

Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model

Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation

M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Responsible Visual Editing

Consistent 3D Line Mapping

Distributed Active Client Selection With Noisy Clients Using Model Association Scores

PixOOD: Pixel-Level Out-of-Distribution Detection

SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving

ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Editable Image Elements for Controllable Synthesis

General Geometry-aware Weakly Supervised 3D Object Detection

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning

EA-VTR: Event-Aware Video-Text Retrieval

GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns

POA: Pre-training Once for Models of All Sizes

Towards a Density Preserving Objective Function for Learning on Point Sets

VF-NeRF: Viewshed Fields for Rigid NeRF Registration

RSL-BA: Rolling Shutter Line Bundle Adjustment

Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction

Trainable Highly-expressive Activation Functions

MesonGS: Post-training Compression of 3D Gaussians via Efficient Attribute Transformation

RealViformer: Investigating Attention for Real-World Video Super-Resolution

Do text-free diffusion models learn discriminative visual representations?

Part2Object: Hierarchical Unsupervised 3D Instance Segmentation

Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation

Training-Free Model Merging for Multi-target Domain Adaptation

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Instant 3D Human Avatar Generation using Image Diffusion Models

MotionDirector: Motion Customization of Text-to-Video Diffusion Models

DOCCI: Descriptions of Connected and Contrasting Images

Drag Anything: Motion Control for Anything using Entity Representation

RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception

ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

A Rotation-invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images

EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural Networks

AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild

ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition

LogoSticker: Inserting Logos into Diffusion Models for Customized Generation

R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection

McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

LEROjD: Lidar Extended Radar-Only Object Detection

ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation

Probabilistic Image-Driven Traffic Modeling via Remote Sensing

VideoStudio: Generating Consistent-Content and Multi-Scene Videos

Semantic Residual Prompts for Continual Learning

EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds

Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection

Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities

Occupancy as Set of Points

UAV First-Person Viewers Are Radiance Field Learners

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

A Fair Ranking and New Model for Panoptic Scene Graph Generation

ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Knowledge-enhanced Visual-Language Pretraining for Computational Pathology

Pick-a-back: Selective Device-to-Device Knowledge Transfer in Federated Continual Learning

Situated Instruction Following

M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions

FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance

Holodepth: Programmable Depth-Varying Projection via Computer-Generated Holography

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

GalLop: Learning global and local prompts for vision-language models

Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor

Two-Stage Video Shadow Detection via Temporal-Spatial Adaption

N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization

Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation

Lossy Image Compression with Foundation Diffusion Models

UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

FMBoost: Boosting Latent Diffusion with Flow Matching

Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation

Shifted Autoencoders for Point Annotation Restoration in Object Counting

An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes

Kernel Diffusion: An Alternate Approach to Blind Deconvolution

FoundPose: Unseen Object Pose Estimation with Foundation Features

LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration

Diffusion Models as Data Mining Tools

Graph Neural Network Causal Explanation via Neural Causal Models

SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather

SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

Improving Adversarial Transferability via Model Alignment

RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios

ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation

Embodied Understanding of Driving Scenarios

NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation

ViLA: Efficient Video-Language Alignment for Video Question Answering

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Factorizing Text-to-Video Generation by Explicit Image Conditioning

MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

Open-Set Biometrics: Beyond Good Closed-Set Models

Osmosis: RGBD Diffusion Prior for Underwater Image Restoration

Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

Computing the Lipschitz constant needed for fast scene recovery from CASSI measurements

DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields

Flowed Time of Flight Radiance Fields

Cut out the Middleman: Revisiting Pose-based Gait Recognition

3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing

Fast Registration of Photorealistic Avatars for VR Facial Animation

CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings

HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs

FedHARM: Harmonizing Model Architectural Diversity in Federated Learning

Thinking Outside the BBox: Unconstrained Generative Object Compositing

EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS

Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Autonomous Driving

RT-Pose: A 4D Radar-Tensor based 3D Human Pose Estimation and Localization Benchmark

EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models

RICA^2: Rubric-Informed, Calibrated Assessment of Actions

Commonly Interesting Images

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

CriSp: Leveraging Tread Depth Maps for Enhanced Crime-Scene Shoeprint Matching

Caltech Aerial RGB-Thermal Dataset in the Wild

Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

CityGuessr: City-Level Video Geo-Localization on a Global Scale

Bayesian Detector Combination for Object Detection with Crowdsourced Annotations

Revising Densification in Gaussian Splatting

FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing

Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions

UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation

PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis

A Graph-Based Approach for Category-Agnostic Pose Estimation

Depth-guided NeRF Training via Earth Mover’s Distance

INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding

DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks

Diagnosing and Re-learning for Balanced Multimodal Learning

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

Elucidating the Hierarchical Nature of Behavior with Masked Autoencoders

Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data

AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation

CARB-Net: Camera-Assisted Radar-Based Network for Vulnerable Road User Detection

SAH-SCI: Self-Supervised Adapter for Efficient Hyperspectral Snapshot Compressive Imaging

Minimalist Vision with Freeform Pixels

All You Need is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation

LatentEditor: Text Driven Local Editing of 3D Scenes

POET: Prompt Offset Tuning for Continual Human Action Adaptation

IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers

Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs

TrafficNight: An Aerial Multimodal Benchmark For Nighttime Vehicle Surveillance

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Loc3Diff: Local Diffusion for 3D Human Head Synthesis and Editing

Generative End-to-End Autonomous Driving

Learning to Distinguish Samples for Generalized Category Discovery

COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark

Diff-Reg: Diffusion Model in Doubly Stochastic Matrix Space for Registration Problem

WBP: Training-time Backdoor Attacks through Hardware-based Weight Bit Poisoning

Towards Dual Transparent Liquid Level Estimation in Biomedical Lab: Dataset, Methods and Practice

Encapsulating Knowledge in One Prompt

Delving into Adversarial Robustness on Document Tampering Localization

Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing

Confidence-Based Iterative Generation for Real-World Image Super-Resolution

Seeing Faces in Things: A Model and Dataset for Pareidolia

Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering

AMD: Automatic Multi-step Distillation of Large-scale Vision Models

FairViT: Fair Vision Transformer via Adaptive Masking

VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

Frugal 3D Point Cloud Model Training via Progressive Near Point Filtering and Fused Aggregation

HVCLIP: High-dimensional Vector in CLIP for Unsupervised Domain Adaptation

Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data

Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction

Investigating Style Similarity in Diffusion Models

JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention

MagicMirror: Fast and High-Quality Avatar Generation with Constrained Search Space

EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification

SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors

GTMS: A Gradient-driven Tree-guided Mask-free Referring Image Segmentation Method

SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction

VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

Unmasking Bias in Diffusion Model Training

Multimodal Label Relevance Ranking via Reinforcement Learning

A Simple Background Augmentation Method for Object Detection with Diffusion Model

BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events

A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization

Deep Polarization Cues for Single-shot Shape and Subsurface Scattering Estimation

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion

An Explainable Vision Question Answer Model via Diffusion Chain-of-Thought

Fast Sprite Decomposition from Animated Graphics

Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection

IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection

PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation

CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs

UGG: Unified Generative Grasping

A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures

FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation

Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt

GAMMA-FACE: GAussian Mixture Models Amend Diffusion Models for Bias Mitigation in Face Images

Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Robustness Preserving Fine-tuning using Neuron Importance

ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

Similarity of Neural Architectures using Adversarial Attack Transferability

Dual-Rain: Video Rain Removal using Assertive and Gentle Teachers

PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation

Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks

Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement

Scene-Conditional 3D Object Stylization and Composition

GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

Revisit Anything: Visual Place Recognition via Image Segment Retrieval

Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation

DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation

Self-Guided Generation of Minority Samples Using Diffusion Models

DEVIAS: Learning Disentangled Video Representations of Action and Scene

RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Class-Agnostic Object Counting with Text-to-Image Diffusion Model

Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks

Forbes: Face Obfuscation Rendering via Backpropagation Refinement Scheme

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Information Bottleneck Based Data Correction in Continual Learning

A Watermark-Conditioned Diffusion Model for IP Protection

Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation

SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning

FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion

Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation

On Spectral Properties of Gradient-based Explanation Methods

DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation

Generalizing to Unseen Domains via Text-guided Augmentation

Contextual Correspondence Matters: Bidirectional Graph Matching for Video Summarization

VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model

Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

Adaptive Multi-head Contrastive Learning

Rotated Orthographic Projection for Self-Supervised 3D Human Pose Estimation

Easing 3D Pattern Reasoning with Side-view Features for Semantic Scene Completion

MO-EMT-NAS: Multi-Objective Continuous Transfer of Architectural Knowledge Between Tasks from Different Datasets

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

Adaptive Annealing for Robust Averaging

MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery

High-Quality Mesh Blendshape Generation from Face Videos via Neural Inverse Rendering

Early Anticipation of Driving Maneuvers

SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization

On the Evaluation Consistency of Attribution-based Explanations

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction

DreamReward: Aligning Human Preference in Text-to-3D Generation

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks

CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches

Towards Image Ambient Lighting Normalization

FedHide: Federated Learning by Hiding in the Neighbors

Self-Cooperation Knowledge Distillation for Novel Class Discovery

SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery

EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding

GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?

A Comparative Study of Image Restoration Networks for General Backbone Network Design

HoloADMM: High-Quality Holographic Complex Field Recovery

Synthesizing Time-varying BRDFs via Latent Space

Fundamental Matrix Estimation Using Relative Depths

MTaDCS: Moving Trace and Feature Density-based Confidence Sample Selection under Label Noise

Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis

DataDream: Few-shot Guided Dataset Generation

LPViT: Low-Power Semi-structured Pruning for Vision Transformers

Weighted Ensemble Models Are Strong Continual Learners

GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time

Learning Equilibrium Transformation for Gamut Expansion and Color Restoration

Physics-informed Knowledge Transfer for Underwater Monocular Depth Estimation

Robust Nearest Neighbors for Source-Free Domain Adaptation under Class Distribution Shift

Chains of Diffusion Models

Feature Diversification and Adaptation for Federated Domain Generalization

TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling

Dataset Distillation by Automatic Training Trajectories

RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes

RecurrentBEV: A Long-term Temporal Fusion Framework for Multi-view 3D Detection

Learning Neural Deformation Representation for 4D Dynamic Shape Generation

Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models

Domain Shifting: A Generalized Solution for Heterogeneous Cross-Modality Person Re-Identification

Self-Supervised Video Desmoking for Laparoscopic Surgery

Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining

Continuity Preserving Online CenterLine Graph Learning

Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping

MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections

Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection

AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking

HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos

Online Video Quality Enhancement with Spatial-Temporal Look-up Tables

PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model

Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance

Oulu Remote-photoplethysmography Physical Domain Attacks Database (ORPDAD)

Leveraging Imperfect Restoration for Data Availability Attack

DoubleTake: Geometry Guided Depth Estimation

Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View

Good Teachers Explain: Explanation-Enhanced Knowledge Distillation

FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models

Möbius Transform for Mitigating Perspective Distortions in Representation Learning

TAG: Text Prompt Augmentation for Zero-Shot Out-of-Distribution Detection

CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Continual Learning and Unknown Object Discovery in 3D Scenes via Self-Distillation

DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting

Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking

Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection

Region-Native Visual Tokenization

The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization

Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond

Rethinking Directional Parameterization in Neural Implicit Surface Reconstruction

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling

Multi-modal Crowd Counting via a Broker Modality

FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation

Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

MeshVPR: Citywide Visual Place Recognition Using 3D Meshes

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

ViPer: Visual Personalization of Generative Models via Individual Preference Learning

MLPHand: Real Time Multi-View 3D Hand Reconstruction via MLP Modeling

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

How Far Can a 1-Pixel Camera Go? Solving Vision Tasks using Photoreceptors and Computationally Designed Visual Morphology

MONTRAGE: Monitoring Training for Attribution of Generative Diffusion Models

Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

Self-supervised visual learning from interactions with objects

OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation

BAFFLE: A Baseline of Backpropagation-Free Federated Learning

OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

Diverse Text-to-3D Synthesis with Augmented Text Embedding

LLMCO4MR: LLMs-aided Neural Combinatorial Optimization for Ancient Manuscript Restoration from Fragments with Case Studies on Dunhuang

AdversariaLeak: External Information Leakage Attack Using Adversarial Samples on Face Recognition Systems

SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation

Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier

Enhanced Sparsification via Stimulative Training

Solving the inverse problem of microscopy deconvolution with a residual Beylkin-Coifman-Rokhlin neural network

FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models

Weighting Pseudo-Labels via High-Activation Feature Index Similarity and Object Detection for Semi-Supervised Segmentation

WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Spiking Wavelet Transformer

WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing

PDT Uav Target Detection Dataset for Pests and Diseases Tree

Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection

COD: Learning Conditional Invariant Representation for Domain Adaptation Regression

RANRAC: Robust Neural Scene Representations via Random Ray Consensus

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

SIMBA: Split Inference - Mechanisms, Benchmarks and Attacks

DQ-DETR: DETR with Dynamic Query for Tiny Object Detection

SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians

Gaussian in the wild: 3D Gaussian Splatting for Unconstrained Image Collections

Few-shot Defect Image Generation based on Consistency Modeling

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Video Editing via Factorized Diffusion Distillation

Trackastra: Transformer-based cell tracking for live-cell microscopy

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation

GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring

Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring

ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion

Curved Diffusion: A Generative Model With Optical Geometry Control

CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning

OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation

Skeleton Recall Loss for Connectivity Conserving and Resource Efficient Segmentation of Thin Tubular Structures

Conceptual Codebook Learning for Vision-Language Models

AnimateMe: 4D Facial Expressions via Diffusion Models

LingoQA: Video Question Answering for Autonomous Driving

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors

iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning

Context Diffusion: In-Context Aware Image Generation

Pose Guided Fine-Grained Sign Language Video Generation

RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Certifiably Robust Image Watermark

Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Online Zero-Shot Classification with CLIP

SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning

Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder

Weakly-Supervised 3D Hand Reconstruction with Knowledge Prior and Uncertainty Guidance

3D Reconstruction of Objects in Hands without Real World 3D Supervision

To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of Point Cloud Transfer Learning

Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops

Parameterized Quasi-Physical Simulators for Dexterous Manipulations Transfer

Optimization-based Uncertainty Attribution Via Learning Informative Perturbations

Semi-supervised Segmentation of Histopathology Images with Noise-Aware Topological Consistency

Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling

MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks

Explorative Inbetweening of Time and Space

A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control

Learning to Make Keypoints Sub-Pixel Accurate

Imaging with Confidence: Uncertainty Quantification for High-dimensional Undersampled MR Images

Generalizable Human Gaussians for Sparse View Synthesis

Evaluating the Adversarial Robustness of Semantic Segmentation: Trying Harder Pays Off

GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction

AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation

PFedEdit: Personalized Federated Learning via Automated Model Editing

De-Confusing Pseudo-Labels in Source-Free Domain Adaptation

Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models

Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores

MMVR: Millimeter-wave Multi-View Radar Dataset and Benchmark for Indoor Perception

EpipolarGAN: Omnidirectional Image Synthesis with Explicit Camera Control

Photorealistic Video Generation with Diffusion Models

RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models

Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval

Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding

Self-Supervised Audio-Visual Soundscape Stylization

Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

Source-Free Domain-Invariant Performance Prediction

Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures

Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort

Direct Distillation between Different Domains

GRiT: A Generative Region-to-text Transformer for Object Understanding

LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System

Learning Representation for Multitask Learning through Self-Supervised Auxiliary Learning

Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending

Geometry Fidelity for Spherical Images

BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling

CroMo-Mixup: Augmenting Cross-Model Representations for Continual Self-Supervised Learning

Free-Editor: Zero-shot Text-driven 3D Scene Editing

DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

Generalizable Symbolic Optimizer Learning

Tackling Structural Hallucination in Image Translation with Local Diffusion

Unified Medical Image Pre-training in Language-Guided Common Semantic Space

On the Vulnerability of Skip Connections to Model Inversion Attacks

Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector

Reinforcement Learning via Auxiliary Task Distillation

DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation

View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception

Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models

STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay

Fairness-aware Vision Transformer via Debiased Self-Attention

Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry

Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Efficient Learning of Event-based Dense Representation using Hierarchical Memories with Adaptive Update

SNP: Structured Neuron-level Pruning to Preserve Attention Scores

PALM: Predicting Actions through Language Models

Motion Keyframe Interpolation for Any Human Skeleton using Point Cloud-based Human Motion Data Homogenisation

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

Improving Hyperbolic Representations via Gromov-Wasserstein Regularization

VSViG: Real-time Video-based Seizure Detection via Skeleton-based Spatiotemporal ViG

DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose

Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense

Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

Depth-Aware Blind Image Decomposition for Real-World Adverse Weather Recovery

DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation

PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation

PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control

LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation

Efficient Training with Denoised Neural Weights

Integration of Global and Local Representations for Fine-grained Cross-modal Alignment

Local and Global Flatness for Federated Domain Generalization

SRPose: Two-view Relative Pose Estimation with Sparse Keypoints

Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs

Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation

Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement

Efficient Vision Transformers with Partial Attention

Generalized Coverage for More Robust Low-Budget Active Learning

Rasterized Edge Gradients: Handling Discontinuities Differentially

Kinetic Typography Diffusion Model

Enhancing Cross-Subject fMRI-to-Video Decoding with Global-Local Functional Alignment

ZeroI2V: Zero-Cost Adaptation of Pre-Trained Transformers from Image to Video

Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems

R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model

OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection

Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion

Data Poisoning Quantization Backdoor Attack

T-CorresNet: Template Guided 3D Point Cloud Completion with Correspondence Pooling Query Generation Strategy

DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

A high-quality robust diffusion framework for corrupted dataset

Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

Distilling Knowledge from Large-Scale Image Models for Object Detection

Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection

TimeLens-XL: Real-time Event-based Video Frame Interpolation with Large Motion

Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection

Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets

Unsupervised Representation Learning by Balanced Self Attention Matching

Identity-Consistent Diffusion Network for Grading Knee Osteoarthritis Progression in Radiographic Imaging

Enhancing Source-Free Domain Adaptive Object Detection with Low-confidence Pseudo Label Distillation

Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation

SCOD: From Heuristics to Theory

Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts

Teach CLIP to Develop a Number Sense for Ordinal Regression

Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

Compact 3D Scene Representation via Self-Organizing Gaussian Grids

VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking

SelfGeo: Self-supervised and Geodesic-consistent Estimation of Keypoints on Deformable Shapes

Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning

T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

Towards Certifiably Robust Face Recognition

Linking in Style: Understanding learned features in deep learning models

Stable Video Portraits

CliffPhys: Camera-based Respiratory Measurement using Clifford Neural Networks

Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network

PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

Instant Uncertainty Calibration of NeRFs Using a Meta-Calibrator

SHIC: Shape-Image Correspondences with no Keypoint Supervision

Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection

Weight Conditioning for Smooth Optimization of Neural Networks

Energy-Calibrated VAE with Test Time Free Lunch

SceneTeller: Language-to-3D Scene Generation

MagMax: Leveraging Model Merging for Seamless Continual Learning

Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Debiasing surgeon: fantastic weights and how to find them

Denoising Vision Transformers

Differentiable Product Quantization for Memory Efficient Camera Relocalization

Spline-based Transformers

Learning Pseudo 3D Guidance for View-consistent Texturing with 2D Diffusion

SparseRadNet: Sparse Perception Neural Network on Subsampled Radar Data

TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly

Efficient NeRF Optimization - Not All Samples Remain Equally Hard

Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

Catastrophic Overfitting: A Potential Blessing in Disguise

Adversarial Diffusion Distillation

Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection

Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis

Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models

Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

Text-Conditioned Resampler For Long Form Video Understanding

Using My Artistic Style? You Must Obtain My Authorization

Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

UMERegRobust – Universal Manifold Embedding Compatible Features for Robust Point Cloud Registration

Non-transferable Pruning

A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Affine steerers for structured keypoint description

FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving

GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth

EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision

latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction

HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions

HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation

InstructGIE: Towards Generalizable Image Editing

Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning

CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models

Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation

Towards Scene Graph Anticipation

Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration

Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models

DeTra: A Unified Model for Object Detection and Trajectory Forecasting

ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction

Common Sense Reasoning for Deep Fake Detection

GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning

Tight and Efficient Upper Bound on Spectral Norm of Convolutional Layers

Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models

FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning

Learning Multimodal Latent Generative Models with Energy-Based Prior

Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution

Markov Knowledge Distillation: Make Nasty Teachers trained by Self-undermining Knowledge Distillation Fully Distillable

CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting

Snuffy: Efficient Whole Slide Image Classifier

Learning to Build by Building Your Own Instructions

Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling

BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

CoTracker: It is Better to Track Together

Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation

Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic

LISO: Lidar-only Self-Supervised 3D Object Detection

Frontier-enhanced Topological Memory with Improved Exploration Awareness for Embodied Visual Navigation

Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)

LookupViT: Compressing visual information to a limited number of tokens

Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization

REDIR: Refocus-free Event-based De-occlusion Image Reconstruction

Towards compact reversible image representations for neural style transfer

InsMapper: Exploring Inner-instance Information for Vectorized HD Mapping

Exploring Vulnerabilities in Spiking Neural Networks: Direct Adversarial Attacks on Raw Event Data

MRSP: Learn Multi-Representations of Single Primitive for Compositional Zero-Shot Learning

GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity

KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval

The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?

VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition

Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

Better Regression Makes Better Test-time Adaptive 3D Object Detection

Temporally Consistent Stereo Matching

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy

ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention

Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

Benchmarking Spurious Bias in Few-Shot Image Classifiers

Deep Companion Learning: Enhancing Generalization Through Historical Consistency

WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering

Straightforward Layer-wise Pruning for More Efficient Visual Adaptation

ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting

CrossScore: A Multi-View Approach to Image Evaluation and Scoring

CPM: Class-conditional Prompting Machine for Audio-visual Segmentation

DiffClass: Diffusion-Based Class Incremental Learning

Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning

PromptFusion: Decoupling Stability and Plasticity for Continual Learning

SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm

PFGS: High Fidelity Point Cloud Rendering via Feature Splatting

DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing

Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework

DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception

PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning

Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking

Data Augmentation via Latent Diffusion for Saliency Prediction

PiTe: Pixel-Temporal Alignment for Large Video-Language Model

3D Gaussian Parametric Head Model

Dynamic Neural Radiance Field From Defocused Monocular Video

Retargeting Visual Data with Deformation Fields

Ray-Distance Volume Rendering for Neural Scene Reconstruction

4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation

Spike-Temporal Latent Representation for Energy-Efficient Event-to-Video Reconstruction

Sur^2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion

Realistic Human Motion Generation with Cross-Diffusion Models

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

Continuous Memory Representation for Anomaly Detection

UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Diffusion Reward: Learning Rewards via Conditional Video Diffusion

Efficient Depth-Guided Urban View Synthesis

OneRestore: A Universal Restoration Framework for Composite Degradation

Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention

Beyond MOT: Semantic Multi-Object Tracking

PartCraft: Crafting Creative Objects by Parts

WordRobe: Text-Guided Generation of Textured 3D Garments

ZeST: Zero-Shot Material Transfer from a Single Image

AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework

Online Continuous Generalized Category Discovery

AddMe: Zero-shot Group-photo Synthesis by Inserting People into Scenes

Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning

KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter

MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo

MC-PanDA: Mask Confidence for Panoptic Domain Adaptation

GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting

Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design

BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation

Rethinking Few-shot Class-incremental Learning: Learning from Yourself

VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement

STSP: Spatial-Temporal Subspace Projection for Video Class-incremental Learning

Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint

AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Long-CLIP: Unlocking the Long-Text Capability of CLIP

RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF

Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation

FuseTeacher: Modality-fused Encoders are Strong Vision Supervisors

MVDD: Multi-View Depth Diffusion Models

Dataset Quantization with Active Learning based Adaptive Sampling

Interpretability-Guided Test-Time Adversarial Defense

Self-Supervised Representation Learning for Adversarial Attack Detection

GroundUp: Rapid Sketch-Based 3D City Massing

Photon Inhibition for Energy-Efficient Single-Photon Imaging

CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning

Learning with Counterfactual Explanations for Radiology Report Generation

Pseudo-Embedding for Generalized Few-Shot Point Cloud Segmentation

Wavelet Convolutions for Large Receptive Fields

AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer

Gradient-based Out-of-Distribution Detection

Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs

Non-Exemplar Domain Incremental Learning via Cross-Domain Concept Integration

Data-to-Model Distillation: Data-Efficient Learning Framework

Simple Unsupervised Knowledge Distillation With Space Similarity

3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

DSMix: Distortion-Induced Saliency Map Based Pre-training for No-Reference Image Quality Assessment

Learning Natural Consistency Representation for Face Forgery Video Detection

DragVideo: Interactive Drag-style Video Editing

Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging

One-Shot Diffusion Mimicker for Handwritten Text Generation

Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning

Multi-Person Pose Forecasting with Individual Interaction Perceptron and Prior Learning

FunQA: Towards Surprising Video Comprehension

Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

UpFusion: Novel View Diffusion from Unposed Sparse View Observations

EDformer: Transformer-Based Event Denoising Across Varied Noise Levels

UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation

View-Consistent 3D Editing with Gaussian Splatting

Few-shot NeRF by Adaptive Rendering Loss Regularization

HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes

FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis

Generating Human Interaction Motions in Scenes with Text Control

Optimizing Illuminant Estimation in Dual-Exposure HDR Imaging

MeshSegmenter: Zero-Shot Mesh Segmentation via Texture Synthesis

VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping

FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

VersatileGaussian: Real-time Neural Rendering for Versatile Tasks using Gaussian Splatting

Instruction Tuning-free Visual Token Complement for Multimodal LLMs

Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance

Pyramid Diffusion for Fine 3D Large Scene Generation

Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

MotionChain: Conversational Motion Controllers via Multimodal Prompts

Synthesizing Environment-Specific People in Photographs

Open-World Dynamic Prompt and Continual Visual Representation Learning

Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning

Customized Generation Reimagined: Fidelity and Editability Harmonized

HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Text2LiDAR: Text-guided LiDAR Point Clouds Generation via Equirectangular Transformer

Co-speech Gesture Video Generation with 3D Human Meshes

SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

DiffusionPen: Towards Controlling the Style of Handwritten Text Generation

From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation

PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

MonoTTA: Fully Test-Time Adaptation for Monocular 3D Object Detection

Revisit Self-supervision with Local Structure-from-Motion

On the Viability of Monocular Depth Pre-training for Semantic Segmentation

Weakly-supervised Camera Localization by Ground-to-satellite Image Registration

NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation

Latent Guard: a Safety Framework for Text-to-image Generation

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

ProtoComp: Diverse Point Cloud Completion with Controllable Prototype

FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation

Physical-Based Event Camera Simulator

Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture

EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

Implicit Steganography Beyond the Constraints of Modality

Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

Volumetric Rendering with Baked Quadrature Fields

Flying with Photons: Rendering Novel Views of Propagating Light

LivePhoto: Real Image Animation with Text-guided Motion Control

Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment

High-Fidelity and Transferable NeRF Editing by Frequency Decomposition

Implicit Style-Content Separation using B-LoRA

Inf-DiT: Upsampling any-resolution image with memory-efficient diffusion transformer

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems

OAPT: Offset-Aware Partition Transformer for Double JPEG Artifacts Removal

Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration

Understanding Multi-compositional learning in Vision and Language models via Category Theory

Animate Your Motion: Turning Still Images into Dynamic Videos

Spatial-Temporal Multi-level Association for Video Object Segmentation

Point-supervised Panoptic Segmentation via Estimating Pseudo Labels from Learnable Distance

CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection

Context-Aware Action Recognition: Introducing a Comprehensive Dataset for Behavior Contrast

NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image

Face Reconstruction Transfer Attack as Out-of-Distribution Generalization

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Texture-GS: Disentangle the Geometry and Texture for 3D Gaussian Splatting Editing

Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

UniProcessor: A Text-induced Unified Low-level Image Processor

Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors

Tokenize Anything via Prompting

Visual Alignment Pre-training for Sign Language Translation

GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering

Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection

Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Arc2Face: A Foundation Model for ID-Consistent Human Faces

Let the Avatar Talk using Texts without Paired Training Data

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Region-centric Image-Language Pretraining for Open-Vocabulary Detection

DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement

Learning Camouflaged Object Detection from Noisy Pseudo Label

PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition

SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization

Attention Beats Linear for Fast Implicit Neural Representation Generation

WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation

Timestep-Aware Correction for Quantized Diffusion Models

LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models

Prompt-Based Test-Time Real Image Dehazing: A Novel Pipeline

RCS-Prompt: Learning Prompt to Rearrange Class Space for Prompt-based Continual Learning

FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning

Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge

Emerging Property of Masked Token for Effective Pre-training

OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model

Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

3D Hand Sequence Recovery from Real Blurry Images and Event Stream

Sapiens: Foundation for Human Vision Models

Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers

Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images

ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model

Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration

Segmentation-guided Layer-wise Image Vectorization with Gradient Fills

IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination

SAM-guided Graph Cut for 3D Instance Segmentation

GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting

VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation

Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion

TLControl: Trajectory and Language Control for Human Motion Synthesis

StructLDM: Structured Latent Diffusion for 3D Human Generation

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

High-Fidelity Modeling of Generalizable Wrinkle Deformation

COMPOSE: Comprehensive Portrait Shadow Editing

GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering

EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations

Learning Representations from Foundation Models for Domain Generalized Stereo Matching

Distractor-Free Novel View Synthesis via Exploiting Memorization Effect in Optimization

PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration

MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps

NeRF-XL: NeRF at Any Scale with Multi-GPU

NOVUM: Neural Object Volumes for Robust Object Classification

De-confounded Gaze Estimation

3D Hand Pose Estimation in Everyday Egocentric Images

Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

Controllable Human-Object Interaction Synthesis

Nymeria: A Massive Collection of Egocentric Multi-modal Human Motion in the Wild

Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation

GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model

Zero-Shot Multi-Object Scene Completion

Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene

Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation

DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

Personalized Video Relighting With an At-Home Light Stage

Six-Point Method for Multi-Camera Systems with Reduced Solution Space

UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation

Tuning-Free Image Customization with Image and Text Guidance

Stripe Observation Guided Inference Cost-free Attention Mechanism

MegaScenes: Scene-Level View Synthesis at Scale

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation

SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging

FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models

Non-parametric Sensor Noise Modeling and Synthesis

Learned Image Enhancement via Color Naming

Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation

Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection

Navigating Text-to-Image Generative Bias across Indic Languages

Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

LCM-Lookahead for Encoder-based Text-to-Image Personalization

COIN-Matting: Confounder Intervention for Image Matting

GaussReg: Fast 3D Registration with Gaussian Splatting

PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

Score Distillation Sampling with Learned Manifold Corrective

WAS: Dataset and Methods for Artistic Text Segmentation

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

BAMM: Bidirectional Autoregressive Motion Model

AdaDiffSR: Adaptive Region-aware Dynamic acceleration Diffusion Model for Real-World Image Super-Resolution

Image-adaptive 3D Lookup Tables for Real-time Image Enhancement with Bilateral Grids

Region-Aware Sequence-to-Sequence Learning for Hyperspectral Denoising

ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization

Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement

Parameterization-driven Neural Surface Reconstruction for Object-oriented Editing in Neural Rendering

Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition

Agent Attention: On the Integration of Softmax and Linear Attention

Fine-grained Dynamic Network for Generic Event Boundary Detection

Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution

Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-frequency Fusion and Uncertainty Correction

VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-temporal Side Network

GraspXL: Generating Grasping Motions for Diverse Objects at Scale

Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition

Interaction-centric Spatio-Temporal Context Reasoning for Multi-Person Video HOI Recognition

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Audio-visual Generalized Zero-shot Learning the Easy Way

Pre-trained Visual Dynamics Representations for Efficient Policy Learning

Reinforcement Learning Friendly Vision-Language Model for Minecraft

GRAPE: Generalizable and Robust Multi-view Facial Capture

R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

Multiscale Sliced Wasserstein Distances as Perceptual Color Difference Measures

SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

Structured-NeRF: Hierarchical Scene Graph with Neural Representation

MetaWeather: Few-Shot Weather-Degraded Image Restoration

Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency

MeshFeat: Multi-Resolution Features for Neural Fields on Meshes

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

DragAPart: Learning a Part-Level Motion Prior for Articulated Objects

Surface Reconstruction for 3D Gaussian Splatting via Local Structural Hints

PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion

Learning to Unlearn for Robust Machine Unlearning

Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers

Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning

MinD-3D: Reconstruct High-quality 3D objects in Human Brain

Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits

Visual Text Generation in the Wild

Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement

E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

A Unified Image Compression Method for Human Perception and Multiple Vision Tasks

Diffusion for Natural Image Matting

Eliminating Warping Shakes for Unsupervised Online Video Stitching

FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification

Facial Affective Behavior Analysis with Instruction Tuning

Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection

Learning Quantized Adaptive Conditions for Diffusion Models

Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation

Discovering Unwritten Visual Classifiers with Large Language Models

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

GenQ: Quantization in Low Data Regimes with Generative Synthetic Data

Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Hetecooper: Feature Collaboration Graph for Heterogeneous Collaborative Perception

DATENeRF: Depth-Aware Text-based Editing of NeRFs

Soft Prompt Generation for Domain Generalization

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Dynamic Data Selection for Efficient SSL via Coarse-to-Fine Refinement

On the Approximation Risk of Few-Shot Class-Incremental Learning

Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

Fast Encoding and Decoding for Implicit Video Representation

SAIR: Learning Semantic-aware Implicit Representation

Just a Hint: Point-Supervised Camouflaged Object Detection

Rethinking Normalization Layers for Domain Generalizable Person Re-identification

URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields

Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos

Efficient Cascaded Multiscale Adaptive Network for Image Restoration

ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Learning to Drive via Asymmetric Self-Play

Event-based Mosaicing Bundle Adjustment

Robust-Wide: Robust Watermarking against Instruction-driven Image Editing

FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model

MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos

V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation

OmniSat: Self-Supervised Modality Fusion for Earth Observation

WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

TriNeRFLet: A Wavelet Based Triplane NeRF Representation

Uncertainty-Driven Spectral Compressive Imaging with Spatial-Frequency Transformer

milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing

Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment

Toward Tiny and High-quality Facial Makeup with Data Amplify Learning

Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

Bidirectional Progressive Transformer for Interaction Intention Anticipation

Semantic-guided Robustness Tuning for Few-Shot Transfer Across Extreme Domain Shift

SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow

Domain Reduction Strategy for Non-Line-of-Sight Imaging

Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging

EcoMatcher: Efficient Clustering Oriented Matcher for Detector-free Image Matching

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation

C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

Vamos: Versatile Action Models for Video Understanding

A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation

DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation

AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation

ExMatch: Self-guided Exploitation for Semi-Supervised Learning with Scarce Labeled Samples

TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data

CaesarNeRF: Calibrated Semantic Representation for Few-Shot Generalizable Neural Rendering

CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems

Open-Vocabulary RGB-Thermal Semantic Segmentation

RaFE: Generative Radiance Fields Restoration

denoiSplit: a method for joint microscopy image splitting and unsupervised denoising

UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Efficient Neural Video Representation with Temporally Coherent Modulation

Contourlet Residual for Prompt Learning Enhanced Infrared Image Super-Resolution

Unsupervised Moving Object Segmentation with Atmospheric Turbulence

Modeling Label Correlations with Latent Context for Multi-Label Recognition

Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting

MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

WindPoly: Polygonal Mesh Reconstruction via Winding Numbers

AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation

Towards Reliable Advertising Image Generation Using Human Feedback

Distributionally Robust Loss for Long-Tailed Multi-Label Image Classification

Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

Inter-Class Topology Alignment for Efficient Black-Box Substitute Attacks

TurboEdit: Real-time text-based disentangled real image editing

The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers

Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples

Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery

Harmonizing knowledge Transfer in Neural Network with Unified Distillation

MoEAD: A Parameter-efficient Model for Multi-class Anomaly Detection

Clean & Compact: Efficient Data-Free Backdoor Defense with Model Compactness

Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation

EraseDraw: Learning to Insert Objects by Erasing Them from Images

Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation

Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval

Attention Prompting on Image for Large Vision-Language Models

ELSE: Efficient Deep Neural Network Inference through Line-based Sparsity Exploration

Personalized Privacy Protection Mask Against Unauthorized Facial Recognition

Content-Aware Radiance Fields: Aligning Model Complexity with Scene Intricacy Through Learned Bitwidth Quantization

A Cephalometric Landmark Regression Method based on Dual-encoder for High-resolution X-ray Image

HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation

Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation

Viewpoint textual inversion: discovering scene representations and 3D view control in 2D diffusion models

A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability

CipherDM: Secure Three-Party Inference for Diffusion Model Sampling

How to Train the Teacher Model for Effective Knowledge Distillation

LineFit: A Geometric Approach for Fitting Line Segments in Images

CompGS: Smaller and Faster Gaussian Splatting with Vector Quantization

Global Counterfactual Directions

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion

Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

ChEX: Interactive Localization and Region Description in Chest X-rays

MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

Grounding Image Matching in 3D with MASt3R

COSMU: Complete 3D human shape from monocular unconstrained images

LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation

Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels

Adaptive Human Trajectory Prediction via Latent Corridors

Generalizable Facial Expression Recognition

RS-NeRF: Neural Radiance Fields from Rolling Shutter Images

MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain

Do Generalised Classifiers really work on Human Drawn Sketches?

Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures

Grid-Attention: Enhancing Computational Efficiency of Large Vision Models without Fine-Tuning

Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection

GS2Mesh: Surface Reconstruction from Gaussian Splatting via Novel Stereo Views

Enhanced Motion Forecasting with Visual Relation Reasoning

Multi-scale Cross Distillation for Object Detection in Aerial Images

Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models

DSA: Discriminative Scatter Analysis for Early Smoke Segmentation

Long-term Temporal Context Gathering for Neural Video Compression

DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences

Continuous SO(3) Equivariant Convolution for 3D Point Cloud Analysis

SemanticHuman-HD: High Resolution Semantic disentangled 3D Human Generation

MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks

Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing

Norface: Improving Facial Expression Analysis by Identity Normalization

Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models

Bucketed Ranking-based Losses for Efficient Training of Object Detectors

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Test-Time Stain Adaptation with Diffusion Models for Histopathology Image Classification

Self-Supervised Underwater Caustics Removal and Descattering via Deep Monocular SLAM

Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition

The Nerfect Match: Exploring NeRF Features for Visual Localization

SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization

Image Manipulation Detection With Implicit Neural Representation and Limited Supervision

Adapting to Shifting Correlations with Unlabeled Data Calibration

SCAPE: A Simple and Strong Category-Agnostic Pose Estimator

FedRA: A Random Allocation Strategy for Federated Tuning to Unleash the Power of Heterogeneous Clients

Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model

Image-to-Lidar Relational Distillation for Autonomous Driving Data

EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval

Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

Learning-based Axial Video Motion Magnification

IGNORE: Information Gap-based False Negative Loss Rejection for Single Positive Multi-Label Learning

Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization

AD3: Introducing a score for Anomaly Detection Dataset Difficulty assessment using VIADUCT dataset

RegionDrag: Fast Region-Based Image Editing with Diffusion Models

FlowCon: Out-of-Distribution Detection using Flow-based Contrastive Learning

CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection

Siamese Vision Transformers are Scalable Audio-visual Learners

Rectify the Regression Bias in Long-Tailed Object Detection

Learning Neural Volumetric Pose Features for Camera Localization

Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection

Visual Relationship Transformation

Scene-aware Human Motion Forecasting via Mutual Distance Prediction

Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding

AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Visible and Clear: Finding Tiny Objects in Difference Map

Elysium: Exploring Object-level Perception in Videos through Semantic Integration Using MLLMs

Sequential Representation Learning via Static-Dynamic Conditional Disentanglement

Temporal-Mapping Photography for Event Cameras

RGBD GS-ICP SLAM

SAVE: Protagonist Diversification with Structure Agnostic Video Editing

Rethinking Data Bias: Dataset Copyright Protection via Embedding Class-wise Hidden Bias

Federated Learning with Local Openset Noisy Labels

Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching

End-to-End Rate-Distortion Optimized 3D Gaussian Representation

Multistain Pretraining for Slide Representation Learning in Pathology

Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning

Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation

GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes

3R-INN: How to be climate friendly while consuming/delivering videos?

ADMap: Anti-disturbance Framework for Vectorized HD Map Construction

GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields

OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

Self-supervised co-salient object detection via feature correspondences at multiple scales

Improving Knowledge Distillation via Regularizing Feature Direction and Norm

DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Reinforcement Learning Meets Visual Odometry

PoseSOR: Human Pose Can Guide Our Attention

Canonical Shape Projection is All You Need for 3D Few-shot Class Incremental Learning

SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models

Edge-Guided Fusion and Motion Augmentation for Event-Image Stereo

Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification

DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Dual-stage Hyperspectral Image Classification Model with Spectral Supertoken

Optimal Transport of Diverse Unsupervised Tasks for Robust Learning from Noisy Few-Shot Data

Non-Line-of-Sight Estimation of Fast Human Motion with Slow Scanning Imagers

HPE-Li: WiFi-enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation

LITA: Language Instructed Temporal-Localization Assistant

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Enhancing Tracking Robustness with Auxiliary Adversarial Defense Networks

BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow

MEVG: Multi-event Video Generation with Text-to-Video Models

Unsupervised Dense Prediction using Differentiable Normalized Cuts

RPBG: Towards Robust Neural Point-based Graphics in the Wild

uCAP: An Unsupervised Prompting Method for Vision-Language Models

CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Flexible Distribution Alignment: Towards Long-tailed Semi-supervised Learning with Proper Calibration

PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference

Human Motion Forecasting in Dynamic Domain Shifts: A Homeostatic Continual Test-time Adaptation Framework

Placing Objects in Context via Inpainting for Out-of-distribution Segmentation

Efficient Frequency-Domain Image Deraining with Contrastive Regularization

EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset

Deep Cost Ray Fusion for Sparse Depth Video Completion

Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation

SSL-Cleanse: Trojan Detection and Mitigation in Self-Supervised Learning

Large-Scale Multi-Hypotheses Cell Tracking Using Ultrametric Contours Maps

Prediction Exposes Your Face: Black-box Model Inversion via Prediction Alignment

Norma: A Noise Robust Memory-Augmented Framework for Whole Slide Image Classification

Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection

Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification

Noise-assisted Prompt Learning for Image Forgery Detection and Localization

Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

Dynamic Retraining-Updating Mean Teacher for Source-Free Object Detection

Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation

An accurate detection is not all you need to combat label noise in web-noisy datasets

Self-Supervised Video Copy Localization with Regional Token Representation

Crowd-SAM: SAM as a smart annotator for object detection in crowded scenes

Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses

Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

Zero-shot Object Counting with Good Exemplars

On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition

Synergy of Sight and Semantics: Visual Intention Understanding with CLIP

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians

Single-Photon 3D Imaging with Equi-Depth Photon Histograms

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

Anytime Continual Learning for Open Vocabulary Classification

Gated Temporal Diffusion for Stochastic Long-term Dense Anticipation

Domain Generalization of 3D Object Detection by Density-Resampling

Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients

O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy

Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Causal Subgraphs and Information Bottlenecks: Redefining OOD Robustness in Graph Neural Networks

Multi-Task Domain Adaptation for Language Grounding with 3D Objects

PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects

Watching it in Dark: A Target-aware Representation Learning Framework for High-Level Vision Tasks in Low Illumination

SINDER: Repairing the Singular Defects of DINOv2

Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians

Revisiting Domain-Adaptive Object Detection in Adverse Weather by the Generation and Composition of High-Quality Pseudo-Labels

Open-Set Recognition in the Age of Vision-Language Models

Two-Stage Active Learning for Efficient Temporal Action Segmentation

High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior

ARoFace: Alignment Robustness to Improve Low-quality Face Recognition

Un-EVIMO: Unsupervised Event-based Independent Motion Segmentation

Finding a needle in a haystack: A Black-Box Approach to Invisible Watermark Detection

Is Retain Set All You Need in Machine Unlearning? Restoring Performance of Unlearned Models with Out-Of-Distribution Images

CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation

Faceptor: A Generalist Model for Face Perception

Shapefusion: 3D localized human diffusion models

LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images

Training A Secure Model against Data-Free Model Extraction

VeCLIP: Improving CLIP Training via Visual-enriched Captions

Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models

Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation

Neural Metamorphosis

Superpixel-informed Implicit Neural Representation for Multi-Dimensional Data

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

When Fast Fourier Transform Meets Transformer for Image Restoration

DGD: Dynamic 3D Gaussians Distillation

OMR: Occlusion-Aware Memory-Based Refinement for Video Lane Detection

Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation

HyTAS: A Hyperspectral Image Transformer Architecture Search Benchmark and Analysis

Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild

Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation

Dropout Mixture Low-Rank Adaptation for Visual Parameters-Efficient Fine-Tuning

Learning Cross-hand Policies of High-DOF Reaching and Grasping

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion

Probabilistic Weather Forecasting with Deterministic Guidance-based Diffusion Model

Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution

Keypoint Promptable Re-Identification

When and How do negative prompts take effect?

Rethinking Features-Fused-Pyramid-Neck for Object Detection

Training A Small Emotional Vision Language Model for Visual Art Comprehension

Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes

Learned HDR Image Compression for Perceptually Optimal Storage and Display

FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos

On the Topology Awareness and Generalization Performance of Graph Neural Networks

Accelerating Image Super-Resolution Networks with Pixel-Level Classification

On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines

IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models

SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras

Compensation Sampling for Improved Convergence in Diffusion Models

Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting

SkyScenes: A Synthetic Dataset for Aerial Scene Understanding

RING-NeRF: Rethinking Inductive Biases for Versatile and Efficient Neural Fields

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks

KeypointDETR: An End-to-End 3D Keypoint Detector

CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion

Real-data-driven 2000 FPS Color Video from Mosaicked Chromatic Spikes

IAM-VFI: Interpolate Any Motion for Video Frame Interpolation with motion complexity map

Implicit Neural Models to Extract Heart Rate from Video

Self-Supervised Any-Point Tracking by Contrastive Random Walks

Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

Unsupervised, Online and On-The-Fly Anomaly Detection For Non-Stationary Image Distributions

Statewide Visual Geolocalization in the Wild

Deblurring 3D Gaussian Splatting

SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models

Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography

Layer-Wise Relevance Propagation with Conservation Property for ResNet

Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis

Sketch2Vox: Learning 3D Reconstruction from a Single Monocular Sketch Image

Few-shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt

Decomposition Betters Tracking Everything Everywhere

R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding

Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation

Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations

Controlling the World by Sleight of Hand

Pseudo-Labelling Should Be Aware of Disguising Channel Activations

Towards Architecture-Agnostic Untrained Networks Priors for Image Reconstruction with Frequency Regularization