FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis
Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection
WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation
Improving Text-guided Object Inpainting with Semantic Pre-inpainting
PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation
Minimalist Vision with Freeform Pixels
RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation
Binomial Self-compensation for Motion Error in Dynamic 3D Scanning
Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models
Region-Aware Sequence-to-Sequence Learning for Hyperspectral Denoising
On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy
ScanTalk: 3D Talking Heads from Unregistered Scans
Vamos: Versatile Action Models for Video Understanding
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models
SLIM: Spuriousness Mitigation with Minimal Human Annotations
OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation
Energy-induced Explicit quantification for Multi-modality MRI fusion
Beta-Tuned Timestep Diffusion Model
R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection
Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition
Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction
Agent3D-Zero: An Agent for Zero-shot 3D Understanding
EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification
Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning
Tri^2-plane: Thinking Head Avatar via Feature Pyramid
A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models
Open-set Domain Adaptation via Joint Error based Multi-class Positive and Unlabeled Learning
Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection
Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs
Kernel Diffusion: An Alternate Approach to Blind Deconvolution
S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis
Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation
SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution
Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
Agent Attention: On the Integration of Softmax and Linear Attention
SFPNet: Sparse Focal Point Network for Semantic Segmentation on General LiDAR Point Clouds
SOS: Segment Object System for Open-World Instance Segmentation With Object Priors
Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection
O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation
BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling
I2-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM
Drag Anything: Motion Control for Anything using Entity Representation
Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation
Foster Adaptivity and Balance in Learning with Noisy Labels
UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving
VF-NeRF: Viewshed Fields for Rigid NeRF Registration
Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization
Raising the Ceiling: Conflict-Free Local Feature Matching with Dynamic View Switching
Masked Angle-Aware Autoencoder for Remote Sensing Images
Using My Artistic Style? You Must Obtain My Authorization
VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-temporal Side Network
Real-time Holistic Robot Pose Estimation with Unknown States
Classification Matters: Improving Video Action Detection with Class-Specific Attention
Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors
VEON: Vocabulary-Enhanced Occupancy Prediction
PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving
Bridging the Gap Between Human Motion and Action Semantics via Kinematics Phrases
Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction
ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training
Emerging Property of Masked Token for Effective Pre-training
Generalizable Symbolic Optimizer Learning
Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot
E3V-K5: An Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance
RegionDrag: Fast Region-Based Image Editing with Diffusion Models
MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation
SENC: Handling Self-collision in Neural Cloth Simulation
GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding
Adaptive Multi-task Learning for Few-shot Object Detection
Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement
Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model
DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration
PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training
StyleCity: Large-Scale 3D Urban Scenes Stylization
Clean & Compact: Efficient Data-Free Backdoor Defense with Model Compactness
Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery
Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors
FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models
WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models
Few-shot Defect Image Generation based on Consistency Modeling
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors
Fundamental Matrix Estimation Using Relative Depths
Investigating Style Similarity in Diffusion Models
Topology-Preserving Downsampling of Binary Images
Multi-Person Pose Forecasting with Individual Interaction Perceptron and Prior Learning
GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Self-Cooperation Knowledge Distillation for Novel Class Discovery
H-V2X: A Large Scale Highway Dataset for BEV Perception
UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction
Insect Identification in the Wild: The AMI Dataset
Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation
Unsupervised Exposure Correction
Mitigating Background Shift in Class-Incremental Semantic Segmentation
SAIR: Learning Semantic-aware Implicit Representation
Unsupervised Multi-modal Medical Image Registration via Invertible Translation
Improving Video Segmentation via Dynamic Anchor Queries
Faceptor: A Generalist Model for Face Perception
Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation
Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression
Dual-stage Hyperspectral Image Classification Model with Spectral Supertoken
Open-Vocabulary Camouflaged Object Segmentation
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
PFGS: High Fidelity Point Cloud Rendering via Feature Splatting
Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning
Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems
SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting
High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs
Dependency-aware Differentiable Neural Architecture Search
PQ-SAM: Post-training Quantization for Segment Anything Model
LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)
ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model
Pathformer3D: A 3D Scanpath Transformer for 360° Images
Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation
Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection
EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models
Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution
Continual Learning for Remote Physiological Measurement: Minimize Forgetting and Simplify Inference
Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context
Optimal Transport of Diverse Unsupervised Tasks for Robust Learning from Noisy Few-Shot Data
Gaussian in the wild: 3D Gaussian Splatting for Unconstrained Image Collections
Video Editing via Factorized Diffusion Distillation
Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy
SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark
SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer
MANIKIN: Biomechanically Accurate Neural Inverse Kinematics for Human Motion Estimation
Enriching Information and Preserving Semantic Congruence in Expanding Curvilinear Object Segmentation Datasets
Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation
Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation
Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning
Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
Fast Registration of Photorealistic Avatars for VR Facial Animation
DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation
Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion
Crowd-SAM: SAM as a smart annotator for object detection in crowded scenes
EAFormer: Scene Text Segmentation with Edge-Aware Transformers
Improving Zero-Shot Generalization for CLIP with Variational Adapter
On the Evaluation Consistency of Attribution-based Explanations
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
Superpixel-informed Implicit Neural Representation for Multi-Dimensional Data
See and Think: Embodied Agent in Virtual Environment
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
3D Single-object Tracking in Point Clouds with High Temporal Variation
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions
GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection
Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs
Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation
WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting
DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Click-Gaussian: Interactive Segmentation to Any 3D Gaussians
HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting
Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding
GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections
EcoMatcher: Efficient Clustering Oriented Matcher for Detector-free Image Matching
Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models
MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving
FreestyleRet: Retrieving Images from Style-Diversified Queries
CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts
PolyRoom: Room-aware Transformer for Floorplan Reconstruction
Straightforward Layer-wise Pruning for More Efficient Visual Adaptation
Test-Time Stain Adaptation with Diffusion Models for Histopathology Image Classification
ChEX: Interactive Localization and Region Description in Chest X-rays
DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion
LongVLM: Efficient Long Video Understanding via Large Language Models
Zero-shot Object Counting with Good Exemplars
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
View-Consistent 3D Editing with Gaussian Splatting
Weighted Ensemble Models Are Strong Continual Learners
Retargeting Visual Data with Deformation Fields
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
PoseSOR: Human Pose Can Guide Our Attention
STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians
Similarity of Neural Architectures using Adversarial Attack Transferability
ParCo: Part-Coordinating Text-to-Motion Synthesis
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model
Rotary Position Embedding for Vision Transformer
Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge
DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment
Certifiably Robust Image Watermark
Imaging with Confidence: Uncertainty Quantification for High-dimensional Undersampled MR Images
Sur^2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images
Stable Preference: Redefining training paradigm of human preference model for Text-to-Image Synthesis
Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation
Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking
Idling Neurons, Appropriately Lenient Workload During Fine-tuning Leads to Better Generalization
Stitched ViTs are Flexible Vision Backbones
Sketch2Vox: Learning 3D Reconstruction from a Single Monocular Sketch Image
Labeled Data Selection for Category Discovery
MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset
Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation
Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos
Tiny Models are the Computational Saver for Large Models
An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes
OneRestore: A Universal Restoration Framework for Composite Degradation
Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning
SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation
TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data
Self-Training Room Layout via Geometry-aware Ray-casting
IAM-VFI: Interpolate Any Motion for Video Frame Interpolation with motion complexity map
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation
Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
Multi-Modal Video Dialog State Tracking in the Wild
LiDAR-Event Stereo Fusion with Hallucinations
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities
HiFi-123: Towards High-fidelity One Image to 3D Content Generation
Efficient Frequency-Domain Image Deraining with Contrastive Regularization
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars
R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
Bayesian Self-Training for Semi-Supervised 3D Segmentation
DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception
Reinforcement Learning Friendly Vision-Language Model for Minecraft
Parameterized Quasi-Physical Simulators for Dexterous Manipulations Transfer
A Comparative Study of Image Restoration Networks for General Backbone Network Design
DNI: Dilutional Noise Initialization for Diffusion Video Editing
Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence
GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
All You Need is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition
Diverse Text-to-3D Synthesis with Augmented Text Embedding
Camera Calibration using a Collimator System
Grounding Image Matching in 3D with MASt3R
You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
Timestep-Aware Correction for Quantized Diffusion Models
Easing 3D Pattern Reasoning with Side-view Features for Semantic Scene Completion
Personalized Federated Domain-Incremental Learning based on Adaptive Knowledge Matching
Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth
Online Zero-Shot Classification with CLIP
Prediction Exposes Your Face: Black-box Model Inversion via Prediction Alignment
RecurrentBEV: A Long-term Temporal Fusion Framework for Multi-view 3D Detection
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning
Multi-scale Cross Distillation for Object Detection in Aerial Images
SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
Multi-Label Cluster Discrimination for Visual Representation Learning
OMR: Occlusion-Aware Memory-Based Refinement for Video Lane Detection
Appearance-based Refinement for Object-Centric Motion Segmentation
ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization
Human-in-the-Loop Visual Re-ID for Population Size Estimation
PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation
CrossScore: A Multi-View Approach to Image Evaluation and Scoring
SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection
Length-Aware Motion Synthesis via Latent Diffusion
Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection
M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions
E.T. the Exceptional Trajectory: Text-to-camera-trajectory generation with character awareness
Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance
Attention Decomposition for Cross-Domain Semantic Segmentation
Retrieval Robust to Object Motion Blur
ReCON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories
MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering
Towards High-Quality 3D Motion Transfer with Realistic Apparel Animation
Visual Grounding for Object-Level Generalization in Reinforcement Learning
Revisit Anything: Visual Place Recognition via Image Segment Retrieval
DoubleTake: Geometry Guided Depth Estimation
Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts
Towards Robust Full Low-bit Quantization of Super Resolution Networks
WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing
SINDER: Repairing the Singular Defects of DINOv2
Unsqueeze [CLS] Bottleneck to Learn Rich Representations
Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation
Gaze Target Detection Based on Head-Local-Global Coordination
Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection
PreLAR: World Model Pre-training with Learnable Action Representation
MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
MRSP: Learn Multi-Representations of Single Primitive for Compositional Zero-Shot Learning
ProMerge: Prompt and Merge for Unsupervised Instance Segmentation
ExMatch: Self-guided Exploitation for Semi-Supervised Learning with Scarce Labeled Samples
AnimateMe: 4D Facial Expressions via Diffusion Models
Learning Representations from Foundation Models for Domain Generalized Stereo Matching
Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining
Probabilistic Image-Driven Traffic Modeling via Remote Sensing
On Spectral Properties of Gradient-based Explanation Methods
Customized Generation Reimagined: Fidelity and Editability Harmonized
ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting
DataDream: Few-shot Guided Dataset Generation
MEVG: Multi-event Video Generation with Text-to-Video Models
The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations
PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects
4D Contrastive Superflows are Dense 3D Representation Learners
Syn-to-Real Domain Adaptation for Point Cloud Completion via Part-based Approach
Frontier-enhanced Topological Memory with Improved Exploration Awareness for Embodied Visual Navigation
Online Temporal Action Localization with Memory-Augmented Transformer
Compress3D: a Compressed Latent Space for 3D Generation from a Single Image
GenRC: Generative 3D Room Completion from Sparse Image Collections
Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion
COMO: Compact Mapping and Odometry
COMPOSE: Comprehensive Portrait Shadow Editing
Open Vocabulary Multi-Label Video Classification
SUMix: Mixup with Semantic and Uncertain Information
Continuous SO(3) Equivariant Convolution for 3D Point Cloud Analysis
Training A Small Emotional Vision Language Model for Visual Art Comprehension
DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior
SignGen: End-to-End Sign Language Video Generation with Latent Diffusion
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation
A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks
Event-Aided Time-To-Collision Estimation for Autonomous Driving
CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs
VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space
Prompting Future Driven Diffusion Model for Hand Motion Prediction
HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts
Attention Beats Linear for Fast Implicit Neural Representation Generation
Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation
Unsupervised Dense Prediction using Differentiable Normalized Cuts
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
DomainFusion: Generalizing To Unseen Domains with Latent Diffusion Models
LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation
Image-adaptive 3D Lookup Tables for Real-time Image Enhancement with Bilateral Grids
N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
Reinforcement Learning Meets Visual Odometry
UniFS: Universal Few-shot Instance Perception with Point Representations
Long-term Temporal Context Gathering for Neural Video Compression
On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now
POA: Pre-training Once for Models of All Sizes
Physically Plausible Color Correction for Neural Radiance Fields
SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning
Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective
CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks
Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation
Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples
EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural Networks
Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework
Fully Sparse 3D Occupancy Prediction
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer
Improving Medical Multi-modal Contrastive Learning with Expert Annotations
OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks
TCC-Det: Temporarily consistent cues for weakly-supervised 3D detection
LPViT: Low-Power Semi-structured Pruning for Vision Transformers
DQ-DETR: DETR with Dynamic Query for Tiny Object Detection
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm
Contourlet Residual for Prompt Learning Enhanced Infrared Image Super-Resolution
MLPHand: Real Time Multi-View 3D Hand Reconstruction via MLP Modeling
Pre-trained Visual Dynamics Representations for Efficient Policy Learning
Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection
Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models
TrafficNight: An Aerial Multimodal Benchmark For Nighttime Vehicle Surveillance
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Single-Mask Inpainting for Voxel-based Neural Radiance Fields
FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation
Fast Sprite Decomposition from Animated Graphics
Responsible Visual Editing
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
Let the Avatar Talk using Texts without Paired Training Data
Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph
Real Appearance Modeling for More General Deepfake Detection
LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark
OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
SEED: A Simple and Effective 3D DETR in Point Clouds
Edge-Guided Fusion and Motion Augmentation for Event-Image Stereo
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations
SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization
Exploring Guided Sampling of Conditional GANs
Rethinking Features-Fused-Pyramid-Neck for Object Detection
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Domain-adaptive Video Deblurring via Test-time Blurring
AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors
Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs
CoReS: Orchestrating the Dance of Reasoning and Segmentation
Computing the Lipschitz constant needed for fast scene recovery from CASSI measurements
Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-frequency Fusion and Uncertainty Correction
ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion
The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation
Learning Anomalies with Normality Prior for Unsupervised Video Anomaly Detection
OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection
M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation
FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation
Neural Volumetric World Models for Autonomous Driving
Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment
Enhancing Tracking Robustness with Auxiliary Adversarial Defense Networks
VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation
PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-based Motion Capture
CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control
XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution
VideoMamba: State Space Model for Efficient Video Understanding
QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images
Visual Text Generation in the Wild
ActionVOS: Actions as Prompts for Video Object Segmentation
Agglomerative Token Clustering
A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization
CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
iMatching: Imperative Correspondence Learning
RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency
Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection
Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses
Texture-GS: Disentangle the Geometry and Texture for 3D Gaussian Splatting Editing
CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation
Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds
Domain Shifting: A Generalized Solution for Heterogeneous Cross-Modality Person Re-Identification
Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging
Diffusion Bridges for 3D Point Cloud Denoising
Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing
A Graph-Based Approach for Category-Agnostic Pose Estimation
Random Walk on Pixel Manifolds for Anomaly Segmentation of Complex Driving Scenes
Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning
PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification
Towards a Density Preserving Objective Function for Learning on Point Sets
Unleashing the Power of Prompt-driven Nucleus Instance Segmentation
Attention Prompting on Image for Large Vision-Language Models
Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
Emergent Visual-Semantic Hierarchies in Image-Text Representations
Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation
3D Hand Sequence Recovery from Real Blurry Images and Event Stream
Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization
VideoStudio: Generating Consistent-Content and Multi-Scene Videos
Uncertainty-Driven Spectral Compressive Imaging with Spatial-Frequency Transformer
Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
Object-Oriented Anchoring and Modal Alignment in Multimodal Learning
ControlCap: Controllable Region-level Captioning
REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation
SemanticHuman-HD: High Resolution Semantic disentangled 3D Human Generation
Pose Guided Fine-Grained Sign Language Video Generation
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution
DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation
Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable Repainting
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Towards Reliable Advertising Image Generation Using Human Feedback
Merlin: Empowering Multimodal LLMs with Foresight Minds
RangeLDM: Fast Realistic LiDAR Point Cloud Generation
ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images
Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model
Visual Alignment Pre-training for Sign Language Translation
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
Bidirectional Progressive Transformer for Interaction Intention Anticipation
GVGEN: Text-to-3D Generation with Volumetric Representation
HyTAS: A Hyperspectral Image Transformer Architecture Search Benchmark and Analysis
VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation
Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model
Exploring Vulnerabilities in Spiking Neural Networks: Direct Adversarial Attacks on Raw Event Data
EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models
Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
Panel-Specific Degradation Representation for Raw Under-Display Camera Image Restoration
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation
Realistic Human Motion Generation with Cross-Diffusion Models
Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching
Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression
FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models
Learning Natural Consistency Representation for Face Forgery Video Detection
Cascade Prompt Learning for Visual-Language Model Adaptation
Dynamic Neural Radiance Field From Defocused Monocular Video
Spatially-Variant Degradation Model for Dataset-free Super-resolution
Dual-Rain: Video Rain Removal using Assertive and Gentle Teachers
Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation
MeshSegmenter: Zero-Shot Mesh Segmentation via Texture Synthesis
Spatial-Temporal Multi-level Association for Video Object Segmentation
A Cephalometric Landmark Regression Method based on Dual-encoder for High-resolution X-ray Image
EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere
Stream Query Denoising for Vectorized HD-Map Construction
Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack
Keypoint Promptable Re-Identification
RealViformer: Investigating Attention for Real-World Video Super-Resolution
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks
On the Approximation Risk of Few-Shot Class-Incremental Learning
OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model
Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution
CaesarNeRF: Calibrated Semantic Representation for Few-Shot Generalizable Neural Rendering
U-COPE: Taking a Further Step to Universal 9D Category-level Object Pose Estimation
RoadPainter: Points Are Ideal Navigators for Topology transformER
Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition
ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation
SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing
GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image
AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation
MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo
Learning 3D-aware GANs from Unposed Images with Template Feature Field
Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation
Instant Uncertainty Calibration of NeRFs Using a Meta-Calibrator
RS-NeRF: Neural Radiance Fields from Rolling Shutter Images
PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation
3DEgo: 3D Editing on the Go!
FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior
Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling
LivePhoto: Real Image Animation with Text-guided Motion Control
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
Uncertainty-aware sign language video retrieval with probability distribution modeling
WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation
Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation
JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation
Unveiling Privacy Risks in Stochastic Neural Networks Training: Effective Image Reconstruction from Gradients
Progressive Proxy Anchor Propagation for Unsupervised Semantic Segmentation
InterFusion: Text-Driven Generation of 3D Human-Object Interaction
LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models
BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering
Domain Generalization of 3D Object Detection by Density-Resampling
ELSE: Efficient Deep Neural Network Inference through Line-based Sparsity Exploration
GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing
BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification
Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data
SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking
Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework
DreamReward: Aligning Human Preference in Text-to-3D Generation
Deblurring 3D Gaussian Splatting
Adversarial Robustification via Text-to-Image Diffusion Models
ConGeo: Robust Cross-view Geo-localization across Ground View Variations
Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth
Learning Camouflaged Object Detection from Noisy Pseudo Label
Motion Keyframe Interpolation for Any Human Skeleton using Point Cloud-based Human Motion Data Homogenisation
Prompt-Based Test-Time Real Image Dehazing: A Novel Pipeline
CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection
Animate Your Motion: Turning Still Images into Dynamic Videos
Language-Image Pre-training with Long Captions
DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects
Visible and Clear: Finding Tiny Objects in Difference Map
AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling
COSMU: Complete 3D human shape from monocular unconstrained images
Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images
GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval
Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection
When Fast Fourier Transform Meets Transformer for Image Restoration
Revisiting Domain-Adaptive Object Detection in Adverse Weather by the Generation and Composition of High-Quality Pseudo-Labels
CipherDM: Secure Three-Party Inference for Diffusion Model Sampling
Dynamic Retraining-Updating Mean Teacher for Source-Free Object Detection
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control
EDformer: Transformer-Based Event Denoising Across Varied Noise Levels
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Interactive 3D Object Detection with Prompts
HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection
Temporally Consistent Stereo Matching
Good Teachers Explain: Explanation-Enhanced Knowledge Distillation
An Information Theoretical View for Out-Of-Distribution Detection
HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance
HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes
Any2Point: Empowering Any-modality Transformers for Efficient 3D Understanding
HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos
PALM: Predicting Actions through Language Models
PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography
DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks
Diff-Reg: Diffusion Model in Doubly Stochastic Matrix Space for Registration Problem
Online Vectorized HD Map Construction using Geometry
Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation
BRAVE: Broadening the visual encoding of vision-language models
ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion
RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting
Revisit Human-Scene Interaction via Space Occupancy
MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution
AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
Zero-Shot Image Feature Consensus with Deep Functional Maps
Event-based Head Pose Estimation: Benchmark and Method
Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning
Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception
Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation
Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit Networks
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation
Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting
Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation
Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training
TAPTR: Tracking Any Point with Transformers as Detection
A New Dataset and Framework for Real-World Blurred Images Super-Resolution
Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model
Learned Image Enhancement via Color Naming
latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction
Learned HDR Image Compression for Perceptually Optimal Storage and Display
Towards Scene Graph Anticipation
Diffusion for Natural Image Matting
DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing
Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding
General and Task-Oriented Video Segmentation
denoiSplit: a method for joint microscopy image splitting and unsupervised denoising
SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras
Class-Agnostic Object Counting with Text-to-Image Diffusion Model
ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention
CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation
MeshFeat: Multi-Resolution Features for Neural Fields on Meshes
Personalized Video Relighting With an At-Home Light Stage
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
Multimodal Label Relevance Ranking via Reinforcement Learning
CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning
Real-time 3D-aware Portrait Editing from a Single Image
HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning
Trackastra: Transformer-based cell tracking for live-cell microscopy
Audio-driven Talking Face Generation with Stabilized Synchronization Loss
A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging
Nonverbal Interaction Detection
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
Global-to-Pixel Regression for Human Mesh Recovery
Asymmetric Mask Scheme for Self-Supervised Real Image Denoising
Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
LEROjD: Lidar Extended Radar-Only Object Detection
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration
Parameterization-driven Neural Surface Reconstruction for Object-oriented Editing in Neural Rendering
Learning to Complement and to Defer to Multiple Users
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection
GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity
Scalable Group Choreography via Variational Phase Manifold Learning
Occupancy as Set of Points
VISA: Reasoning Video Object Segmentation via Large Language Model
TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models
Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation
SlotLifter: Slot-guided Feature Lifting for Learning Object-Centric Radiance Fields
Efficient Pre-training for Localized Instruction Generation of Procedural Videos
OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
A Fair Ranking and New Model for Panoptic Scene Graph Generation
GroundUp: Rapid Sketch-Based 3D City Massing
DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences
Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification
Enhanced Motion Forecasting with Visual Relation Reasoning
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers
Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation
Implicit Steganography Beyond the Constraints of Modality
VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving
Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning
Dynamic Data Selection for Efficient SSL via Coarse-to-Fine Refinement
Training-free Composite Scene Generation for Layout-to-Image Synthesis
GTMS: A Gradient-driven Tree-guided Mask-free Referring Image Segmentation Method
Online Continuous Generalized Category Discovery
Active Generation for Image Classification
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model
FedHARM: Harmonizing Model Architectural Diversity in Federated Learning
TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Autonomous Driving
Frequency-Spatial Entanglement Learning for Camouflaged Object Detection
RGBD GS-ICP SLAM
Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers
Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction
Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection
Deep Companion Learning: Enhancing Generalization Through Historical Consistency
3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing
Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection
Relation DETR: Exploring Explicit Position Relation Prior for Object Detection
Cocktail Universal Adversarial Attack on Deep Neural Networks
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
MotionChain: Conversational Motion Controllers via Multimodal Prompts
CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model
CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation
WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing
High-Fidelity 3D Textured Shapes Generation by Sparse Encoding and Adversarial Decoding
AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale
CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing
SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis
GRA: Detecting Oriented Objects through Group-wise Rotating and Attention
Open-Vocabulary RGB-Thermal Semantic Segmentation
Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors
Online Video Quality Enhancement with Spatial-Temporal Look-up Tables
Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels
Upper-body Hierarchical Graph for Skeleton Based Emotion Recognition in Assistive Driving
Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending
Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics
Training-Free Model Merging for Multi-target Domain Adaptation
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
MambaIR: A Simple Baseline for Image Restoration with State-Space Model
GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting
HARIVO: Harnessing Text-to-Image Models for Video Generation
WHAC: World-grounded Humans and Cameras
Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks
3DSA: Multi-View 3D Human Pose Estimation With 3D Space Attention Mechanisms
MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References
Towards Certifiably Robust Face Recognition
EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS
Unmasking Bias in Diffusion Model Training
HoloADMM: High-Quality Holographic Complex Field Recovery
CPM: Class-conditional Prompting Machine for Audio-visual Segmentation
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control
RING-NeRF: Rethinking Inductive Biases for Versatile and Efficient Neural Fields
Bayesian Detector Combination for Object Detection with Crowdsourced Annotations
Physics-informed Knowledge Transfer for Underwater Monocular Depth Estimation
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
Referring Atomic Video Action Recognition
CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning
CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos
VideoClusterNet: Self-Supervised and Adaptive Face Clustering for Videos
NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving
SuperGaussian: Repurposing Video Models for 3D Super Resolution
Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations
DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment
OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
Gradient-based Out-of-Distribution Detection
Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution
RPBG: Towards Robust Neural Point-based Graphics in the Wild
Generating 3D House Wireframes with Semantics
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Efficient Training of Spiking Neural Networks with Multi-Parallel Implicit Stream Architecture
Distractor-Free Novel View Synthesis via Exploiting Memorization Effect in Optimization
PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion
MoEAD: A Parameter-efficient Model for Multi-class Anomaly Detection
Decomposition Betters Tracking Everything Everywhere
MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation
Leveraging temporal contextualization for video action recognition
Flying with Photons: Rendering Novel Views of Propagating Light
Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction
SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians
RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes
Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks
FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection
Fine-grained Dynamic Network for Generic Event Boundary Detection
Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer
Towards Multimodal Sentiment Analysis Debiasing via Bias Purification
Efficient Vision Transformers with Partial Attention
SemTrack: A Large-scale Dataset for Semantic Tracking in the Wild
Statewide Visual Geolocalization in the Wild
Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition
STSP: Spatial-Temporal Subspace Projection for Video Class-incremental Learning
Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes
Progressive Pretext Task Learning for Human Trajectory Prediction
Learning to Make Keypoints Sub-Pixel Accurate
Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression
LayoutFlow: Flow Matching for Layout Generation
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection
Deep Online Probability Aggregation Clustering
Reliability in Semantic Segmentation: Can We Use Synthetic Data?
Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions
Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection
De-confounded Gaze Estimation
PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments
WindPoly: Polygonal Mesh Reconstruction via Winding Numbers
Monocular Occupancy Prediction for Scalable Indoor Scenes
PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts
PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators
CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field
Motion and Structure from Event-based Normal Flow
SCAPE: A Simple and Strong Category-Agnostic Pose Estimator
Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification
FLAT: Flux-aware Imperceptible Adversarial Attacks on 3D Point Clouds
A high-quality robust diffusion framework for corrupted dataset
Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts
Data Augmentation via Latent Diffusion for Saliency Prediction
DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer
Nuvo: Neural UV Mapping for Unruly 3D Representations
Region-Adaptive Transform with Segmentation Prior for Image Compression
A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control
KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence
UNIKD: UNcertainty-Filtered Incremental Knowledge Distillation for Neural Implicit Representation
An Explainable Vision Question Answer Model via Diffusion Chain-of-Thought
AnyHome: Open-Vocabulary Large-Scale Indoor Scene Generation with First-Person View Exploration
Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis
Rethinking Directional Parameterization in Neural Implicit Surface Reconstruction
Shape-guided Configuration-aware Learning for Endoscopic-image-based Pose Estimation of Flexible Robotic Instruments
Efficient Depth-Guided Urban View Synthesis
GRAPE: Generalizable and Robust Multi-view Facial Capture
Learning Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors
CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting
GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection
Flatness-aware Sequential Learning Generates Resilient Backdoors
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
Reprojection Errors as Prompts for Efficient Scene Coordinate Regression
LLM as Copilot for Coarse-grained Vision-and-Language Navigation
Recursive Visual Programming
Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery
PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation
TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling
DragVideo: Interactive Drag-style Video Editing
Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction
Diffusion Models as Data Mining Tools
ReMatching: Low-Resolution Representations for Scalable Shape Correspondence
Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling
High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior
Efficient Snapshot Spectral Imaging: Calibration-Free Parallel Structure with Aperture Diffraction Fusion
Tackling Structural Hallucination in Image Translation with Local Diffusion
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
Compositional Substitutivity of Visual Reasoning for Visual Question Answering
Bi-directional Contextual Attention for 3D Dense Captioning
Self-Supervised Video Desmoking for Laparoscopic Surgery
Ray-Distance Volume Rendering for Neural Scene Reconstruction
RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception
Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition
Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation
Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions
CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring
Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting
DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally
CountFormer: Multi-View Crowd Counting Transformer
Restoring Images in Adverse Weather Conditions via Histogram Transformer
Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation
HUMOS: Human Motion Model Conditioned on Body Shape
Delving into Adversarial Robustness on Document Tampering Localization
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
Quantization-Friendly Winograd Transformations for Convolutional Neural Networks
Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions
Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis
DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
TPA3D: Triplane Attention for Fast Text-to-3D Generation
Learning Equilibrium Transformation for Gamut Expansion and Color Restoration
Energy-Calibrated VAE with Test Time Free Lunch
RAW-Adapter: Adapting Pretrained Visual Model to Camera RAW Images
ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation
3x2: 3D Object Part Segmentation by 2D Semantic Correspondences
MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
Solving the inverse problem of microscopy deconvolution with a residual Beylkin-Coifman-Rokhlin neural network
DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays
BaSIC: BayesNet Structure Learning for Computational Scalable Neural Image Compression
SNeRV: Spectra-preserving Neural Representation for Video
Multiscale Graph Texture Network
Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors
Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance
MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection
Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
UMBRAE: Unified Multimodal Brain Decoding
Large-Scale Multi-Hypotheses Cell Tracking Using Ultrametric Contours Maps
Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models
Geometry Fidelity for Spherical Images
Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View
Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning
Benchmarking the Robustness of Cross-view Geo-localization Models
EgoLifter: Open-world 3D Segmentation for Egocentric Perception
Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction
Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds
Diffusion Model is a Good Pose Estimator from 3D RF-Vision
Object-Aware NIR-to-Visible Translation
CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization
Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction
Enhancing Vectorized Map Perception with Historical Rasterized Maps
Gated Temporal Diffusion for Stochastic Long-term Dense Anticipation
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders
Event-Based Motion Magnification
TimeLens-XL: Real-time Event-based Video Frame Interpolation with Large Motion
VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions
M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models
Editable Image Elements for Controllable Synthesis
TurboEdit: Real-time text-based disentangled real image editing
One-stage Prompt-based Continual Learning
Camera-LiDAR Cross-modality Gait Recognition
nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding
Delving Deep into Engagement Prediction of Short Videos
Two-Stage Video Shadow Detection via Temporal-Spatial Adaption
Quanta Video Restoration
ARoFace: Alignment Robustness to Improve Low-quality Face Recognition
MTaDCS: Moving Trace and Feature Density-based Confidence Sample Selection under Label Noise
Enhancing Cross-Subject fMRI-to-Video Decoding with Global-Local Functional Alignment
Occluded Gait Recognition with Mixture of Experts: An Action Detection Perspective
Cut out the Middleman: Revisiting Pose-based Gait Recognition
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios
Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
SAH-SCI: Self-Supervised Adapter for Efficient Hyperspectral Snapshot Compressive Imaging
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery
LingoQA: Video Question Answering for Autonomous Driving
AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator
ViG-Bias: Visually Grounded Bias Discovery and Mitigation
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Synergy of Sight and Semantics: Visual Intention Understanding with CLIP
FlexAttention for Efficient High-Resolution Vision-Language Models
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation
Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery
InfMAE: A Foundation Model in The Infrared Modality
Unified Medical Image Pre-training in Language-Guided Common Semantic Space
Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models
Snuffy: Efficient Whole Slide Image Classifier
Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion
Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras
Enhanced Sparsification via Stimulative Training
RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception
Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense
Leveraging Imperfect Restoration for Data Availability Attack
Open-Set Recognition in the Age of Vision-Language Models
PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis
Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models
Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation
PartSTAD: 2D-to-3D Part Segmentation Task Adaptation
PartCraft: Crafting Creative Objects by Parts
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs
Evaluating the Adversarial Robustness of Semantic Segmentation: Trying Harder Pays Off
Self-supervised co-salient object detection via feature correspondences at multiple scales
Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization
Bucketed Ranking-based Losses for Efficient Training of Object Detectors
The Devil is in the Statistics: Mitigating and Exploiting Statistics Difference for Generalizable Semi-supervised Medical Image Segmentation
Semi-supervised Segmentation of Histopathology Images with Noise-Aware Topological Consistency
Multistain Pretraining for Slide Representation Learning in Pathology
A Rotation-invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images
Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data
HERGen: Elevating Radiology Report Generation with Longitudinal Data
Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization
MultiDelete for Multimodal Machine Unlearning
Robustness Preserving Fine-tuning using Neuron Importance
Understanding Multi-compositional learning in Vision and Language models via Category Theory
This Probably Looks Exactly Like That: An Invertible Prototypical Network
Confidence Self-Calibration for Multi-Label Class-Incremental Learning
Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation
Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection
Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition
UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework
Efficient NeRF Optimization - Not All Samples Remain Equally Hard
Differentiable Product Quantization for Memory Efficient Camera Relocalization
Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion
Semantic Residual Prompts for Continual Learning
Representation Enhancement-Stabilization: Reducing Bias-Variance of Domain Generalization
Encapsulating Knowledge in One Prompt
SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments
Compact 3D Scene Representation via Self-Organizing Gaussian Grids
Consistent 3D Line Mapping
Dataset Distillation by Automatic Training Trajectories
Graph Neural Network Causal Explanation via Neural Causal Models
Optimization-based Uncertainty Attribution Via Learning Informative Perturbations
CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction
SSL-Cleanse: Trojan Detection and Mitigation in Self-Supervised Learning
MobileNetV4: Universal Models for the Mobile Ecosystem
Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation
Adaptive Parametric Activation
CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
Dataset Enhancement with Instance-Level Augmentations
Efficient Bias Mitigation Without Privileged Information
Momentum Auxiliary Network for Supervised Local Learning
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition
Zero-Shot Detection of AI-Generated Images
Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss
Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
Improving Virtual Try-On with Garment-focused Diffusion Models
TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
Synthesizing Time-varying BRDFs via Latent Space
Resolving Scale Ambiguity in Multi-view 3D Reconstruction using Dual-Pixel Sensors
Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos
T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning
LISO: Lidar-only Self-Supervised 3D Object Detection
SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic
LaRa: Efficient Large-Baseline Radiance Fields
RaFE: Generative Radiance Fields Restoration
EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding
LLMCO4MR: LLMs-aided Neural Combinatorial Optimization for Ancient Manuscript Restoration from Fragments with Case Studies on Dunhuang
Neural graphics texture compression supporting random access
A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis
GS2Mesh: Surface Reconstruction from Gaussian Splatting via Novel Stereo Views
MVDD: Multi-View Depth Diffusion Models
Hypernetworks for Generalizable BRDF Representation
Non-Line-of-Sight Estimation of Fast Human Motion with Slow Scanning Imagers
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization
KeypointDETR: An End-to-End 3D Keypoint Detector
Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation
SceneTeller: Language-to-3D Scene Generation
InsMapper: Exploring Inner-instance Information for Vectorized HD Mapping
DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching
FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation
UMERegRobust – Universal Manifold Embedding Compatible Features for Robust Point Cloud Registration
Light-in-Flight for a World-in-Motion
FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation
Osmosis: RGBD Diffusion Prior for Underwater Image Restoration
Photon Inhibition for Energy-Efficient Single-Photon Imaging
Synchronization of Projective Transformations
GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring
LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System
EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation
Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?
Free-Viewpoint Video of Outdoor Sports Using a Drone
Learning Cross-hand Policies of High-DOF Reaching and Grasping
Improving Domain Generalization in Self-Supervised Monocular Depth Estimation via Stabilized Adversarial Training
Deep Cost Ray Fusion for Sparse Depth Video Completion
SparseRadNet: Sparse Perception Neural Network on Subsampled Radar Data
MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain
Spike-Temporal Latent Representation for Energy-Efficient Event-to-Video Reconstruction
FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion
Quality Assured: Rethinking Annotation Strategies in Imaging AI
Skeleton Recall Loss for Connectivity Conserving and Resource Efficient Segmentation of Thin Tubular Structures
Sparse Refinement for Efficient High-Resolution Semantic Segmentation
PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors
Unified Local-Cloud Decision-Making via Reinforcement Learning
MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction
Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving
LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow
Efficient Learning of Event-based Dense Representation using Hierarchical Memories with Adaptive Update
Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching
Understanding Physical Dynamics with Counterfactual World Modeling
Towards Real-world Event-guided Low-light Video Enhancement and Deblurring
AugDETR: Improving Multi-scale Learning for Detection Transformer
Accelerating Image Generation with Sub-path Linear Approximation Model
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition
DriveLM: Driving with Graph Visual Question Answering
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
Better Regression Makes Better Test-time Adaptive 3D Object Detection
Embodied Understanding of Driving Scenarios
Commonly Interesting Images
The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Zero-shot Text-guided Infinite Image Synthesis with LLM guidance
UGG: Unified Generative Grasping
Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control
Assessing Sample Quality via the Latent Space of Generative Models
DiffiT: Diffusion Vision Transformers for Image Generation
Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
UpFusion: Novel View Diffusion from Unposed Sparse View Observations
LoA-Trans: Enhancing Visual Grounding by Location-Aware Transformers
Robust Nearest Neighbors for Source-Free Domain Adaptation under Class Distribution Shift
Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems
Efficient Cascaded Multiscale Adaptive Network for Image Restoration
Taming Lookup Tables for Efficient Image Retouching
Identity-Consistent Diffusion Network for Grading Knee Osteoarthritis Progression in Radiographic Imaging
Neural Metamorphosis
Forbes: Face Obfuscation Rendering via Backpropagation Refinement Scheme
Generalizable Facial Expression Recognition
RICA^2: Rubric-Informed, Calibrated Assessment of Actions
Semantically Guided Representation Learning For Action Anticipation
Training-free Video Temporal Grounding using Large-scale Pre-trained Models
WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
GraspXL: Generating Grasping Motions for Diverse Objects at Scale
FedVAD: Enhancing Federated Video Anomaly Detection with GPT-Driven Semantic Distillation
Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach
Spectral Subsurface Scattering for Material Classification
MeshVPR: Citywide Visual Place Recognition Using 3D Meshes
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
Human Hair Reconstruction with Strand-Aligned 3D Gaussians
MetaAug: Meta-Data Augmentation for Post-Training Quantization
INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training
PromptFusion: Decoupling Stability and Plasticity for Continual Learning
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
DEAL: Disentangle and Localize Concept-level Explanations for VLMs
Instruction Tuning-free Visual Token Complement for Multimodal LLMs
IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models
LookupViT: Compressing visual information to a limited number of tokens
SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
Integration of Global and Local Representations for Fine-grained Cross-modal Alignment
Diffusion Models for Open-Vocabulary Segmentation
DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing
View Selection for 3D Captioning via Diffusion Ranking
WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
Source-Free Domain-Invariant Performance Prediction
MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos
Wavelength-Embedding-guided Filter-Array Transformer for Spectral Demosaicing
CARB-Net: Camera-Assisted Radar-Based Network for Vulnerable Road User Detection
Leveraging Thermal Modality to Enhance Reconstruction in Low-Light Conditions
The Sky's the Limit: Relightable Outdoor Scenes via a Sky-pixel Constrained Illumination Prior and Outside-In Visibility
RANRAC: Robust Neural Scene Representations via Random Ray Consensus
TAG: Text Prompt Augmentation for Zero-Shot Out-of-Distribution Detection
ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild
Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation
Grid-Attention: Enhancing Computational Efficiency of Large Vision Models without Fine-Tuning
Rethinking and Improving Visual Prompt Selection for In-Context Learning Segmentation Framework
Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation
McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction
SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection
EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models
Enhancing Source-Free Domain Adaptive Object Detection with Low-confidence Pseudo Label Distillation
MonoTTA: Fully Test-Time Adaptation for Monocular 3D Object Detection
ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation
GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation
Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector
Fairness-aware Vision Transformer via Debiased Self-Attention
Modality Translation for Object Detection Adaptation without forgetting prior knowledge
Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts
Semantic-guided Robustness Tuning for Few-Shot Transfer Across Extreme Domain Shift
FlowCon: Out-of-Distribution Detection using Flow-based Contrastive Learning
PixOOD: Pixel-Level Out-of-Distribution Detection
Distributionally Robust Loss for Long-Tailed Multi-Label Image Classification
Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data
CroMo-Mixup: Augmenting Cross-Model Representations for Continual Self-Supervised Learning
Disentangling Masked Autoencoders for Unsupervised Domain Generalization
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
Information Bottleneck Based Data Correction in Continual Learning
Markov Knowledge Distillation: Make Nasty Teachers trained by Self-undermining Knowledge Distillation Fully Distillable
FedRA: A Random Allocation Strategy for Federated Tuning to Unleash the Power of Heterogeneous Clients
SkyMask: Attack-agnostic Robust Federated Learning with Fine-grained Learnable Masks
Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset
Catastrophic Overfitting: A Potential Blessing in Disguise
LaWa: Using Latent Space for In-Generation Image Watermarking
Rethinking Data Bias: Dataset Copyright Protection via Embedding Class-wise Hidden Bias
Towards Model-Agnostic Dataset Condensation by Heterogeneous Models
VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description
Adaptive Correspondence Scoring for Unsupervised Medical Image Registration
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
Revisiting Calibration of Wide-Angle Radially Symmetric Cameras
Weakly-Supervised 3D Hand Reconstruction with Knowledge Prior and Uncertainty Guidance
SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images
CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos
FoundPose: Unseen Object Pose Estimation with Foundation Features
Continuous Memory Representation for Anomaly Detection
Differentiable Convex Polyhedra Optimization from Multi-view Images
HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation
3D Human Pose Estimation via Non-Causal Retentive Networks
Diffusion Reward: Learning Rewards via Conditional Video Diffusion
Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation
RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation
Stable Video Portraits
S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition
Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection
Depth-guided NeRF Training via Earth Mover’s Distance
Watch Your Steps: Local Image and Scene Editing by Text Instructions
TriNeRFLet: A Wavelet Based Triplane NeRF Representation
SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
Unifying 3D Vision-Language Understanding via Promptable Queries
SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction
Revising Densification in Gaussian Splatting
PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance
Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping
Scene-Conditional 3D Object Stylization and Composition
SMooDi: Stylized Motion Diffusion Model
ZeST: Zero-Shot Material Transfer from a Single Image
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
3D Congealing: 3D-Aware Image Alignment in the Wild
Learning Neural Deformation Representation for 4D Dynamic Shape Generation
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
Efficient Training with Denoised Neural Weights
GIVT: Generative Infinite-Vocabulary Transformers
Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians
AddMe: Zero-shot Group-photo Synthesis by Inserting People into Scenes
Neural Surface Detection for Unsigned Distance Fields
Transferable 3D Adversarial Shape Completion using Diffusion Models
JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention
Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos
MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices
GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence
MONTAGE: Monitoring Training for Attribution of Generative Diffusion Models
AdversariaLeak: External Information Leakage Attack Using Adversarial Samples on Face Recognition Systems
Rotated Orthographic Projection for Self-Supervised 3D Human Pose Estimation
Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding
Coarse-to-Fine Implicit Representation Learning for 3D Hand-Object Reconstruction from a Single RGB-D Image
Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather
Memory-Efficient Fine-Tuning for Quantized Diffusion Model
SeiT++: Masked Token Modeling Improves Storage-efficient Training
Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion
Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration
Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding
SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models
Formula-Supervised Visual-Geometric Pre-training
LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection
Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection
MMVR: Millimeter-wave Multi-View Radar Dataset and Benchmark for Indoor Perception
UAV First-Person Viewers Are Radiance Field Learners
Caltech Aerial RGB-Thermal Dataset in the Wild
V2X-Real: a Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception
CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction
DiffFAS: Face Anti-Spoofing via Generative Diffusion Models
PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation
Rethinking Few-shot Class-incremental Learning: Learning from Yourself
Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection
An Economic Framework for 6-DoF Grasp Detection
EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation
Direct Distillation between Different Domains
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Motion Aware Event Representation-driven Image Deblurring
Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions
Temporal-Mapping Photography for Event Cameras
COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing
Long-CLIP: Unlocking the Long-Text Capability of CLIP
RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF
FuseTeacher: Modality-fused Encoders are Strong Vision Supervisors
StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models
MesonGS: Post-training Compression of 3D Gaussians via Efficient Attribute Transformation
PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking
DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution
PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control
ReMamber: Referring Image Segmentation with Mamba Twister
Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization
IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection
HiEI: A Universal Framework for Generating High-quality Emerging Images from Natural Images
Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning
Continual Learning and Unknown Object Discovery in 3D Scenes via Self-Distillation
A Simple Background Augmentation Method for Object Detection with Diffusion Model
∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions
Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks
Compensation Sampling for Improved Convergence in Diffusion Models
Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter
Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction
Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction
Implicit Neural Models to Extract Heart Rate from Video
A Watermark-Conditioned Diffusion Model for IP Protection
Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures
Image Manipulation Detection With Implicit Neural Representation and Limited Supervision
PetFace: A Large-Scale Dataset and Benchmark for Animal Identification
NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation
Occlusion-Aware Seamless Segmentation
Progressive Classifier and Feature Extractor Adaptation for Unsupervised Domain Adaptation on Point Clouds
Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation
SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment
Self-Supervised Video Copy Localization with Regional Token Representation
Learning with Counterfactual Explanations for Radiology Report Generation
Placing Objects in Context via Inpainting for Out-of-distribution Segmentation
Elysium: Exploring Object-level Perception in Videos through Semantic Integration Using MLLMs
Pseudo-Embedding for Generalized Few-Shot Point Cloud Segmentation
Learning Diffusion Models for Multi-View Anomaly Detection
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Learning Representations of Satellite Images From Metadata Supervision
Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring
Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition
Wavelet Convolutions for Large Receptive Fields
Trainable Highly-expressive Activation Functions
SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images
V-IRL: Grounding Virtual Intelligence in Real Life
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Chains of Diffusion Models
Grounding Language Models for Visual Entity Recognition
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning
Non-Exemplar Domain Incremental Learning via Cross-Domain Concept Integration
Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities
Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models
Simple Unsupervised Knowledge Distillation With Space Similarity
InstructGIE: Towards Generalizable Image Editing
StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion
3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance
Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval
Towards Open-Ended Visual Recognition with Large Language Models
AFreeCA: Annotation-Free Counting for All
InstructIR: High-Quality Image Restoration Following Human Instructions
Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs
DSMix: Distortion-Induced Saliency Map Based Pre-training for No-Reference Image Quality Assessment
Dataset Growth
Dolphins: Multimodal Language Model for Driving
Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation
MO-EMT-NAS: Multi-Objective Continuous Transfer of Architectural Knowledge Between Tasks from Different Datasets
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
AttnZero: Efficient Attention Discovery for Vision Transformers
Auto-GAS: Automated Proxy Discovery for Training-free Generative Architecture Search
Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation
One-Shot Diffusion Mimicker for Handwritten Text Generation
KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
Prioritized Semantic Learning for Zero-shot Instance Navigation
Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos
Temporal Residual Jacobians for Rig-free Motion Transfer
Part2Object: Hierarchical Unsupervised 3D Instance Segmentation
Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond
Operational Open-Set Recognition and PostMax Refinement
AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking
Effective Lymph Nodes Detection in CT Scans Using Location Debiased Query Selection and Contrastive Query Representation in Transformer
TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection
Learning with Unmasked Tokens Drives Stronger Vision Learners
Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt
Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision
Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding
The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
FYI: Flip Your Images for Dataset Distillation
Contrastive Learning with Synthetic Positives
Tight and Efficient Upper Bound on Spectral Norm of Convolutional Layers
Data-to-Model Distillation: Data-Efficient Learning Framework
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment
FunQA: Towards Surprising Video Comprehension
Robust Calibration of Large Vision-Language Adapters
FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning
Benchmarking Spurious Bias in Few-Shot Image Classifiers
Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network
ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection
Adapting to Shifting Correlations with Unlabeled Data Calibration
Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels
De-Confusing Pseudo-Labels in Source-Free Domain Adaptation
Improving Unsupervised Domain Adaptation: A Pseudo-Candidate Set Approach
Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation
Feature Diversification and Adaptation for Federated Domain Generalization
Harmonizing Knowledge Transfer in Neural Network with Unified Distillation
PFedEdit: Personalized Federated Learning via Automated Model Editing
Layer-Wise Relevance Propagation with Conservation Property for ResNet
Training A Secure Model against Data-Free Model Extraction
Augmented Neural Fine-tuning for Efficient Backdoor Purification
MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition
Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information
UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation
Few-shot NeRF by Adaptive Rendering Loss Regularization
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
Weight Conditioning for Smooth Optimization of Neural Networks
Invertible Neural Warp for NeRF
Generating Human Interaction Motions in Scenes with Text Control
VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing
MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping
CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians
Towards Image Ambient Lighting Normalization
MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction
Efficient Neural Video Representation with Temporally Coherent Modulation
FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting
VersatileGaussian: Real-time Neural Rendering for Versatile Tasks using Gaussian Splatting
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction
Motion Mamba: Efficient and Long Sequence Motion Generation
ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation
Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance
Pyramid Diffusion for Fine 3D Large Scene Generation
Taming Latent Diffusion Model for Neural Radiance Field Inpainting
HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras
Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
POET: Prompt Offset Tuning for Continual Human Action Adaptation
NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
Modeling and Driving Human Body Soundfields through Acoustic Primitives
Kinetic Typography Diffusion Model
3R-INN: How to be climate friendly while consuming/delivering videos?
Rethinking Unsupervised Outlier Detection via Multiple Thresholding
Domain Reduction Strategy for Non-Line-of-Sight Imaging
Intrinsic Single-Image HDR Reconstruction
NeRF-XL: NeRF at Any Scale with Multi-GPU
Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering
Synthesizing Environment-Specific People in Photographs
GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views
Content-Aware Radiance Fields: Aligning Model Complexity with Scene Intricacy Through Learned Bitwidth Quantization
HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation
FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation
Collaborative Control for Geometry-Conditioned PBR Image Generation
Distilling Knowledge from Large-Scale Image Models for Object Detection
Text2LiDAR: Text-guided LiDAR Point Clouds Generation via Equirectangular Transformer
URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields
Reinforcement Learning via Auxiliary Task Distillation
NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
Concise Plane Arrangements for Low-Poly Surface and Volume Modelling
SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction
Revisit Self-supervision with Local Structure-from-Motion
Open-Set Biometrics: Beyond Good Closed-Set Models
Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry
CoTracker: It is Better to Track Together
On the Viability of Monocular Depth Pre-training for Semantic Segmentation
Weakly-supervised Camera Localization by Ground-to-satellite Image Registration
AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation
FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis
SemReg: Semantics Constrained Point Cloud Registration
NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation
Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation
CompGS: Smaller and Faster Gaussian Splatting with Vector Quantization
Latent Guard: a Safety Framework for Text-to-image Generation
SPIRE: Semantic Prompt-Driven Image Restoration
ControlLLM: Augment Language Models with Tools by Searching on Graphs
AWOL: Analysis WithOut synthesis using Language
GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns
Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching
Scalar Function Topology Divergence: Comparing Topology of 3D Objects
Finding Visual Task Vectors
EgoPet: Egomotion and Interaction Data from an Animal's Perspective
Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement
Frugal 3D Point Cloud Model Training via Progressive Near Point Filtering and Fused Aggregation
ProtoComp: Diverse Point Cloud Completion with Controllable Prototype
DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation
3D Small Object Detection with Dynamic Spatial Pruning
Adaptive Annealing for Robust Averaging
Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization
DySeT: a Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction
Learning Pseudo 3D Guidance for View-consistent Texturing with 2D Diffusion
LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation
Physical-Based Event Camera Simulator
Learning Neural Volumetric Pose Features for Camera Localization
ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions
ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation
OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos
LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception
MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection
Volumetric Rendering with Baked Quadrature Fields
Exact Diffusion Inversion via Bidirectional Integration Approximation
Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation
Lagrangian Hashing for Compressed Neural Field Representations
PointNeRF++: A multi-scale, point-based Neural Radiance Field
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
Fast Encoding and Decoding for Implicit Video Representation
LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation
Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models
Do text-free diffusion models learn discriminative visual representations?
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
High-Fidelity and Transferable NeRF Editing by Frequency Decomposition
Implicit Style-Content Separation using B-LoRA
MyVLM: Personalizing VLMs for User-Specific Queries
Curved Diffusion: A Generative Model With Optical Geometry Control
Lazy Diffusion Transformer for Interactive Image Editing
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion
Continuity Preserving Online CenterLine Graph Learning
GAMMA-FACE: GAussian Mixture Models Amend Diffusion Models for Bias Mitigation in Face Images
AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation
Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs
FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
Rate-Distortion-Cognition Controllable Versatile Neural Image Compression
Closed-Loop Unsupervised Representation Disentanglement with $\beta$-VAE Distillation and Diffusion Probabilistic Feedback
Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
Blind image deblurring with noise-robust kernel estimation
Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems
Self-Guided Generation of Minority Samples Using Diffusion Models
OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation
DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation
CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models
ZigMa: A DiT-style Zigzag Mamba Diffusion Model
FMBoost: Boosting Latent Diffusion with Flow Matching
WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians
BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events
Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
Towards compact reversible image representations for neural style transfer
InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser
ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
When and How do negative prompts take effect?
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models
WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model
Global Counterfactual Directions
AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation
Point-supervised Panoptic Segmentation via Estimating Pseudo Labels from Learnable Distance
General Geometry-aware Weakly Supervised 3D Object Detection
OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers
CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians
Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling
Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models
UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt
Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation
Regularizing Dynamic Radiance Fields with Kinematic Fields
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image Denoisers
Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients
Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network
Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems
NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization
Face Reconstruction Transfer Attack as Out-of-Distribution Generalization
Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing
Norface: Improving Facial Expression Analysis by Identity Normalization
Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals
VSViG: Real-time Video-based Seizure Detection via Skeleton-based Spatiotemporal ViG
Elucidating the Hierarchical Nature of Behavior with Masked Autoencoders
Bayesian Evidential Deep Learning for Online Action Detection
Event Camera Data Dense Pre-training
Unsupervised Moving Object Segmentation with Atmospheric Turbulence
Beyond MOT: Semantic Multi-Object Tracking
Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition
Rethinking Image-to-Video Adaptation: An Object-centric Perspective
Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
PointLLM: Empowering Large Language Models to Understand Point Clouds
StableDrag: Stable Dragging for Point-based Image Editing
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes
ST-LLM: Large Language Models Are Effective Temporal Learners
Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition
Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data
Navigation Instruction Generation with BEV Perception and Large Language Models
UniProcessor: A Text-induced Unified Low-level Image Processor
Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation
Tokenize Anything via Prompting
Think before Placement: Common Sense Enhanced Transformer for Object Placement
HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs
T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models
Learning to Adapt SAM for Segmenting Cross-domain Point Clouds
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
MultiGen: Zero-shot Image Generation from Multi-modal Prompts
VeCLIP: Improving CLIP Training via Visual-enriched Captions
MC-PanDA: Mask Confidence for Panoptic Domain Adaptation
Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models
GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering
Look Hear: Gaze Prediction for Speech-directed Human Attention
Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing
You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
I Can't Believe It's Not Scene Flow!
FlashTex: Fast Relightable Mesh Texturing with LightControlNet
Evaluating Text-to-Visual Generation with Image-to-Text Generation
DOCCI: Descriptions of Connected and Contrasting Images
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting
3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views
Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation
Arc2Face: A Foundation Model for ID-Consistent Human Faces
SAGS: Structure-Aware 3D Gaussian Splatting
PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery
FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions
MemBN: Robust Test-Time Adaptation via Batch Norm with Statistics Memory
Towards More Practical Group Activity Detection: A New Benchmark and Model
Distilling Diffusion Models into Conditional GANs
On-the-fly Category Discovery for LiDAR Semantic Segmentation
Risk-Aware Self-Consistent Imitation Learning for Trajectory Planning in Autonomous Driving
OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation
Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights
Watching it in Dark: A Target-aware Representation Learning Framework for High-Level Vision Tasks in Low Illumination
Gradient-Aware for Class-Imbalanced Semi-supervised Medical Image Segmentation
DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation
AMD: Automatic Multi-step Distillation of Large-scale Vision Models
Linking in Style: Understanding learned features in deep learning models
On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition
Robust Multimodal Learning via Representation Decoupling
Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning
Strike a Balance in Continual Panoptic Segmentation
IGNORE: Information Gap-based False Negative Loss Rejection for Single Positive Multi-Label Learning
Learning to Distinguish Samples for Generalized Category Discovery
HVCLIP: High-dimensional Vector in CLIP for Unsupervised Domain Adaptation
DiffClass: Diffusion-Based Class Incremental Learning
Adversarial Prompt Tuning for Vision-Language Models
Is Retain Set All You Need in Machine Unlearning? Restoring Performance of Unlearned Models with Out-Of-Distribution Images
How to Train the Teacher Model for Effective Knowledge Distillation
Dataset Quantization with Active Learning based Adaptive Sampling
Local and Global Flatness for Federated Domain Generalization
Auto-DAS: Automated Proxy Discovery for Training-free Distillation-aware Architecture Search
Cross-Input Certified Training for Universal Perturbations
Interpretability-Guided Test-Time Adversarial Defense
Self-Supervised Representation Learning for Adversarial Attack Detection
On the Vulnerability of Skip Connections to Model Inversion Attacks
Non-transferable Pruning
CONDA: Condensed Deep Association Learning for Co-Salient Object Detection
PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning
SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection
Just a Hint: Point-Supervised Camouflaged Object Detection
Controllable Navigation Instruction Generation with Chain of Thought Prompting
A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability
PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition
COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation
Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views
Learning Multimodal Latent Generative Models with Energy-Based Prior
Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization
FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation
IRGen: Generative Modeling for Image Retrieval
Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer
3D Gaussian Parametric Head Model
I-MedSAM: Implicit Medical Image Segmentation with Segment Anything
Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging
MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models
A Probability-guided Sampler for Neural Implicit Surface Rendering
G2fR: Frequency Regularization in Grid-based Feature Encoding Neural Radiance Fields
InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction
DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators
Vista3D: unravel the 3d darkside of a single image
Isomorphic Pruning for Vision Models
GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval
Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap
Adaptive Multi-head Contrastive Learning
RCS-Prompt: Learning Prompt to Rearrange Class Space for Prompt-based Continual Learning
Rethinking Normalization Layers for Domain Generalizable Person Re-identification
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery
A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties
Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling
MagMax: Leveraging Model Merging for Seamless Continual Learning
Revisiting Supervision for Continual Representation Learning
Exemplar-free Continual Representation Learning via Learnable Drift Compensation
Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting
Debiasing surgeon: fantastic weights and how to find them
Privacy-Preserving Adaptive Re-Identification without Image Transfer
An Adaptive Screen-Space Meshing Approach for Normal Integration
GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time
Rasterized Edge Gradients: Handling Discontinuities Differentially
Hierarchical Separable Video Transformer for Snapshot Compressive Imaging
SRPose: Two-view Relative Pose Estimation with Sparse Keypoints
Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis
MagicMirror: Fast and High-Quality Avatar Generation with Constrained Search Space
3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation
Viewpoint textual inversion: discovering scene representations and 3D view control in 2D diffusion models
Sapiens: Foundation for Human Vision Models
Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture
Expressive Whole-Body 3D Gaussian Avatar
Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model
DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly
Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images
SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers
MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views
Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction
Disentangled Generation and Aggregation for Robust Radiance Fields
Dual-Camera Smooth Zoom on Mobile Phones
Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences
Multi-modal Crowd Counting via a Broker Modality
SelfGeo: Self-supervised and Geodesic-consistent Estimation of Keypoints on Deformable Shapes
Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance
Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians
DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction
Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination
SAM-guided Graph Cut for 3D Instance Segmentation
CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches
GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation
Text to Layer-wise 3D Clothed Human Generation
DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose
Learned Neural Physics Simulation for Articulated 3D Human Pose Reconstruction
HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting
Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion
TLControl: Trajectory and Language Control for Human Motion Synthesis
TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds
EINet: Point Cloud Completion via Extrapolation and Interpolation
CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection
EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Human Motion Generation
Disentangled Clothed Avatar Generation from Text Descriptions
StructLDM: Structured Latent Diffusion for 3D Human Generation
FreeInit: Bridging Initialization Gap in Video Diffusion Models
ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance
TC4D: Trajectory-Conditioned Text-to-4D Generation
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
Canonical Shape Projection is All You Need for 3D Few-shot Class Incremental Learning
DGD: Dynamic 3D Gaussians Distillation
SHIC: Shape-Image Correspondences with no Keypoint Supervision
LineFit: A Geometric Approach for Fitting Line Segments in Images
MMBENCH: Is Your Multi-Modal Model an All-around Player?
Large Motion Model for Unified Multi-Modal Motion Generation
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Robust Fitting on a Gate Quantum Computer
The Nerfect Match: Exploring NeRF Features for Visual Localization
Physics-Based Interaction with 3D Objects via Video Generation
IMMA: Immunizing text-to-image Models against Malicious Adaptation
HPE-Li: WiFi-enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation
Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time
Learning to Obstruct Few-Shot Image Classification over Restricted Classes
UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model
iHuman: Instant Animatable Digital Humans From Monocular Videos
Optimizing Illuminant Estimation in Dual-Exposure HDR Imaging
Text-Conditioned Resampler For Long Form Video Understanding
EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset
GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering
D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction
Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning
R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding
Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding
FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation
AEDNet: Adaptive Embedding and Multiview-Aware Disentanglement for Point Cloud Completion
Adaptive Human Trajectory Prediction via Latent Corridors
DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding
Holodepth: Programmable Depth-Varying Projection via Computer-Generated Holography
Flowed Time of Flight Radiance Fields
Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering
Denoising Vision Transformers
SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather
LiDAR-based All-weather 3D Object Detection via Prompting and Distilling 4D Radar
SkyScenes: A Synthetic Dataset for Aerial Scene Understanding
ViPer: Visual Personalization of Generative Models via Individual Preference Learning
How Far Can a 1-Pixel Camera Go? Solving Vision Tasks using Photoreceptors and Computationally Designed Visual Morphology
Temporal Event Stereo via Joint Learning with Stereoscopic Flow
Early Anticipation of Driving Maneuvers
Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention
Probabilistic Weather Forecasting with Deterministic Guidance-based Diffusion Model
FARSE-CNN: Fully Asynchronous, Recurrent and Sparse Event-Based CNN
Event-Adapted Video Super-Resolution
Diffusion Models as Optimizers for Efficient Planning in Offline RL
Scene-aware Human Motion Forecasting via Mutual Distance Prediction
TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos
Track Everything Everywhere Fast and Robustly
CoMo: Controllable Motion Generation through Language Guided Pose Code Editing
Gravity-aligned Rotation Averaging with Circular Regression
Semicalibrated Relative Pose from an Affine Correspondence and Monodepth
MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps
Global Structure-from-Motion Revisited
GeoCalib: Learning Single-image Calibration with Geometric Optimization
Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information
Where am I? Scene Retrieval with Language
Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels
SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
StereoGlue: Joint Feature Matching and Robust Estimation
Robust Incremental Structure-from-Motion with Hybrid Features
Relightable Neural Actor with Intrinsic Decomposition and Pose Control
Controlling the World by Sleight of Hand
DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields
Co-speech Gesture Video Generation with 3D Human Meshes
NOVUM: Neural Object Volumes for Robust Object Classification
Solving Motion Planning Tasks with a Scalable Generative Model
iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images
PACE: Pose Annotations in Cluttered Environments
Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
3D Hand Pose Estimation in Everyday Egocentric Images
3D Reconstruction of Objects in Hands without Real World 3D Supervision
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding
Controllable Human-Object Interaction Synthesis
Nymeria: A Massive Collection of Egocentric Multi-modal Human Motion in the Wild
DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
Factorized Diffusion: Perceptual Illusions by Noise Decomposition
DiffusionPen: Towards Controlling the Style of Handwritten Text Generation
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
WordRobe: Text-Guided Generation of Textured 3D Garments
Deep Patch Visual SLAM
SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow
GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition
Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density
Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding
CityGuessr: City-Level Video Geo-Localization on a Global Scale
Möbius Transform for Mitigating Perspective Distortions in Representation Learning
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
Depth-Aware Blind Image Decomposition for Real-World Adverse Weather Recovery
Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning
Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection
Free Lunch for Gait Recognition: A Novel Relation Descriptor
Generative End-to-End Autonomous Driving
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding
Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries
Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation
Learning to Build by Building Your Own Instructions
G3R: Gradient Guided Generalizable Reconstruction
DeTra: A Unified Model for Object Detection and Trajectory Forecasting
UniCal: Unified Neural Sensor Calibration
NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration
BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting
RSL-BA: Rolling Shutter Line Bundle Adjustment
BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream
Six-Point Method for Multi-Camera Systems with Reduced Solution Space
Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection
Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations
AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation
UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation
SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow
AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection
EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation
HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models
MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration
TTT-MIM: Test-Time Training with Masked Image Modeling for Denoising Distribution Shifts
Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
Confidence-Based Iterative Generation for Real-World Image Super-Resolution
Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation
Tuning-Free Image Customization with Image and Text Guidance
Dolfin: Diffusion Layout Transformers without Autoencoder
Fast View Synthesis of Casual Videos with Soup-of-Planes
Explorative Inbetweening of Time and Space
Accelerating Image Super-Resolution Networks with Pixel-Level Classification
SAVE: Protagonist Diversification with Structure Agnostic Video Editing
Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks
Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision
SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder
RadEdit: stress-testing biomedical vision models via diffusion image editing
Stripe Observation Guided Inference Cost-free Attention Mechanism
Trajectory-aligned Space-time Tokens for Few-shot Action Recognition
Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design
DSA: Discriminative Scatter Analysis for Early Smoke Segmentation
FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing
BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation
Factorizing Text-to-Video Generation by Explicit Image Conditioning
AdaIFL: Adaptive Image Forgery Localization via a Dynamic and Importance-aware Transformer Network
Toward Tiny and High-quality Facial Makeup with Data Amplify Learning
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
When Do We Not Need Larger Vision Models?
Open Panoramic Segmentation
SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging
Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation
Self-Supervised Any-Point Tracking by Contrastive Random Walks
A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks
Robust Zero-Shot Crowd Counting and Localization with Adaptive Resolution SAM
Non-parametric Sensor Noise Modeling and Synthesis
ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement
SegPoint: Segment Any Point Cloud via Large Language Model
Learning to Robustly Reconstruct Dynamic Scenes from Low-light Spike Streams
Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks
Improving image synthesis with diffusion-negative sampling
ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos
Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement
Contextual Correspondence Matters: Bidirectional Graph Matching for Video Summarization
Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment
MagicEraser: Erasing Any Objects via Semantics-Aware Control
Cross-Domain Learning for Video Anomaly Detection with Limited Supervision
GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction
TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling
EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head
MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections
TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning
LITA: Language Instructed Temporal-Localization Assistant
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing
Siamese Vision Transformers are Scalable Audio-visual Learners
EvSign: Sign Language Recognition and Translation with Streaming Events
Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification
Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL
Take A Step Back: Rethinking the Two Stages in Visual Reasoning
Multi-Task Domain Adaptation for Language Grounding with 3D Objects
Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection
TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
Navigating Text-to-Image Generative Bias across Indic Languages
HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions
Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction
Improving Diffusion Models for Authentic Virtual Try-on in the Wild
Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
ReNoise: Real Image Inversion Through Iterative Noising
LCM-Lookahead for Encoder-based Text-to-Image Personalization
COIN-Matting: Confounder Intervention for Image Matting
GaussReg: Fast 3D Registration with Gaussian Splatting
SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation
DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors
Parrot Captions Teach CLIP to Spot Text
Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
Learning Video Context as Interleaved Multimodal Sequences
Score Distillation Sampling with Learned Manifold Corrective
Instant 3D Human Avatar Generation using Image Diffusion Models
WAS: Dataset and Methods for Artistic Text Segmentation
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing
FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Thinking Outside the BBox: Unconstrained Generative Object Compositing
Free-Editor: Zero-shot Text-driven 3D Scene Editing
Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
Towards Multi-modal Transformers in Federated Learning
LatentEditor: Text Driven Local Editing of 3D Scenes
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
BAMM: Bidirectional Autoregressive Motion Model
From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
Spline-based Transformers
Lossy Image Compression with Foundation Diffusion Models
Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models
Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
AdaDiffSR: Adaptive Region-aware Dynamic acceleration Diffusion Model for Real-World Image Super-Resolution
DEVIAS: Learning Disentangled Video Representations of Action and Scene
Neural Spectral Decomposition for Dataset Distillation
ReGround: Improving Textual and Spatial Grounding at No Cost
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation
RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation
TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
Multi-branch Collaborative Learning Network for 3D Visual Grounding
Enhancing Tampered Text Detection through Frequency Feature Fusion and Decomposition
Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents
CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection
DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation
Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement
Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
GalLop: Learning global and local prompts for vision-language models
Can OOD Object Detectors Learn from Foundation Models?
SAFARI: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition
MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion
Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models
Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution
Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition
Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts
VideoMamba: Spatio-Temporal Selective State Space Model
Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition
Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation
Interaction-centric Spatio-Temporal Context Reasoning for Multi-Person Video HOI Recognition
GRiT: A Generative Region-to-text Transformer for Object Understanding
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images
Click Prompt Learning with Optimal Transport for Interactive Segmentation
Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
Multi-Sentence Grounding for Long-term Instructional Video
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
LLMGA: Multimodal Large Language Model based Generation Assistant
Audio-visual Generalized Zero-shot Learning the Easy Way
Audio-Synchronized Visual Animation
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
Implicit Concept Removal of Diffusion Models
WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation
Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
Benchmarking Object Detectors with COCO: A New Path Forward
CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model
AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval
Pix2Gif: Motion-Guided Diffusion for GIF Generation
cDP-MIL: Robust Multiple Instance Learning via Cascaded Dirichlet Process
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation
EA-VTR: Event-Aware Video-Text Retrieval
Multiscale Sliced Wasserstein Distances as Perceptual Color Difference Measures
SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images
X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
Learning-based Axial Video Motion Magnification
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models
BAFFLE: A Baseline of Backpropagation-Free Federated Learning
TrojVLM: Backdoor Attack Against Vision Language Models
Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models
LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment
Structured-NeRF: Hierarchical Scene Graph with Neural Representation
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models
Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent
Pseudo-Labelling Should Be Aware of Disguising Channel Activations
SNP: Structured Neuron-level Pruning to Preserve Attention Scores
To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of Point Cloud Transfer Learning
Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning
Unsupervised Representation Learning by Balanced Self Attention Matching
HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion
Linearly Controllable GAN: Unsupervised Feature Categorization and Decomposition for Image Generation and Manipulation
Diagnosing and Re-learning for Balanced Multimodal Learning
Visual Prompting via Partial Optimal Transport
MetaWeather: Few-Shot Weather-Degraded Image Restoration
Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild
Modeling Label Correlations with Latent Context for Multi-Label Recognition
Rebalancing Using Estimated Class Distribution for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch
X-Pose: Detecting Any Keypoints
Segment and Recognize Anything at Any Granularity
Making Large Language Models Better Planners with Reasoning-Decision Alignment
Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting
Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams
Few-shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
CPT-VR: Improving Surface Rendering via Closest Point Transform with View-Reflection Appearance
Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments
OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition
Defect Spectrum: A Granular Look of Large-scale Defect Datasets with Rich Semantics
Combining Generative and Geometry Priors for Wide-Angle Portrait Correction
Improving 2D Feature Representations by 3D-Aware Fine-Tuning
DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency
Eliminating Feature Ambiguity for Few-Shot Segmentation
GroupDiff: Diffusion-based Group Portrait Editing
LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation
Loc3Diff: Local Diffusion for 3D Human Head Synthesis and Editing
Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment
DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting
Improving Adversarial Transferability via Model Alignment
Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures
UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects
Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging
Surface Reconstruction for 3D Gaussian Splatting via Local Structural Hints
Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal
3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting
Adversarial Diffusion Distillation
MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition
LogoSticker: Inserting Logos into Diffusion Models for Customized Generation
Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation
PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Learning to Unlearn for Robust Machine Unlearning
Towards Physical World Backdoor Attacks against Skeleton Action Recognition
Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers
Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation
Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning
MinD-3D: Reconstruct High-quality 3D objects in Human Brain
Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
Self-supervised Shape Completion via Involution and Implicit Correspondences
MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment
Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits
C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition
AMEGO: Active Memory from long EGOcentric videos
Towards Neuro-Symbolic Video Understanding
SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance
SILC: Improving Vision Language Pretraining with Self-Distillation
MoVideo: Motion-Aware Video Generation with Diffusion Models
MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty
SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking
Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector
MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation
Co-Student: Collaborating Strong and Weak Students for Sparsely Annotated Object Detection
Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views
E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation
A Unified Image Compression Method for Human Perception and Multiple Vision Tasks
Region-Native Visual Tokenization
Towards Dual Transparent Liquid Level Estimation in Biomedical Lab: Dataset, Methods and Practice
Eliminating Warping Shakes for Unsupervised Online Video Stitching
FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification
Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation
Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network
Facial Affective Behavior Analysis with Instruction Tuning
Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement
An Incremental Unified Framework for Small Defect Inspection
Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation
Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Towards Open-ended Visual Quality Comparison
D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On
GenQ: Quantization in Low Data Regimes with Generative Synthetic Data
Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach
Characterizing Model Robustness via Natural Input Gradients
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis
Hetecooper: Feature Collaboration Graph for Heterogeneous Collaborative Perception
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders
Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders
On Pretraining Data Diversity for Self-Supervised Learning
DATENeRF: Depth-Aware Text-based Editing of NeRFs
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks
Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation
Soft Prompt Generation for Domain Generalization
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
The Hard Positive Truth about Vision-Language Compositionality
m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
BLINK: Multimodal Large Language Models Can See but Not Perceive
Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Object Appearance Graphs
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
GiT: Towards Generalist Vision Transformer through Universal Language Interface
SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing
Scaling Backwards: Minimal Synthetic Pre-training?
Rethinking Image Super Resolution from Training Data Perspectives
Object-Centric Diffusion for Efficient Video Editing
SIGMA: Sinkhorn-Guided Masked Video Modeling
GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features
Quantized Prompt for Efficient Generalization of Vision-Language Models
uCAP: An Unsupervised Prompting Method for Vision-Language Models
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence
Improving Knowledge Distillation via Regularizing Feature Direction and Norm
DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images
SIMBA: Split Inference - Mechanisms, Benchmarks and Attacks
Event Trojan: Asynchronous Event-based Backdoor Attacks
Data Poisoning Quantization Backdoor Attack
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Real-data-driven 2000 FPS Color Video from Mosaicked Chromatic Spikes
Self-Supervised Underwater Caustics Removal and Descattering via Deep Monocular SLAM
Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats
Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis
City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web
Generalizable Human Gaussians for Sparse View Synthesis
Improving Neural Surface Reconstruction with Feature Priors from Multi-View Images
SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization
End-to-End Rate-Distortion Optimized 3D Gaussian Representation
DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting
Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting
POCA: Post-training Quantization with Temporal Alignment for Codec Avatars
TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation
Shapefusion: 3D localized human diffusion models
Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution
Open-World Dynamic Prompt and Continual Visual Representation Learning
MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes
RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion
T-CorresNet: Template Guided 3D Point Cloud Completion with Correspondence Pooling Query Generation Strategy
PSALM: Pixelwise Segmentation with Large Multi-modal Model
Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains
Multi-modal Relation Distillation for Unified 3D Representation Learning
Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM
Zero-Shot Multi-Object Scene Completion
Single-Photon 3D Imaging with Equi-Depth Photon Histograms
Leveraging scale- and orientation-covariant features for planar motion estimation
TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly
Human Pose Recognition via Occlusion-Preserving Abstract Images
RT-Pose: A 4D Radar-Tensor based 3D Human Pose Estimation and Localization Benchmark
6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry
On the Utility of 3D Hand Poses for Action Recognition
Towards Stable 3D Object Detection
Beyond the Data Imbalance: Employing the Heterogeneous Datasets for Vehicle Maneuver Prediction
ADMap: Anti-disturbance Framework for Vectorized HD Map Construction
CarFormer: Self-Driving with Learned Object-Centric Representations
Visual Relationship Transformation
Un-EVIMO: Unsupervised Event-based Independent Motion Segmentation
Local All-Pair Correspondence for Point Tracking
REDIR: Refocus-free Event-based De-occlusion Image Reconstruction
Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising
Teach CLIP to Develop a Number Sense for Ordinal Regression
Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models
Self-Supervised Audio-Visual Soundscape Stylization
Photorealistic Video Generation with Diffusion Models
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
Text2Place: Affordance-aware Text Guided Human Placement
Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression
EraseDraw: Learning to Insert Objects by Erasing Them from Images
ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation
Label-free Neural Semantic Image Synthesis
Inf-DiT: Upsampling any-resolution image with memory-efficient diffusion transformer
Context Diffusion: In-Context Aware Image Generation
SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models
Large-scale Reinforcement Learning for Diffusion Models
Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling
Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion
Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis
AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation
L-DiffER: Single Image Reflection Removal with Language-based Diffusion Model
LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement
Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal
OAPT: Offset-Aware Partition Transformer for Double JPEG Artifacts Removal
DualDn: Dual-domain Denoising via Differentiable ISP
Image Compression for Machine and Human Vision With Spatial-Frequency Adaptation
GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields
Personalized Privacy Protection Mask Against Unauthorized Facial Recognition
An Optimal Control View of LoRA and Binary Controller Design for Vision Transformers
DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video
SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders
Context-Aware Action Recognition: Introducing a Comprehensive Dataset for Behavior Contrast
Semi-Supervised Teacher-Reference-Student Architecture for Action Quality Assessment
Data Collection-free Masked Video Modeling
Self-supervised visual learning from interactions with objects
Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning
Sequential Representation Learning via Static-Dynamic Conditional Disentanglement
Video Question Answering with Procedural Programs
ViLA: Efficient Video-Language Alignment for Video Question Answering
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
Situated Instruction Following
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language
LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images
Common Sense Reasoning for Deep Fake Detection
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
Conceptual Codebook Learning for Vision-Language Models
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism
Do Generalised Classifiers really work on Human Drawn Sketches?
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
Discovering Unwritten Visual Classifiers with Large Language Models
Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
LiteSAM is Actually what you Need for segment Everything
CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection
DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement
Phase Concentration and Shortcut Suppression for Weakly Supervised Semantic Segmentation
Weighting Pseudo-Labels via High-Activation Feature Index Similarity and Object Detection for Semi-Supervised Segmentation
Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation
Shifted Autoencoders for Point Annotation Restoration in Object Counting
Rectify the Regression Bias in Long-Tailed Object Detection
Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning
Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation
Norma: A Noise Robust Memory-Augmented Framework for Whole Slide Image Classification
GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes
BugNIST - a Large Volumetric Dataset for Detection under Domain Shift
AD3: Introducing a score for Anomaly Detection Dataset Difficulty assessment using VIADUCT dataset
Unsupervised, Online and On-The-Fly Anomaly Detection For Non-Stationary Image Distributions
Robustness Tokens: Towards Adversarial Robustness of Transformers
Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models
LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration
Improving Hyperbolic Representations via Gromov-Wasserstein Regularization
External Knowledge Enhanced 3D Scene Generation from Sketch
SCOD: From Heuristics to Theory
SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning
Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation
Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision
Resilience of Entropy Model in Distributed Neural Networks
Generalized Coverage for More Robust Low-Budget Active Learning
Pick-a-back: Selective Device-to-Device Knowledge Transfer in Federated Continual Learning
UNIC: Universal Classification Models via Multi-teacher Distillation
Distributed Active Client Selection With Noisy Clients Using Model Association Scores
FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning
Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data
WBP: Training-time Backdoor Attacks through Hardware-based Weight Bit Poisoning
Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
Anytime Continual Learning for Open Vocabulary Classification
COD: Learning Conditional Invariant Representation for Domain Adaptation Regression
On the Topology Awareness and Generalization Performance of Graph Neural Networks
Model Stock: All we need is just a few fine-tuned models
A Direct Approach to Viewing Graph Solvability
ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
Gaussian Grouping: Segment and Edit Anything in 3D Scenes
Shape from Heat Conduction
PAV: Personalized Head Avatar from Unstructured Video Collection
NICP: Neural ICP for 3D Human Registration at Scale
High-Quality Mesh Blendshape Generation from Face Videos via Neural Inverse Rendering
Image Demoireing in RAW and sRGB Domains
Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy
NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis
2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction
Deep Polarization Cues for Single-shot Shape and Subsurface Scattering Estimation
Look Around and Learn: Self-Training Object Detection by Exploration
6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model
EpipolarGAN: Omnidirectional Image Synthesis with Explicit Camera Control
Segmentation-guided Layer-wise Image Vectorization with Gradient Fills
Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval
Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees
Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting
High-Fidelity Modeling of Generalizable Wrinkle Deformation
Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos
Seeing Faces in Things: A Model and Dataset for Pareidolia
Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder
ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency
MAD-DR: Map Compression for Visual Localization with Matchness Aware Descriptor Dimension Reduction
Tensorial template matching for fast cross-correlation with rotations and its application for tomography
Hyperion – A fast, versatile symbolic Gaussian Belief Propagation framework for Continuous-Time SLAM
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
CliffPhys: Camera-based Respiratory Measurement using Clifford Neural Networks
Image-to-Lidar Relational Distillation for Autonomous Driving Data
Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene
LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping
Learning to Drive via Asymmetric Self-Play
Event-based Mosaicing Bundle Adjustment
DIM: Dyadic Interaction Modeling for Social Behavior Generation
Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation
Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation
How Video Meetings Change Your Expression
Human Motion Forecasting in Dynamic Domain Shifts: A Homeostatic Continual Test-time Adaptation Framework
Towards Open Domain Text-Driven Synthesis of Multi-Person Motions
Scaling Up Personalized Image Aesthetic Assessment via Task Vector Customization
MegaScenes: Scene-Level View Synthesis at Scale
FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models
Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation
Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
Robust-Wide: Robust Watermarking against Instruction-driven Image Editing
ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion
AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion
Towards Architecture-Agnostic Untrained Networks Priors for Image Reconstruction with Frequency Regularization
CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems
Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations
Plug-and-Play Learned Proximal Trajectory for 3D Sparse-View X-Ray Computed Tomography
Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model
Spiking Wavelet Transformer
Finding a needle in a haystack: A Black-Box Approach to Invisible Watermark Detection
Noise-assisted Prompt Learning for Image Forgery Detection and Localization
Oulu Remote-photoplethysmography Physical Domain Attacks Database (ORPDAD)
Affine steerers for structured keypoint description
A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation
ZeroI2V: Zero-Cost Adaptation of Pre-Trained Transformers from Image to Video
SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow
Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization
Two-Stage Active Learning for Efficient Temporal Action Segmentation
MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos
Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
Text-Guided Video Masked Autoencoder
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos
Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning
Uni3DL: A Unified Model for 3D Vision-Language Understanding
TrajPrompt: Aligning Color Trajectory with Vision-Language Representations
Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification
OmniSat: Self-Supervised Modality Fusion for Earth Observation
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
Generalizing to Unseen Domains via Text-guided Augmentation
R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations
Fully Authentic Visual Question Answering Dataset from Online Communities
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
Prompting Language-Informed Distribution for Compositional Zero-Shot Learning
Kalman-Inspired Feature Propagation for Video Face Super-Resolution
Diffusion-Guided Weakly Supervised Semantic Segmentation
Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs
Better Call SAL: Towards Learning to Segment Anything in Lidar
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias
Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models
Plain-Det: A Plain Multi-Dataset Object Detector
PDT Uav Target Detection Dataset for Pests and Diseases Tree
FreeAugment: Data Augmentation Search Across All Degrees of Freedom
CC-SAM: Enhancing SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation
Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model
MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks
Learning Quantized Adaptive Conditions for Diffusion Models
BKDSNN: Enhancing the Performance of Learning-based Spiking Neural Networks Training with Blurred Knowledge Distillation
FairViT: Fair Vision Transformer via Adaptive Masking
PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference
Dropout Mixture Low-Rank Adaptation for Visual Parameters-Efficient Fine-Tuning
Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection
Flexible Distribution Alignment: Towards Long-tailed Semi-supervised Learning with Proper Calibration
An accurate detection is not all you need to combat label noise in web-noisy datasets
Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization
STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay
CLEO: Continual Learning of Evolving Ontologies
Learning Representation for Multitask Learning through Self-Supervised Auxiliary Learning
CriSp: Leveraging Tread Depth Maps for Enhanced Crime-Scene Shoeprint Matching
Federated Learning with Local Openset Noisy Labels
Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents
Causal Subgraphs and Information Bottlenecks: Redefining OOD Robustness in Graph Neural Networks
Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks
Shedding More Light on Robust Classifiers under the lens of Energy-based Models
Inter-Class Topology Alignment for Efficient Black-Box Substitute Attacks
AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models
FedHide: Federated Learning by Hiding in the Neighbors