

Poster

Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Heeseung Yun · Ruohan Gao · Ishwarya Ananthabhotla · Anurag Kumar · Jacob Donley · Chao Li · Gunhee Kim · Vamsi Krishna Ithapu · Calvin Murdock

Strong double-blind review: this paper was not made available on public preprint services during the review process.
Fri 4 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Egocentric videos provide comprehensive contexts for user and scene understanding, spanning from multisensory perception to the wearer's behaviors. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of the wearer's head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets the challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive image-to-sphere projections. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including active speaker localization in noisy conversations, audio-based spherical sound source localization, and behavior anticipation in everyday activities.
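The core idea of world-locking can be illustrated with a minimal sketch: a head-locked direction is re-expressed in a world-fixed frame by applying the wearer's head rotation, so that a stationary sound source keeps a stable spherical coordinate even as the head turns. The function names, the yaw/pitch parameterization, and the frame conventions below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def yaw_pitch_to_rotation(yaw, pitch):
    # Rotation matrix mapping head-frame vectors to the world frame
    # (yaw about the world z-axis, then pitch about the new y-axis).
    # Assumed convention for illustration only.
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    return Rz @ Ry

def world_lock(directions, head_rotation):
    # Re-express head-locked unit direction vectors (rows) on the
    # world-locked sphere by applying the head rotation.
    return directions @ head_rotation.T

# A source straight ahead of the wearer in the head frame...
d_local = np.array([[1.0, 0.0, 0.0]])
# ...with the head turned 90 degrees left maps to the world +y axis,
# so its world-locked position is independent of the head turn.
R = yaw_pitch_to_rotation(np.pi / 2, 0.0)
d_world = world_lock(d_local, R)
print(d_world)
```

In the paper's setting this transform is applied implicitly to multisensory embeddings rather than explicitly to pixels, which is why no expensive image-to-sphere projection is required.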
