Poster
GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
Manu S Pillai · Mamshad Nayeem Rizve · Mubarak Shah
# 172
Strong Double Blind
Cross-view video geo-localization (CVGL) aims to obtain a GPS trajectory of a street-view video by matching it with reference aerial-view images. Despite exhibiting promising performance, current CVGL methods face key limitations. They often rely on camera intrinsics and odometry information; they use context from multiple frames to obtain frame-level features, which leads to high computational overhead; and they generate temporally inconsistent GPS trajectories by retrieving each street-view frame independently. To address these challenges, in this work, we propose TransCVGL, the first fully transformer-based method for cross-view video geo-localization. We hypothesize that video geo-localization does not require complex temporal modeling, unlike other common video understanding tasks such as action recognition. Instead, we demonstrate that the representations from a street-view geo-localization model can be efficiently aggregated to obtain a video-level representation. To achieve this, we propose a transformer-adapter module, GeoAdapter, which aggregates the image-level representations of an image geo-localization model and adapts that model to video inputs. Furthermore, to ensure temporally consistent GPS predictions, we introduce TransRetriever, the first transformer-based approach that models independent frame retrievals through an auto-regressive transformer decoder. Finally, we validate the efficacy of our method through extensive experiments, showcasing state-of-the-art performance on benchmark datasets.
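For readers new to the task, the independent per-frame retrieval that the abstract identifies as a source of temporally inconsistent trajectories can be sketched roughly as follows. This is only an illustration of that baseline idea, not the authors' GeoAdapter or TransRetriever. The function names (`cosine`, `retrieve_trajectory`), the two-dimensional toy features, and the GPS coordinates are all assumptions made up for the example; a real system would use learned embeddings from a cross-view model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_trajectory(frame_feats, aerial_feats, aerial_gps):
    """For each street-view frame, pick the GPS tag of the most similar aerial image.

    Every frame is matched on its own, so nothing forces neighbouring
    frames to land on neighbouring locations.
    """
    trajectory = []
    for f in frame_feats:
        best = max(range(len(aerial_feats)),
                   key=lambda i: cosine(f, aerial_feats[i]))
        trajectory.append(aerial_gps[best])
    return trajectory

# Hypothetical toy data: three frame embeddings, two aerial reference embeddings.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aerial = [[1.0, 0.0], [0.0, 1.0]]
gps = [(28.60, -81.20), (28.61, -81.21)]  # made-up (lat, lon) pairs

trajectory = retrieve_trajectory(frames, aerial, gps)
```

Because each frame's match ignores its neighbours, a single noisy frame embedding can make the predicted trajectory jump between distant locations; this is the kind of inconsistency that a sequence-aware retrieval stage is meant to reduce.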