

Poster

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao · Na Zhao · Jingjing Chen · Yu-Gang Jiang

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Open-vocabulary 3D object detection (OV-3DDET) is a challenging task aimed at locating and recognizing objects in a 3D scene, encompassing both seen and previously unseen categories. Unlike in the vision and language domain, where abundant training data is available to train generalized models, 3D detection models suffer from a scarcity of training data. Despite this challenge, the flourishing field of vision-language models (VLMs) offers valuable insights that can guide the learning process for OV-3DDET. While some efforts have been made to incorporate VLMs into OV-3DDET learning, existing methods often fall short of establishing a comprehensive association between 3D detectors and VLMs. In this paper, we investigate the utilization of VLMs for the task of open-vocabulary 3D detection. We use a vision model to guide novel class discovery in 3D scenes, and hierarchically align the 3D feature space with the vision-language feature space. Specifically, we employ an off-the-shelf 2D detector to seed and select novel 3D objects. The discovered novel objects are then stored for retraining the 3D detector. Finally, we align the 3D feature space with the vision-language feature space using a pre-trained vision-language model at the instance, category, and scene levels. Through extensive experimentation, we demonstrate substantial improvements in accuracy and generalization, underscoring the potential of VLMs in advancing 3D object detection for real-world applications.
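To make the hierarchical alignment idea concrete, the sketch below illustrates one plausible way to implement instance-, category-, and scene-level alignment between 3D detector features and CLIP-style vision-language embeddings. This is not the authors' released code: the loss forms (cosine and contrastive cross-entropy), the feature dimensions, the temperature, and all variable names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's implementation) of aligning
# 3D detector features with vision-language embeddings at three levels.
import torch
import torch.nn.functional as F


def instance_alignment_loss(obj_feats_3d, obj_feats_2d):
    """Pull each 3D object feature toward the image-crop embedding of the
    same object (e.g. from a frozen CLIP image encoder)."""
    obj_feats_3d = F.normalize(obj_feats_3d, dim=-1)
    obj_feats_2d = F.normalize(obj_feats_2d, dim=-1)
    return (1.0 - (obj_feats_3d * obj_feats_2d).sum(dim=-1)).mean()


def category_alignment_loss(obj_feats_3d, text_embeds, labels, temperature=0.07):
    """Classify 3D object features against text embeddings of the seen and
    discovered category names with a contrastive cross-entropy."""
    logits = F.normalize(obj_feats_3d, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    return F.cross_entropy(logits / temperature, labels)


def scene_alignment_loss(scene_feat_3d, scene_feat_2d):
    """Align a pooled scene-level 3D feature with the global image embedding."""
    sim = F.cosine_similarity(scene_feat_3d, scene_feat_2d, dim=-1)
    return (1.0 - sim).mean()


# Toy usage: 8 objects and 2 scenes, all features projected to a 512-d space.
if __name__ == "__main__":
    obj3d, obj2d = torch.randn(8, 512), torch.randn(8, 512)
    text_embeds = torch.randn(20, 512)          # 20 category names
    labels = torch.randint(0, 20, (8,))
    scene3d, scene2d = torch.randn(2, 512), torch.randn(2, 512)

    total = (
        instance_alignment_loss(obj3d, obj2d)
        + category_alignment_loss(obj3d, text_embeds, labels)
        + scene_alignment_loss(scene3d, scene2d)
    )
    print(total.item())
```

In such a setup the three terms would typically be summed (possibly with weights) and added to the standard 3D detection loss; the exact weighting used in the paper is not specified here.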
