

Unifying 3D Vision-Language Understanding via Promptable Queries

Ziyu Zhu · Zhuofan Zhang · Xiaojian Ma · Xuesong Niu · Yixin Chen · Baoxiong Jia · Zhidong Deng · Siyuan Huang · Qing Li

Wed 2 Oct 1:30 a.m. PDT — 3:30 a.m. PDT


A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations as input and perform a wide range of tasks in a 3D scene. However, a considerable gap remains between existing methods and such a unified model. In this paper, we introduce PQ3D, a unified model that uses Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying different 3D scene representations (i.e., voxels, point clouds, multi-view images) into a common 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D sets new records on most benchmarks. In particular, PQ3D improves the state of the art on ScanNet200 by 1.8% (AP), ScanRefer by 5.4% (acc@0.5), Multi3DRef by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with whatever 3D representations are available, e.g., relying solely on voxels.
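
To make innovations (1) and (2) more concrete, here is a minimal toy sketch, not the authors' implementation. It assumes that segment-level grouping can be approximated by mean-pooling the features of each representation over shared segment IDs, and that the query decoder can be approximated by single-head scaled dot-product attention. The names `pool_by_segment` and `attend`, the mean-pooling choice, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def pool_by_segment(features, segment_ids, num_segments):
    """Mean-pool per-element features (points, voxels, ...) into segments.

    Assumption: averaging is used as the grouping operator; the source
    only says that representations are unified by segment-level grouping.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_segments)]
    counts = [0] * num_segments
    for feat, seg in zip(features, segment_ids):
        counts[seg] += 1
        for d in range(dim):
            sums[seg][d] += feat[d]
    # Empty segments fall back to a zero vector.
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(num_segments)
    ]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Toy scene: 4 points and 3 voxels, each assigned to one of 2 shared segments.
point_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
point_segs = [0, 0, 1, 1]
voxel_feats = [[2.0, 2.0], [0.0, 1.0], [0.0, 3.0]]
voxel_segs = [0, 1, 1]

# After pooling, both representations are indexed by the same segments.
point_seg_feats = pool_by_segment(point_feats, point_segs, 2)
voxel_seg_feats = pool_by_segment(voxel_feats, voxel_segs, 2)

# A hypothetical prompt-derived query retrieves information from whichever
# representation is available (here: points only, or voxels only).
query = [1.0, 0.0]
out_point = attend(query, point_seg_feats, point_seg_feats)
out_voxel = attend(query, voxel_seg_feats, voxel_seg_feats)
```

Because every representation is reduced to one feature vector per segment, the same query can attend over any subset of them, which is one way to read the abstract's claim about flexible inference with only voxels available.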
