

Poster

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Bu Jin · Yupeng Zheng · Pengfei Li · Weize Li · Yuhang Zheng · Sujie Hu · Xinyu Liu · Jinwei Zhu · Zhijie Yan · Haiyang Sun · Kun Zhan · Peng Jia · Xiaoxiao Long · Yilun Chen · Hao Zhao

Fri 4 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

3D dense captioning is a cornerstone of comprehensive scene understanding through explicit natural language, and it has seen remarkable progress in indoor scenes recently. However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: (1) the domain gap between indoor and outdoor scenes, such as sparse visual inputs and dynamic objects, which makes it difficult to transfer existing methods directly; and (2) the lack of data with descriptive 3D-language pair annotations tailored for outdoor scenes. Hence, we introduce the new task of outdoor 3D dense captioning. As input, we assume a LiDAR-swept 3D point cloud of the scene along with a set of RGB images captured by the ego-vehicle's cameras. To address this task, we propose the TOD^3Cap network, which leverages a BEV representation to encode sparse outdoor scenes and then combines a Relation Q-Former with LLaMA-Adapter to capture spatial relationships and generate rich concept descriptions in open-world outdoor environments. We also introduce the TOD^3Cap dataset, the first million-scale effort to jointly perform 3D object detection and captioning in outdoor scenes, containing 2.3M descriptions of 64.3k outdoor objects from 850 scenes in nuScenes. Notably, our TOD^3Cap network can effectively localize and describe 3D objects in outdoor scenes, outperforming indoor baseline methods by a significant margin (+9.76% CIDEr@0.5IoU).
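The abstract outlines a three-stage pipeline: a BEV encoder for the sparse LiDAR and camera inputs, a Relation Q-Former that captures spatial relationships between objects, and LLaMA-Adapter for caption generation. As a rough, non-authoritative illustration of how such a pipeline might be wired together, here is a minimal PyTorch sketch; every module name, tensor shape, and internal detail below is an assumption based only on this abstract, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BEVEncoder(nn.Module):
    """Hypothetical stand-in: fuses LiDAR points and multi-view images
    into a bird's-eye-view (BEV) feature grid. A real system would lift
    image and point features into the BEV plane; here we emit a dummy grid."""
    def __init__(self, bev_size=200, channels=256):
        super().__init__()
        self.bev_size, self.channels = bev_size, channels
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, lidar_points, images):
        batch = images.shape[0]
        bev = torch.randn(batch, self.channels, self.bev_size, self.bev_size)
        return self.proj(bev)  # (B, C, H, W)


class RelationQFormer(nn.Module):
    """Hypothetical stand-in: object queries cross-attend to BEV features,
    then self-attend to each other so each query can encode its spatial
    relations to the other objects in the scene."""
    def __init__(self, num_queries=100, dim=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, bev):
        b, c, h, w = bev.shape
        ctx = bev.flatten(2).transpose(1, 2)            # (B, H*W, C)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q, _ = self.cross_attn(q, ctx, ctx)             # gather scene context
        q, _ = self.self_attn(q, q, q)                  # inter-object relations
        return q                                        # (B, num_queries, C)


class CaptionHead(nn.Module):
    """Hypothetical stand-in for the LLaMA-Adapter stage: each relation-aware
    query would be projected into the LLM's embedding space and prepended as
    an adapter prompt to a frozen LLaMA that decodes the caption. Only the
    projection is sketched here."""
    def __init__(self, dim=256, llm_dim=4096):
        super().__init__()
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, queries):
        return self.to_llm(queries)  # (B, num_queries, llm_dim) prompt embeddings


if __name__ == "__main__":
    images = torch.randn(1, 6, 3, 224, 224)  # 6 surround-view cameras (dummy)
    lidar = torch.randn(1, 30000, 4)         # x, y, z, intensity (dummy)
    bev = BEVEncoder()(lidar, images)
    prompts = CaptionHead()(RelationQFormer()(bev))
    print(prompts.shape)                     # torch.Size([1, 100, 4096])
```

In this reading, the BEV grid serves as a dense scene memory that tolerates sparse outdoor inputs, while the query self-attention is what lets each object's caption mention its neighbors; the actual TOD^3Cap architecture may differ in all of these details.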
