Poster
VEON: Vocabulary-Enhanced Occupancy Prediction
Jilai Zheng · Pin Tang · Zhongdao Wang · Guoqing Wang · Xiangxuan Ren · Bailan Feng · Chao Ma
# 238
Strong Double Blind
Perceiving the world as 3D occupancy helps embodied agents avoid collisions with obstacles of any type. While open-vocabulary image understanding has advanced rapidly, how to bind predicted 3D occupancy grids to open-world semantics remains under-explored because open-world annotations are limited. Hence, instead of building our model from scratch, we blend 2D foundation models, specifically the depth model MiDaS and the semantic model CLIP, to lift semantics into 3D space and thereby predict 3D occupancy. However, building upon these foundation models is not trivial. First, MiDaS suffers from depth ambiguity, i.e., it produces only coarse relative depth and cannot estimate the one-hot bin depth. Second, CLIP image features lack high-resolution pixel-level information, which limits 3D occupancy accuracy. Third, open-vocabulary recognition is often hampered by the long-tail problem. To address these issues, we propose VEON for Vocabulary-Enhanced Occupancy predictioN, which not only assembles but also adapts these foundation models. We first equip MiDaS with a ZoeDepth head and low-rank adaptation (LoRA) for relative-metric-bin depth transformation while preserving the beneficial depth prior. Then, a lightweight side adaptor network is attached to the CLIP vision encoder to generate high-resolution features for fine-grained 3D occupancy prediction. Moreover, we design a class-reweighting strategy that prioritizes the tail classes. With only 46.2M trainable parameters and no manual labels, VEON achieves 15.14 mIoU on Occ3D-NuScenes and can also recognize objects from open-vocabulary categories, demonstrating that VEON is label-efficient, parameter-efficient, and sufficiently precise.
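The abstract does not give the formula behind the class-reweighting strategy, so the following is only a minimal sketch of one common way to prioritize tail classes: inverse-frequency weights with a smoothing exponent, applied to a per-voxel cross-entropy loss. The function names `class_weights` and `weighted_cross_entropy`, the `power` parameter, and the normalization are all assumptions made for illustration and are not taken from VEON.

```python
import math
from collections import Counter


def class_weights(labels, num_classes, power=0.5):
    """Inverse-frequency class weights (illustrative, not VEON's formula).

    Each observed class c with n_c samples gets a raw weight (1 / n_c) ** power.
    Weights are then rescaled so their mean over observed classes is 1.
    Classes that never appear in `labels` get weight 0.
    """
    counts = Counter(labels)
    raw = [0.0] * num_classes
    for c in range(num_classes):
        n = counts.get(c, 0)
        if n > 0:
            raw[c] = (1.0 / n) ** power
    observed = [w for w in raw if w > 0]
    if not observed:
        raise ValueError("labels must contain at least one class")
    mean = sum(observed) / len(observed)
    return [w / mean for w in raw]


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted cross-entropy, normalized by the sum of applied weights.

    probs:  list of per-voxel probability lists (each sums to 1)
    labels: list of ground-truth class indices, one per voxel
    """
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        norm += weights[y]
    return total / norm


if __name__ == "__main__":
    # Hypothetical long-tailed label distribution: 90 / 9 / 1 voxels.
    labels = [0] * 90 + [1] * 9 + [2] * 1
    print(class_weights(labels, num_classes=3))
```

With `power=0.5` the rarest class receives the largest weight while the most frequent class is down-weighted; setting `power=0` recovers uniform weights over the observed classes, which makes the exponent a simple knob for how aggressively the tail is emphasized.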