Poster
3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance
Xiaoxu Xu · Yitian Yuan · Jinlong Li · Qiudan Zhang · Zequn Jie · Lin Ma · Hao Tang · Niculae Sebe · Xu Wang
# 70
Strong Double Blind
3D weakly supervised semantic segmentation aims to learn semantic segmentation without dense annotations. Previous methods mostly rely on Class Activation Maps (CAMs) to address this challenge. In this paradigm, the model is supervised with scene-level or subcloud-level labels, but the textual semantic information carried by the category labels remains largely unexplored. In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance. As an alternative to CAM-based methods, a 3D model predicts a dense embedding for each point, and this embedding is co-embedded with the aligned image and text spaces of a 2D vision-language model. Specifically, our method exploits the superior generalization ability of 2D vision-language models and proposes an Embeddings Soft-Guidance Stage that uses it to implicitly align 3D embeddings with text embeddings. Moreover, we introduce an Embeddings Specialization Stage that purifies the feature representation with the help of the given scene-level labels, yielding more specific features supervised by the corresponding text embeddings. The 3D model thus gains informative supervision from both image embeddings and text embeddings, leading to competitive segmentation performance. To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation using the textual semantic information of category labels. Moreover, extensive quantitative and qualitative experiments show that 3DSS-VLG not only achieves state-of-the-art performance on both the S3DIS and ScanNet datasets, but also maintains strong generalization capability.
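To make the co-embedding idea concrete, here is a minimal sketch of how per-point embeddings that live in the same space as text embeddings could be turned into per-point labels, with the scene-level label used to rule out absent categories. It is an illustration under stated assumptions, not the 3DSS-VLG implementation: the function names `cosine` and `label_points`, the three-dimensional toy vectors, and the choice of cosine similarity with a hard mask are all hypothetical. In practice the text embeddings would come from a 2D vision-language model and the point embeddings from a trained 3D network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def label_points(point_embeddings, text_embeddings, scene_labels):
    """Assign each point the most similar category among those in the scene.

    point_embeddings: list of vectors, one per 3D point (hypothetical 3D model output).
    text_embeddings:  dict mapping category name -> text embedding (hypothetical).
    scene_labels:     set of category names known to occur in the scene (weak label).
    """
    # Assumption: the scene-level label is applied as a hard mask over categories.
    allowed = [name for name in text_embeddings if name in scene_labels]
    if not allowed:
        raise ValueError("scene_labels shares no category with text_embeddings")
    return [
        max(allowed, key=lambda name: cosine(point, text_embeddings[name]))
        for point in point_embeddings
    ]


# Toy data: "sofa" is deliberately close to "chair" in embedding space.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "sofa": [0.9, 0.1, 0.0],
}
points = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

masked = label_points(points, text_embeddings, {"chair", "table"})
unmasked = label_points(points, text_embeddings, set(text_embeddings))
```

In this toy setup the first point matches "sofa" best when all categories are allowed, and falls back to "chair" once the scene-level label excludes "sofa". This is one simple way a scene-level label can sharpen text-driven predictions.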