

Poster

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

Dahyun Kang · Minsu Cho

Poster #45
Strong Double Blind: This paper was not made available on public preprint services during the review process.
[ Paper PDF ]
Thu 3 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

Vision-and-language foundation models have shown impressive results on zero-shot image classification, where the target classes are represented by text descriptions with no labeled image examples. Recent work extends such powerful image-text correspondence to open-vocabulary segmentation, i.e., predicting pixel-text correspondence without pixel-level supervision on the unseen target classes. Much of the previous art casts this task as pixel-to-text classification without the goal of comprehending objects within an image. We believe segmentation is a visual understanding task and advocate decoupling segmentation from visual grounding. To this end, we introduce Lazy Visual Grounding for zero-shot open-vocabulary segmentation. Lazy visual grounding first discovers distinguishable visual units as object masks with iterative graph cuts and then assigns text to the discovered visual objects in a late-interaction manner. Our model is training-free yet performs strongly on four public datasets: Pascal VOC, COCO-object, COCO-stuff, and ADE20K, and in particular produces visually appealing segmentation results, indicating the model's capability to comprehend visual objectness. Code and data will be released upon acceptance.
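The abstract describes a two-stage, training-free pipeline: object masks are first discovered with iterative graph cuts over patch features, and text labels are then assigned to those masks in a late-interaction step. Below is a minimal sketch of that flow, assuming DINO-style patch features and CLIP-style text embeddings as inputs; the single-eigenvector Normalized Cut, mean-pooled mask features, and random stand-in tensors are illustrative simplifications, not the authors' implementation.

```python
# Sketch (not the authors' code) of the two-stage pipeline described in the abstract:
# (1) discover object masks from patch features with iterative graph cuts,
# (2) assign text to the discovered masks via late interaction with text embeddings.
import torch


def normalized_cut(affinity: torch.Tensor) -> torch.Tensor:
    """Bipartition a patch-affinity graph with one Normalized Cut step.

    Returns a boolean foreground mask over the graph nodes, taken from the
    second-smallest eigenvector of the normalized graph Laplacian.
    """
    degree = affinity.sum(dim=1)
    d_inv_sqrt = torch.diag(degree.clamp(min=1e-8).rsqrt())
    laplacian = torch.eye(len(degree)) - d_inv_sqrt @ affinity @ d_inv_sqrt
    _, eigvecs = torch.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]                      # second-smallest eigenvector
    return fiedler > fiedler.median()            # split nodes into two groups


def discover_masks(patch_feats: torch.Tensor, num_cuts: int = 3) -> list[torch.Tensor]:
    """Iteratively cut the patch graph, peeling off one object-like segment per round."""
    remaining = torch.arange(patch_feats.shape[0])
    masks = []
    for _ in range(num_cuts):
        if len(remaining) < 2:
            break
        feats = torch.nn.functional.normalize(patch_feats[remaining], dim=-1)
        affinity = (feats @ feats.T).clamp(min=0)     # non-negative cosine-similarity graph
        fg = normalized_cut(affinity)
        mask = torch.zeros(patch_feats.shape[0], dtype=torch.bool)
        mask[remaining[fg]] = True
        masks.append(mask)
        remaining = remaining[~fg]                    # continue on the leftover patches
    return masks


def assign_text(masks, patch_feats, text_embeds):
    """Late interaction: pool features inside each mask, then match against text embeddings."""
    labels = []
    for mask in masks:
        pooled = torch.nn.functional.normalize(patch_feats[mask].mean(0), dim=-1)
        labels.append(int((pooled @ text_embeds.T).argmax()))
    return labels


if __name__ == "__main__":
    patch_feats = torch.randn(196, 512)   # stand-in for 14x14 ViT patch features
    text_embeds = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
    masks = discover_masks(patch_feats)
    print(assign_text(masks, patch_feats, text_embeds))
```

Because the text embeddings only enter after the masks are fixed, segmentation quality in this sketch depends entirely on the visual grouping step, which mirrors the paper's argument for decoupling segmentation from grounding.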
