Poster
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
Dahyun Kang · Minsu Cho
# 45
Strong Double Blind
Vision-and-language foundation models have shown impressive performance on zero-shot image classification, where the target classes are represented by text descriptions with no labeled image examples. Recent work extends such powerful image-text correspondence to open-vocabulary segmentation, i.e., predicting pixel-text correspondence without pixel-level supervision on the unseen target classes. Much of the prior art casts this task as pixel-to-text classification without the goal of comprehending objects within an image. We believe segmentation is a visual understanding task and advocate decoupling segmentation from visual grounding. To this end, we introduce Lazy Visual Grounding for zero-shot open-vocabulary segmentation. Lazy Visual Grounding first discovers distinguishable visual units as object masks with iterative graph cuts and then assigns text to the discovered objects in a late-interaction manner. Our model requires no training yet shows strong performance on four public datasets: Pascal VOC, COCO-object, COCO-stuff, and ADE20K; notably, it produces visually appealing segmentation results, indicating its capability to comprehend visual objectness. Code and data will be released upon acceptance.
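The abstract describes a two-stage pipeline: object-mask discovery via iterative graph cuts, followed by late-interaction text assignment. The sketch below is a minimal illustration of that idea, not the authors' implementation: it assumes L2-normalized patch features from some vision encoder (e.g., a self-supervised ViT) and per-class text embeddings from a CLIP-style text encoder, stands in random arrays for both, bipartitions the patch-affinity graph with a normalized cut (Fiedler vector of the normalized Laplacian), and labels each discovered mask by cosine similarity between its pooled feature and the text embeddings. All function and variable names here are hypothetical.

```python
# Illustrative sketch only (assumed interfaces, not the paper's code):
# (1) discover object masks by iteratively bipartitioning a patch-affinity
#     graph with a normalized cut, then
# (2) ground each mask by late interaction with text embeddings.

import numpy as np


def normalized_cut(feats: np.ndarray) -> np.ndarray:
    """Bipartition patches via the Fiedler vector of the normalized Laplacian.

    feats: (N, D) L2-normalized patch features.
    Returns a boolean array of shape (N,) marking one side of the cut.
    """
    affinity = np.clip(feats @ feats.T, 0.0, None)     # non-negative cosine affinities
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree + 1e-8)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(feats)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)                    # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                             # second-smallest eigenvector
    return fiedler > 0.0


def discover_masks(feats: np.ndarray, n_cuts: int = 4) -> list[np.ndarray]:
    """Iteratively split the largest remaining region into candidate object masks."""
    masks = [np.ones(len(feats), dtype=bool)]
    for _ in range(n_cuts):
        i = int(np.argmax([m.sum() for m in masks]))
        largest = masks[i]
        if largest.sum() < 4:                           # too few patches to split further
            break
        idx = np.where(largest)[0]
        side = normalized_cut(feats[idx])
        if side.all() or not side.any():                # degenerate cut: stop splitting
            break
        part_a = np.zeros_like(largest); part_a[idx[side]] = True
        part_b = np.zeros_like(largest); part_b[idx[~side]] = True
        masks[i:i + 1] = [part_a, part_b]               # replace region by its two halves
    return masks


def ground_masks(feats: np.ndarray, masks: list[np.ndarray],
                 text_embs: np.ndarray) -> list[int]:
    """Late interaction: assign each discovered mask its best-matching class."""
    labels = []
    for m in masks:
        pooled = feats[m].mean(axis=0)
        pooled /= np.linalg.norm(pooled) + 1e-8
        labels.append(int(np.argmax(text_embs @ pooled)))  # cosine-similarity argmax
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch_feats = rng.normal(size=(196, 64))            # stand-in for 14x14 ViT patch tokens
    patch_feats /= np.linalg.norm(patch_feats, axis=1, keepdims=True)
    class_embs = rng.normal(size=(5, 64))               # stand-in for per-class text prompts
    class_embs /= np.linalg.norm(class_embs, axis=1, keepdims=True)

    masks = discover_masks(patch_feats, n_cuts=4)
    labels = ground_masks(patch_feats, masks, class_embs)
    print([(int(m.sum()), c) for m, c in zip(masks, labels)])
```

The "lazy" aspect shows up in the structure: text is consulted only after the masks exist, so segmentation quality does not depend on the class vocabulary. In practice the segmentation and grounding stages would likely use different feature spaces (visual-only features for the cuts, image-text-aligned features for the grounding); this sketch reuses one array for brevity.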