Skip to yearly menu bar Skip to main content


Poster

Think before Placement: Common Sense Enhanced Transformer for Object Placement

Yaxuan Qin · Jiayu Xu · Ruiping Wang · Xilin Chen

Strong blind review: This paper was not made available on public preprint services during the review process Strong Double Blind
[ ]
Fri 4 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Object placement is a task to insert a foreground object into a background scene at a suitable position and size. Existing methods mainly focus on extracting better visual features, while neglecting common sense about the objects and background. It leads to semantically unrealistic object positions. In this paper, we introduce Think Before Placement, a novel framework that effectively combines the implicit and explicit knowledge to generate placements that are both visually coherent and contextually appropriate. Specifically, we first adopt a large multi-modal model to generate a descriptive caption of the background (Think), then output proper position and size of the object (Place). The caption serves as an explicit semantic guidance for the subsequent placement of objects. Using this framework, we implement our model named CSENet, which outperforms baseline methods on the OPA dataset in extensive experiments. Further, we establish the OPAZ dataset to evaluate the zero-shot transfer capabilities of CSENet, where it also shows impressive zero-shot performance across different foreground objects and scenes.

Live content is unavailable. Log in and register to view live content