Poster
Bi-directional Contextual Attention for 3D Dense Captioning
Minjung Kim · Hyung Suk Lim · Soonyoung Lee · Bumsoo Kim · Gunhee Kim
# 119
Strong Double Blind
3D dense captioning is the task of localizing objects in a 3D scene and generating a description for each one. Recent approaches have attempted to incorporate contextual information by modeling relationships between object pairs or by aggregating the nearest-neighbor features of an object. However, the contextual information constructed this way is limited in two respects. First, an object has multiple positional relationships that exist across the entire global scene, not only near the object itself. Second, these approaches face contradicting objectives: localization and attribute descriptions benefit from tightly localized features, while descriptions involving global positional relations benefit from contextualized features of the global scene. To overcome this challenge, we introduce CSI, a transformer encoder-decoder pipeline that performs 3D dense captioning for each object with contextualized geometries (where the structural contexts relevant to each object are summarized) and contextualized objects (where the objects relevant to the summarized structural contexts are aggregated). This simple extension relieves previous methods of the contradicting objectives, enhancing localization performance while enabling the aggregation of contextual features throughout the global scene, and thus improving caption generation performance at the same time. Extensive experiments on two of the most widely used 3D dense captioning datasets (ScanRefer and Nr3D) demonstrate that our proposed method achieves a significant improvement over prior methods.
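The two-way flow of context described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attention form, so plain scaled dot-product attention without learned projections is assumed here, and the names `cross_attend` and `bidirectional_context` are hypothetical. Object features first attend to scene geometry features (contextualized geometries), and the resulting summaries then attend back to all objects (contextualized objects). The original object feature is kept alongside both context vectors, so localization cues are not overwritten by global context.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys):
    """Scaled dot-product attention; keys also serve as values (assumption)."""
    dim = len(keys[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, keys))
                    for d in range(dim)])
    return out


def bidirectional_context(object_feats, geometry_feats):
    """Hypothetical two-step context aggregation for each object."""
    # Step 1: each object summarizes the scene geometry relevant to it.
    ctx_geometry = cross_attend(object_feats, geometry_feats)
    # Step 2: each summary gathers the objects relevant to it, scene-wide.
    ctx_objects = cross_attend(ctx_geometry, object_feats)
    # Concatenate: [original object | contextualized geometry | contextualized objects].
    return [o + g + c for o, g, c in zip(object_feats, ctx_geometry, ctx_objects)]


objects = [[1.0, 0.0], [0.0, 1.0]]                # toy object features
geometry = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]  # toy scene geometry features
features = bidirectional_context(objects, geometry)
```

In this sketch a caption head would consume the full concatenated vector, while a localization head could read only the first slice, which is one simple way to keep the two objectives from competing for the same features.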