

Poster

Grounding Language Models for Visual Entity Recognition

Zilin Xiao · Ming Gong · Paola Cascante-Bonilla · Xingyao Zhang · Jie Wu · Vicente Ordonez

Poster #113
[ Project Page ] [ Paper PDF ]
Wed 2 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visual reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits in the recently proposed Oven-Wiki benchmark with accuracy on the Entity seen split rising from 32.7% to 61.5%. It demonstrates superior performance on the unseen and query splits by a substantial double-digit margin, while also preserving the ability to effectively transfer to other generic visual question answering benchmarks without further training.
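The abstract describes inference-time constrained generation: retrieved candidate entities restrict the decoder so that only valid continuations can be emitted. A common way to realize this is a prefix trie over the tokenized candidates. The sketch below is illustrative only, assuming tokenized candidates as integer-ID sequences and a hypothetical `score_fn` standing in for the language model's next-token scores; it is not the authors' implementation.

```python
def build_trie(candidates):
    """Build a prefix trie over tokenized candidate entity names.

    Each candidate is a sequence of token IDs; a None key marks the
    end of a complete candidate.
    """
    trie = {}
    for seq in candidates:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
        node[None] = True  # end-of-candidate marker
    return trie


def allowed_next_tokens(trie, prefix):
    """Return the tokens the decoder may emit after `prefix`.

    Any token outside this set would start an invalid decoding path,
    so it is pruned before sampling/argmax.
    """
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()  # prefix left the trie: no valid continuation
        node = node[tok]
    return {t for t in node if t is not None}


def greedy_constrained_decode(score_fn, trie):
    """Greedily decode one candidate, restricted to trie paths.

    `score_fn(prefix, token)` is a stand-in for the model's scores.
    """
    prefix, node = [], trie
    while set(node) != {None}:  # stop once a candidate is complete
        allowed = [t for t in node if t is not None]
        tok = max(allowed, key=lambda t: score_fn(prefix, t))
        prefix.append(tok)
        node = node[tok]
    return prefix
```

In practice this kind of constraint plugs into a beam search or sampling loop by masking the logits of disallowed tokens at each step; the trie guarantees that every finished hypothesis is exactly one of the retrieved candidates.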
