ECCV Poster FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Hang Hua · Jing Shi · Kushal Kafle · Simon Jenni · Daoan Zhang · John Collomosse · Scott Cohen · Jiebo Luo

2024 Poster

Abstract:

Recent progress in large-scale pre-training has produced advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite their impressive ability to perform complex reasoning, current VLMs often struggle to capture compositional information effectively and precisely on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark that focuses on text and image mismatch detection and correction. The benchmark introduces a novel task for boosting and evaluating VLMs’ compositionality in aspect-based fine-grained text and image matching. Given an image and a text caption with 0 to 3 mismatched aspects, a model must predict the mismatched aspect phrases, identify the class of each aspect, and suggest corrections. To evaluate model performance on this new task, we propose a new evaluation metric, ITM-IoU, which our experiments show correlates highly with human evaluation. We also provide a comprehensive experimental analysis of existing mainstream VLMs under both fully supervised learning and in-context learning settings. We find that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models with strong multimodal in-context learning abilities (e.g., GPT-4V, Gemini Pro Vision) are less skilled at fine-grained compositional image and text matching analysis than we might have expected. With FineMatch, we are able to build a system for detecting and correcting hallucinations in text-to-image generation.
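
As a rough illustration of the task output described in the abstract (a mismatched phrase, its aspect class, and a suggested correction, with 0 to 3 mismatches per caption), the sketch below models one prediction as a small Python record and applies the corrections to a caption. The names `AspectMismatch` and `apply_corrections`, the example caption, and the class labels `"attribute"` and `"entity"` are illustrative assumptions, not the benchmark's actual schema, and ITM-IoU is deliberately not implemented because the abstract does not define it.

```python
from dataclasses import dataclass
from typing import List

# The abstract states that each caption carries 0 to 3 mismatched aspects.
MAX_MISMATCHES = 3


@dataclass(frozen=True)
class AspectMismatch:
    """One predicted mismatch (hypothetical record layout)."""
    phrase: str        # mismatched phrase as it appears in the caption
    aspect_class: str  # class of the aspect; label names here are assumed
    correction: str    # suggested replacement that matches the image


def apply_corrections(caption: str, mismatches: List[AspectMismatch]) -> str:
    """Return the caption with each mismatched phrase replaced once."""
    if len(mismatches) > MAX_MISMATCHES:
        raise ValueError("a caption has at most 3 mismatched aspects")
    corrected = caption
    for m in mismatches:
        if m.phrase not in corrected:
            raise ValueError(f"phrase not found in caption: {m.phrase!r}")
        # Replace only the first occurrence to avoid touching other spans.
        corrected = corrected.replace(m.phrase, m.correction, 1)
    return corrected


if __name__ == "__main__":
    # Made-up example; not taken from the FineMatch data.
    caption = "A black dog sits on a red sofa."
    predicted = [
        AspectMismatch("black dog", "attribute", "brown dog"),
        AspectMismatch("red sofa", "entity", "wooden bench"),
    ]
    print(apply_corrections(caption, predicted))
```

A caption with no mismatches corresponds to an empty list, in which case the caption is returned unchanged.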
