Poster
Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation
Peixi Xiong · Michael A Kozuch · Nilesh Jain
# 325
Strong Double Blind |
Text-to-image generation plays a pivotal role in computer vision and natural language processing by translating textual descriptions into visual representations. However, understanding complex relations in detailed text prompts filled with rich relational content remains a significant challenge. To address this, we introduce a novel task: Logic-Rich Text-to-Image (LRT2I) generation. Unlike conventional image generation tasks that rely on short and structurally simple natural language inputs, our task focuses on intricate text inputs abundant in relational information. To tackle these complexities, we collect the Textual-Visual Logic (TV-Logic) dataset, designed to evaluate the performance of text-to-image generation models across diverse and complex scenarios. Furthermore, we propose a baseline model as a benchmark for this task. Our model comprises three key components: a negative pair discriminator, a relation understanding module, and a multimodality fusion module. These components enhance the model's ability to handle disturbances in informative tokens and prioritize relational elements during image generation.