

Poster

Understanding Multi-compositional learning in Vision and Language models via Category Theory

Sotirios Panagiotis Chytas · Hyunwoo J. Kim · Vikas Singh

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

The widespread use of pre-trained large language models (and multi-modal models) has led to strong performance across a wide range of tasks. Despite their effectiveness, we have limited understanding of how these models represent knowledge internally. Using the classic problem of Compositional Zero-Shot Learning (CZSL) as a motivating example, we provide a structured view of the latent space that any general model (LLM or otherwise) should nominally respect. Based on this view, we first provide a practical solution to the CZSL problem based on a cross-attention mechanism that handles Open-World and Closed-World single-attribute compositions, as well as multi-attribute compositions, with relative ease. On all three tasks, our approach is competitive with methods designed solely for that task (i.e., methods whose adaptation to the other tasks is difficult). We then extend this perspective to existing LLMs and ask to what extent they satisfy our axiomatic definitions. Our analysis yields a mix of interesting and unsurprising findings, but nonetheless suggests that our criteria are meaningful and may lead to more structured approaches to training such models, strategies for additional data collection, and diagnostics beyond visual inspection.
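To make the cross-attention idea concrete, the sketch below shows one plausible way to compose candidate attribute embeddings with object features via a standard multi-head cross-attention block, in the spirit of the CZSL setting (e.g., composing "red" with "car"). This is a minimal illustration, not the authors' implementation: the class name `AttrObjCrossAttention`, the dimensions, and the residual-plus-LayerNorm readout are all assumptions made for the example.

```python
# Minimal sketch (assumed design, not the paper's method) of cross-attention
# over attribute/object embeddings for compositional zero-shot learning.
import torch
import torch.nn as nn

class AttrObjCrossAttention(nn.Module):
    """Compose attribute and object embeddings via cross-attention.

    Queries come from candidate attribute embeddings; keys and values come
    from object (or image-patch) features, so each output is an
    attribute-conditioned readout of the object representation.
    """

    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, attr_emb: torch.Tensor, obj_emb: torch.Tensor) -> torch.Tensor:
        # attr_emb: (batch, num_attrs, embed_dim) -- one query per candidate attribute
        # obj_emb:  (batch, num_objs,  embed_dim) -- object features as keys and values
        composed, _ = self.attn(query=attr_emb, key=obj_emb, value=obj_emb)
        # Residual connection plus normalization (an assumed design choice).
        return self.norm(composed + attr_emb)

# Usage: produce attribute-conditioned composition features.
model = AttrObjCrossAttention()
attrs = torch.randn(1, 5, 512)    # e.g., embeddings of "red", "wet", "old", ...
objs = torch.randn(1, 10, 512)    # e.g., embeddings of "car", "apple", ...
compositions = model(attrs, objs)  # shape (1, 5, 512)
```

Because the queries enumerate candidate attributes rather than being tied to a fixed label set, a block like this can in principle score unseen attribute-object pairs (Open-World) as easily as seen ones (Closed-World), which is the property the abstract highlights.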
