Poster

VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

Xiangxiang Chu · Jianlin Su · Bo Zhang · Chunhua Shen

Tue 1 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Large language models are built on a transformer-based architecture to process textual inputs. Among the many open-source implementations, the LLaMA family of models stands out. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. We extensively evaluate its effectiveness using typical pre-training paradigms across a broad range of downstream tasks in image perception and, especially, image generation. In many cases, VisionLLaMA exhibits substantial gains over previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision understanding and generation.
