

Poster

Learning Video Context as Interleaved Multimodal Sequences

Kevin Qinghong Lin · Pengchuan Zhang · Difei Gao · Xide Xia · Joya Chen · Ziteng Gao · Jinheng Xie · Xuhong Xiao · Mike Zheng Shou

Strong blind review: This paper was not made available on public preprint services during the review process.
Wed 2 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Narrative videos, such as movies, pose significant challenges for video understanding because of their rich contexts (characters, dialogues, storylines) and diverse demands (identifying who is present \cite{autoad2}, recognizing relationships \cite{lfvu}, and reasoning \cite{movieqa}). In this paper, we introduce a multimodal language model developed to address this wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), built either by linking external knowledge databases or by using offline models (such as Whisper for subtitles). Through instruction tuning, this approach enables the language model to interact with videos using interleaved multimodal instructions. For example, instead of relying solely on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate the model's performance on six datasets (LVU, MAD, MovieNet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question answering).
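
To make the interleaving idea in the abstract concrete, the following is a minimal Python sketch of how character photos, names, a video clip, and subtitles might be arranged into one ordered sequence. It is an illustration under stated assumptions, not the authors' implementation: the `Segment` class, the `build_interleaved_sequence` and `to_prompt` helpers, the `<image>` and `<video>` placeholder tokens, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One element of an interleaved sequence (hypothetical structure)."""
    modality: str  # "text", "image", or "video"
    content: str   # raw text, or a path/ID for visual media


def build_interleaved_sequence(
    characters: List[Tuple[str, str]],  # (name, photo path) pairs
    video_path: str,
    subtitles: List[Tuple[str, str]],   # (speaker, line) pairs
    question: str,
) -> List[Segment]:
    """Interleave character photos, names, the clip, and dialogue."""
    seq: List[Segment] = []
    # Pair each character photo with the character's name so a model
    # could associate a face with an identity.
    for name, photo in characters:
        seq.append(Segment("image", photo))
        seq.append(Segment("text", f"This is {name}."))
    # The video clip itself.
    seq.append(Segment("video", video_path))
    # Subtitles, e.g. produced offline by a speech recognition model.
    for speaker, line in subtitles:
        seq.append(Segment("text", f"{speaker}: {line}"))
    # The instruction comes last.
    seq.append(Segment("text", question))
    return seq


def to_prompt(seq: List[Segment]) -> str:
    """Flatten to a text prompt, replacing visual media with placeholder
    tokens (assumed names) that a model would swap for visual features."""
    placeholders = {"image": "<image>", "video": "<video>"}
    return " ".join(placeholders.get(s.modality, s.content) for s in seq)


example = build_interleaved_sequence(
    characters=[("Alice", "alice.jpg"), ("Bob", "bob.jpg")],
    video_path="clip.mp4",
    subtitles=[("Alice", "Where were you?")],
    question="Who is speaking?",
)
print(to_prompt(example))
```

In this sketch the order of segments carries the association: each photo is immediately followed by its name, so the text and visual elements stay aligned when the sequence is flattened.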
