

Poster

Factorizing Text-to-Video Generation by Explicit Image Conditioning

Rohit Girdhar · Mannat Singh · Andrew Brown · Quentin Duval · Samaneh Azadi · Sai Saketh Rambhatla · Mian Akbar Shah · Xi Yin · Devi Parikh · Ishan Misra

Fri 4 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

We present FACT2V, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion, and multi-stage training) that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video. Our model also outperforms commercial solutions such as RunwayML’s Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user’s text prompt, where our generations are preferred 96% over prior work.
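
The two-step factorization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (factorized_text_to_video, animate_image), the toy stand-in models, and the list-based Image and Video types are all hypothetical, chosen only to show how the data flows. The text first produces an image, and then the text and that image together condition the video; image animation reuses the second step with a user-supplied image.

```python
from typing import Callable, List

# Toy stand-ins for real tensors (assumption: illustration only).
Image = List[List[float]]
Video = List[Image]

ImageModel = Callable[[str], Image]
VideoModel = Callable[[str, Image, int], Video]


def factorized_text_to_video(
    prompt: str,
    image_model: ImageModel,
    video_model: VideoModel,
    num_frames: int = 4,
) -> Video:
    """Step 1: text -> image. Step 2: (text, image) -> video."""
    image = image_model(prompt)
    return video_model(prompt, image, num_frames)


def animate_image(
    prompt: str,
    image: Image,
    video_model: VideoModel,
    num_frames: int = 4,
) -> Video:
    """Skip step 1 and condition the video on a user-provided image."""
    return video_model(prompt, image, num_frames)


# Hypothetical toy models so the sketch runs end to end.
def toy_image_model(prompt: str) -> Image:
    value = float(len(prompt))
    return [[value, value], [value, value]]


def toy_video_model(prompt: str, image: Image, num_frames: int) -> Video:
    # Each frame is the conditioning image shifted by its frame index,
    # so frame 0 equals the conditioning image.
    return [
        [[pixel + t for pixel in row] for row in image]
        for t in range(num_frames)
    ]


video = factorized_text_to_video(
    "a dog", toy_image_model, toy_video_model, num_frames=3
)
```

In this sketch the same video model serves both text-to-video and image animation, which reflects why the abstract says the factorized approach lends itself naturally to animating images from a text prompt.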
