Poster

HARIVO: Harnessing Text-to-Image Models for Video Generation

Mingi Kwon · Seoung Wug Oh · Yang Zhou · Joon-Young Lee · Difan Liu · Haoran Cai · Baqiao Liu · Feng Liu · Youngjung Uh

Strong Double Blind review: this paper was not made available on public preprint services during the review process.
Thu 3 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

We present a method for building diffusion-based video models from pretrained Text-to-Image (T2I) models, overcoming limitations of existing approaches. We propose a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while preserving the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation. Our method, built on the frozen Stable Diffusion model, simplifies training and allows seamless integration with off-the-shelf models such as ControlNet and DreamBooth. We demonstrate superior performance through extensive experiments and comparisons.
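The abstract does not spell out the temporal smoothness loss, but the general idea behind such losses is to penalize large differences between the latents of adjacent frames. The sketch below is a generic, hypothetical illustration of that idea in NumPy (the function name, array shapes, and the plain mean-squared-difference form are assumptions, not the paper's actual formulation):

```python
import numpy as np

def temporal_smoothness_loss(latents: np.ndarray) -> float:
    """Generic temporal smoothness penalty (illustrative, not HARIVO's loss).

    latents: array of shape (T, C, H, W) holding one latent per frame.
    Returns the mean squared difference between consecutive frames, so a
    static sequence scores 0 and temporally noisy sequences score higher.
    """
    diffs = latents[1:] - latents[:-1]  # frame-to-frame deltas, shape (T-1, C, H, W)
    return float(np.mean(diffs ** 2))

# A perfectly static "video" incurs no penalty; random frames incur a large one.
static_frames = np.ones((4, 2, 3, 3))
noisy_frames = np.random.default_rng(0).normal(size=(4, 2, 3, 3))
```

In practice a term like this would be added, with a weighting coefficient, to the usual diffusion denoising objective during training.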