Poster
Stable Video Portraits
Mirela Ostrek · Justus Thies
# 214
Strong Double Blind
Rapid advances in generative AI, and in text-to-image methods in particular, have transformed the way we interact with and perceive computer-generated imagery, as the near-flawless images of synthesized lifelike faces show. In parallel, much progress has been made in 3D face reconstruction using 3D Morphable Models (3DMMs). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D single-frame video generation method that outputs photorealistic videos of talking faces by leveraging the large pre-trained text-to-image prior Stable Diffusion (2D), controlled via a temporal sequence of 3DMMs (3D). To bridge the gap between 2D and 3D, the 3DMM conditionings are projected onto a 2D image plane, and person-specific finetuning is performed via ControlNet using a single identity video. For higher temporal stability, we introduce a novel inference procedure that considers the previous frame during the denoising process. Furthermore, we demonstrate a technique to morph the facial appearance of the person-specific avatar into a text-defined celebrity without test-time finetuning. We analyze the method quantitatively and qualitatively and show that it outperforms state-of-the-art monocular head avatar methods. Code and data will be released.
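The step of projecting 3DMM conditionings onto a 2D image plane can be illustrated with a minimal sketch. It assumes a simple pinhole camera and a point-based conditioning map. The function names (`project_vertices`, `rasterize_points`), the focal length, and the image size are hypothetical and chosen for illustration only; the abstract does not specify the camera model or the conditioning format used by the authors.

```python
import numpy as np


def project_vertices(vertices, focal, image_size):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates.

    Assumption: the principal point sits at the image center and the
    camera looks down the +z axis. This is not taken from the paper.
    """
    vertices = np.asarray(vertices, dtype=np.float64)
    x, y, z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    if np.any(z <= 0):
        raise ValueError("all vertices must lie in front of the camera (z > 0)")
    cx = cy = image_size / 2.0
    u = focal * x / z + cx
    v = focal * y / z + cy
    return np.stack([u, v], axis=1)


def rasterize_points(pixels, image_size):
    """Mark projected vertices that fall inside the image in a binary map."""
    cond = np.zeros((image_size, image_size), dtype=np.uint8)
    cols = np.round(pixels[:, 0]).astype(int)
    rows = np.round(pixels[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < image_size) & (rows >= 0) & (rows < image_size)
    cond[rows[inside], cols[inside]] = 255
    return cond


# Hypothetical usage: two mesh vertices, one of which projects outside the frame.
mesh_vertices = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
pixels = project_vertices(mesh_vertices, focal=100.0, image_size=64)
conditioning = rasterize_points(pixels, image_size=64)
```

In a per-frame setting, one such 2D map would be computed for each 3DMM in the temporal sequence and passed to the image model as its control signal. A real system would more likely render dense surface maps than sparse points.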