

Poster

Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Dogucan Yaman · Fevziye Irem Eyiokur Yaman · Leonard Bärmann · Hazim Kemal Ekenel · Alexander Waibel

# 344
Thu 3 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Talking face generation aims to create a realistic video with accurate lip synchronization and high visual quality from a given audio track and a reference video, while preserving identity and visual characteristics. In this paper, we start by identifying several issues with existing synchronization learning methods: unstable training, poor lip synchronization, and degraded visual quality, all caused by the lip-sync loss and SyncNet. We further tackle the lip leaking problem from the identity reference and propose a silent-lip generator, which aims to prevent lip leaking by changing the lips of the identity reference. We then introduce a stabilized synchronization loss and AVSyncNet to alleviate the problems caused by the lip-sync loss and SyncNet. Finally, we present an adaptive triplet loss to enhance visual quality and apply a post-processing technique to obtain high-quality videos. In our experiments, our model outperforms state-of-the-art methods in both visual quality and lip synchronization. Comprehensive ablation studies further validate our individual contributions as well as their complementary effects.
