

Poster

Co-speech Gesture Video Generation with 3D Human Meshes

Aniruddha Mahapatra · Richa Mishra · Ziyi Chen · Boyang Ding · Renda Li · Shoulei Wang · Jun-Yan Zhu · Peng Chang · Mei Han · Jing Xiao

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Thu 3 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Co-speech gesture video generation is an enabling technique for numerous digital human applications in the post-ChatGPT era. Substantial progress has been made in creating high-quality talking head videos, but existing hand gesture video generation methods are largely limited by the widely adopted 2D skeleton-based gesture representation and still struggle to generate realistic hands. We propose a novel end-to-end audio-driven co-speech video generation pipeline that synthesizes human speech videos from 3D human mesh-based representations. Building on this 3D mesh-based gesture representation, we present a mesh-grounded video generator that combines a mesh texture-map optimization step with a new conditional GAN-based network and outputs photorealistic gesture videos with realistic hands. Experiments on the TalkSHOW dataset demonstrate the effectiveness of our method over a baseline that uses a 2D skeleton-based representation.
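To make the second stage concrete, the sketch below shows how a conditional GAN generator can consume a rendered 3D mesh as its conditioning signal and emit an RGB video frame. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the class name MeshConditionedGenerator, the layer sizes, and the assumption that the condition is a 3-channel textured render (e.g., of an SMPL-X-style body mesh) are all hypothetical.

import torch
import torch.nn as nn

class MeshConditionedGenerator(nn.Module):
    """Illustrative mesh-conditioned generator: maps a rendered 3D-mesh
    condition image (assumed to be a textured body-mesh render) to an
    RGB frame. Layer widths and depth are placeholders, not the paper's
    architecture."""

    def __init__(self, cond_channels: int = 3, out_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # Downsample the mesh render to a compact feature map.
            nn.Conv2d(cond_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # Upsample back to full resolution as a photorealistic frame.
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(64),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, out_channels, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, mesh_render: torch.Tensor) -> torch.Tensor:
        return self.net(mesh_render)

# Usage: one hypothetical 512x512 textured mesh render -> one synthesized frame.
frame = MeshConditionedGenerator()(torch.randn(1, 3, 512, 512))
print(frame.shape)  # torch.Size([1, 3, 512, 512])

In a full pipeline of the kind the abstract describes, the conditioning render would come from audio-driven 3D mesh parameters after the texture-map optimization step, and the generator would be trained adversarially against real video frames.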
