

Poster

TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models

Jeongho Kim · Min-Jung Kim · Junsoo Lee · Choo Jaegul

# 292
Strong Double Blind: This paper was not made available on public preprint services during the review process.
Fri 4 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

Recent advances in diffusion models have shed light on Text-to-Video (T2V) and Image-to-Video (I2V) generation. As one line of work, pose-driven video generation from a reference image has also gained attention, showing the capability of realistic human dance synthesis. However, previous methods face two remaining challenges. First, the network that encodes the pose information is fine-tuned on pose videos from the target domain, and thus generalizes poorly to diverse poses. Second, because the models are driven by the provided pose videos, the outcomes inevitably depend on the performance of the off-the-shelf pose detector. In this paper, we present a pose-driven video generation method with a reference image that mitigates the aforementioned issues. Unlike previous methods, we utilize the pretrained ControlNet without fine-tuning to leverage its pre-acquired knowledge from a vast amount of pose-image-caption pairs. To keep the ControlNet frozen, we introduce a correspondence layer that enables the network to learn the correspondence between the pose and appearance features. Additionally, by adding a temporal layer to the ControlNet, we enhance robustness to pose detector outliers. Extensive experiments demonstrate that the proposed method achieves promising results in video synthesis tasks encompassing various poses.
