Oral Session
Oral 4B: Video Generation / Editing / Prediction
Auditorium
Moderators: Richard Zhang · Saining Xie
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
Bolin Lai · Xiaoliang Dai · Lawrence Chen · Guan Pang · James Rehg · Miao Liu
Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize the action frame conditioned on a user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric action datasets lack the detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models are sub-optimal at controlling the state transition of an action in egocentric image pixel space because of the domain gap. To this end, we propose to Learn EGOcentric (LEGO) action frame generation via visual instruction tuning. First, we introduce a prompt enhancement scheme to generate enriched action descriptions from a visual large language model (VLLM) by visual instruction tuning. Then we propose a novel method to leverage image and text embeddings from the VLLM as additional conditioning to improve the performance of a diffusion model. We validate our model on two egocentric datasets -- Ego4D and Epic-Kitchens. Our experiments show prominent improvements over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights into our method.
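To make the conditioning idea concrete, here is a minimal PyTorch sketch of one plausible way to fuse VLLM image and text embeddings into a diffusion model's cross-attention context; the module name, dimensions, and projection design are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class VLLMConditioner(nn.Module):
    """Project VLLM token embeddings into the diffusion model's context width
    and concatenate them into a single cross-attention conditioning sequence."""
    def __init__(self, vllm_dim=4096, cond_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(vllm_dim, cond_dim)
        self.image_proj = nn.Linear(vllm_dim, cond_dim)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (B, N_t, vllm_dim), image_tokens: (B, N_i, vllm_dim)
        ctx = torch.cat([self.text_proj(text_tokens),
                         self.image_proj(image_tokens)], dim=1)
        return ctx  # (B, N_t + N_i, cond_dim), fed to the UNet's cross-attention

cond = VLLMConditioner()
context = cond(torch.randn(2, 32, 4096), torch.randn(2, 64, 4096))
print(context.shape)  # torch.Size([2, 96, 768])
```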
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
Vikram Voleti · Chun-Han Yao · Mark Boss · Adam Letts · David Pankratz · Dmitrii Tochilkin · Christian Laforte · Robin Rombach · Varun Jampani
We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent works on 3D generation propose techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affecting the performance of 3D object generation. In this work, we propose SV3D, which adapts an image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of video models, while further adding explicit camera control for NVS. We also propose improved 3D optimization techniques that use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental results on multiple datasets with 2D and 3D metrics, as well as a user study, demonstrate SV3D's state-of-the-art performance on NVS as well as 3D reconstruction compared to prior works.
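As a small illustration of what explicit camera control for NVS can look like at the interface level, the sketch below builds a per-frame azimuth/elevation schedule for an orbital trajectory; the function name and pose format are assumptions, not SV3D's actual API.

```python
def orbital_cameras(num_frames=21, elevation_deg=10.0, full_turn_deg=360.0):
    """Evenly spaced azimuths around the object at a fixed elevation, one pose
    per generated frame of the orbital video."""
    poses = []
    for i in range(num_frames):
        poses.append({"azimuth_deg": (full_turn_deg * i / num_frames) % 360.0,
                      "elevation_deg": elevation_deg})
    return poses

for pose in orbital_cameras(num_frames=4):
    print(pose)
```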
Efficient Neural Video Representation with Temporally Coherent Modulation
Seungjun Shin · Suji Kim · Dokwan Oh
Implicit neural representations (INRs) have found successful applications across diverse domains. To employ INRs in real-life applications, it is important to speed up training. In the field of INRs for video, the state-of-the-art approach [26] employs grid-type trainable parameters and achieves a faster encoding speed than its predecessors [5]. Despite its time efficiency, using grid-type parameters without considering the dynamic nature of videos results in performance limitations. To enable learning video representations rapidly and effectively, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that captures dynamic characteristics by decomposing the spatio-temporal 3D video data into a set of 2D grids. Through this mapping, our framework processes temporally corresponding pixels at once, resulting in a more than 3× faster video encoding speed at reasonable video quality. It also achieves average improvements of 1.54 dB / 0.019 in PSNR/LPIPS on the UVG dataset (even with 10% fewer parameters) and 1.84 dB / 0.013 in PSNR/LPIPS on the MCL-JCV dataset, compared to previous work. Extending this to compression tasks, we demonstrate performance comparable to video compression standards (H.264, HEVC) and recent INR approaches for video compression. Additionally, we perform extensive experiments demonstrating the superior performance of our algorithm across diverse tasks, encompassing super-resolution, frame interpolation, and video inpainting.
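The sketch below illustrates the general idea of mapping spatio-temporal coordinates onto a trainable 2D feature grid: a small network maps (x, y, t) to a 2D grid location, features are sampled there, and a decoder predicts the pixel color. The mapping network, grid size, and decoder are assumptions for illustration, not NVTM's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Grid2DVideoRep(nn.Module):
    def __init__(self, grid_size=64, feat_dim=16):
        super().__init__()
        # trainable 2D feature grid shared across time
        self.grid = nn.Parameter(torch.randn(1, feat_dim, grid_size, grid_size) * 0.1)
        # small MLP mapping (x, y, t) in [0, 1] to a grid coordinate in [-1, 1]
        self.mapper = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                    nn.Linear(64, 2), nn.Tanh())
        self.decoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 3))

    def forward(self, xyt):                      # xyt: (N, 3)
        uv = self.mapper(xyt)                    # (N, 2) grid coords in [-1, 1]
        grid = uv.view(1, -1, 1, 2)              # grid_sample expects (B, H, W, 2)
        feat = F.grid_sample(self.grid, grid, align_corners=True)  # (1, C, N, 1)
        feat = feat.squeeze(0).squeeze(-1).t()   # (N, C)
        return self.decoder(feat)                # predicted RGB per coordinate

model = Grid2DVideoRep()
rgb = model(torch.rand(1024, 3))
print(rgb.shape)  # torch.Size([1024, 3])
```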
Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
Zhihang Zhong · Gurunandan Krishnan · Xiao Sun · Yu Qiao · Sizhuo Ma · Jian Wang
Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep t ("time indexing"), which makes it hard to predict precise object movements. Given two images of a baseball, there are infinitely many possible trajectories: accelerating or decelerating, straight or curved. This often results in blurry frames as the method averages out these possibilities. Instead of forcing the network to learn this complicated time-to-location mapping implicitly together with predicting the frames, we provide the network with an explicit hint on how far the object has traveled between the start and end frames, a novel approach termed "distance indexing". This method offers a clearer learning goal for models, reducing the uncertainty tied to object speeds. We further observe that, even with this extra guidance, objects can still be blurry, especially when they are equally far from both input frames (i.e., halfway in between), due to the directional ambiguity in long-range motion. To solve this, we propose an iterative reference-based estimation strategy that breaks down a long-range prediction into several short-range steps. When integrating our plug-and-play strategies into state-of-the-art learning-based models, they exhibit markedly sharper outputs and superior perceptual quality in arbitrary-time interpolation, using a uniform distance indexing map in the same format as time indexing. Additionally, distance indexing can be specified pixel-wise, which enables temporal manipulation of each object independently, offering a novel tool for video editing tasks like re-timing.
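A hedged sketch of how the two ideas could be used at inference time: a uniform distance-index map replaces the scalar time index, and a long-range prediction is broken into shorter hops. `vfi_model`, its `reference` keyword, and the per-step schedule are placeholders, not the authors' interface.

```python
import torch

def uniform_distance_map(height, width, d):
    """Constant map: every pixel is assumed to have covered a fraction d of its
    path between the two inputs (same tensor format as a time-index map)."""
    return torch.full((1, 1, height, width), float(d))

def interpolate_iteratively(vfi_model, frame0, frame1, d_target, num_steps=2):
    """Split one long-range prediction into shorter hops; each hop's output
    becomes the reference for the next."""
    _, _, h, w = frame0.shape          # frames assumed to be (B, C, H, W)
    pred = frame0
    for k in range(1, num_steps + 1):
        d_k = d_target * k / num_steps  # hypothetical linear schedule
        pred = vfi_model(frame0, frame1, uniform_distance_map(h, w, d_k),
                         reference=pred)
    return pred
```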
Video Editing via Factorized Diffusion Distillation
Uriel Singer · Amit Zohar · Yuval Kirstain · Shelly Sheynin · Adam Polyak · Devi Parikh · Yaniv Taigman
We introduce Emu Video Edit (EVE), a model that establishes a new state-of-the-art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.
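To make the factorized idea more concrete, here is a minimal, hypothetical sketch of distilling from two frozen teachers at once: the image-editing teacher supplies per-frame targets and the video-generation teacher supplies a clip-level target. The loss form and teacher signatures are illustrative placeholders, not the paper's exact Factorized Diffusion Distillation objective.

```python
import torch
import torch.nn.functional as F

def factorized_distillation_loss(student_pred, noisy_frames, t, text_emb,
                                 image_teacher, video_teacher,
                                 w_image=1.0, w_video=1.0):
    """student_pred and teacher outputs: (B, F, C, H, W) denoising predictions."""
    with torch.no_grad():
        # (i) per-frame editing targets from the frozen image teacher
        frame_targets = torch.stack(
            [image_teacher(noisy_frames[:, i], t, text_emb)
             for i in range(noisy_frames.shape[1])], dim=1)
        # (ii) clip-level target from the frozen video teacher (temporal consistency)
        clip_target = video_teacher(noisy_frames, t, text_emb)
    return (w_image * F.mse_loss(student_pred, frame_targets) +
            w_video * F.mse_loss(student_pred, clip_target))
```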
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
Jiazhi Guan · Zhiliang Xu · Hang Zhou · Kaisiyuan Wang · Shengyi He · Zhanwang Zhang · Borong Liang · Haocheng Feng · Errui Ding · Jingtuo Liu · Jingdong Wang · Youjian Zhao · Ziwei Liu
Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework, ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio in real time, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
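An illustration-only sketch of the "rewiring" idea: motion codes derived from audio-driven 3D facial dynamics are projected into per-layer style offsets of a style-based generator, so appearance and motion are fused in one forward pass. All module names and dimensions here are assumptions, not ReSyncer's implementation.

```python
import torch
import torch.nn as nn

class StyleRewire(nn.Module):
    def __init__(self, motion_dim=256, style_dim=512, num_layers=14):
        super().__init__()
        # one projection per generator layer, mapping motion codes to style offsets
        self.to_style = nn.ModuleList(
            [nn.Linear(motion_dim, style_dim) for _ in range(num_layers)])

    def forward(self, appearance_styles, motion_code):
        # appearance_styles: (B, num_layers, style_dim) from the identity image
        # motion_code: (B, motion_dim) predicted by the style-injected Transformer
        rewired = [w + proj(motion_code)
                   for w, proj in zip(appearance_styles.unbind(1), self.to_style)]
        return torch.stack(rewired, dim=1)  # styles now carry lip/face motion
```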
Audio-Synchronized Visual Animation
Lin Zhang · Shentong Mo · Yijing Zhang · Pedro Morgado
Current visual generation methods can produce high-quality videos guided by text. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue for generating temporally synchronized image animations. We introduce Audio-Synchronized Visual Animation (ASVA), a task of animating a static image to demonstrate motion dynamics, temporally guided by audio clips across multiple classes. To this end, we present AVSync15, a dataset curated from VGGSound with videos featuring synchronized audio-visual events across 15 categories. We also present a diffusion model, AVSyncD, capable of generating dynamic animations guided by audio. Extensive evaluations validate AVSync15 as a reliable benchmark for synchronized generation and demonstrate our model's superior performance. We further explore AVSyncD's potential in a variety of audio-synchronized generation tasks, from generating full videos without a base image to controlling object motions with various sounds. We hope our established benchmark can open new avenues for controllable visual generation.
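As a hedged sketch of audio-synchronized conditioning, each frame's visual tokens attend to the audio tokens of its time window, injecting sound-aligned context into the generation process. The alignment scheme and dimensions are assumptions for illustration, not AVSyncD's released architecture.

```python
import torch
import torch.nn as nn

class AudioFrameCrossAttention(nn.Module):
    def __init__(self, dim=320, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens, audio_tokens):
        # frame_tokens: (B*F, N_patches, dim); audio_tokens: (B*F, N_audio, dim),
        # where each frame is paired with the audio features of its time window
        out, _ = self.attn(query=frame_tokens, key=audio_tokens, value=audio_tokens)
        return frame_tokens + out  # residual injection of audio context
```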
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Jinbo Xing · Menghan Xia · Yong Zhang · Haoxin Chen · Wangbo Yu · Hanyuan Liu · Gongye Liu · Xintao Wang · Ying Shan · Tien-Tsin Wong
Animating a still image offers an engaging visual experience. Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g. clouds and fluid) or domain-specific motions (e.g. human hair or body motions), which limits their applicability to more general visual content. To overcome this limitation, we explore the synthesis of dynamic content for open-domain images, converting them into animated videos. The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance. Given an image, we first project it into a text-aligned rich context representation space using a query transformer, which helps the video model digest the image content in a compatible fashion. However, some visual details are still difficult to preserve in the resulting videos. To supply more precise image information, we further feed the full image to the diffusion model by concatenating it with the initial noise. Experimental results show that our proposed method can produce visually convincing and more logical and natural motions, as well as higher conformity to the input image. Comparative evaluation demonstrates the notable superiority of our approach over existing competitors. The source code will be released upon publication.
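A minimal sketch (assumed shapes and names, not the released code) of the two conditioning paths described above: learned queries attend to image features to build a text-aligned context sequence, and the image latent is concatenated with the initial noise along the channel axis.

```python
import torch
import torch.nn as nn

class ImageContextQueries(nn.Module):
    """Learned queries that attend to image features (e.g. CLIP patch tokens)
    to produce a rich context sequence for the video model's cross-attention."""
    def __init__(self, num_queries=16, dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):                      # (B, N, dim)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        ctx, _ = self.attn(q, image_feats, image_feats)  # (B, num_queries, dim)
        return ctx

def concat_image_with_noise(image_latent, noise):
    # image_latent and noise: (B, C, F, H, W), the image latent repeated per frame;
    # channel-wise concatenation supplies precise pixel-level guidance to the UNet
    return torch.cat([noise, image_latent], dim=1)
```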
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
Rui Zhao · Yuchao Gu · Jay Zhangjie Wu · Junhao Zhang · Jiawei Liu · Weijia Wu · Jussi Keppo · Mike Zheng Shou
Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generation. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. Adaptation methods have been developed for customizing appearance, such as subject or style, yet remain under-explored for motion. It is straightforward to extend mainstream adaptation methods to motion customization, including full model tuning and Low-Rank Adaptations (LoRAs). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRA architecture to decouple the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as mixing the appearance and motion of different videos, and animating a single image with customized motions. Our code and model weights will be released.
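To make the dual-path design concrete, here is a minimal LoRA sketch: a frozen base projection plus a trainable low-rank update, instantiated once for a spatial (appearance) projection and once for a temporal (motion) projection. Wiring into a real text-to-video UNet, and the exact layer split, are assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=4, scale=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity update
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Dual-path idea: spatial attention projections get "appearance" LoRAs, temporal
# attention projections get "motion" LoRAs; at inference the motion LoRAs can be
# paired with new appearances.
spatial_proj = LoRALinear(nn.Linear(320, 320))   # appearance path
temporal_proj = LoRALinear(nn.Linear(320, 320))  # motion path
```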
ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
Fu-Yun Wang · Zhaoyang Huang · Qiang Ma · Guanglu Song · Xudong LU · Weikang Bian · Yijin Li · Yu Liu · Hongsheng LI
Although video generation has made great progress in capacity and controllability and is gaining increasing attention, currently available video generation models have made minimal progress in the length of the videos they can generate. Due to the lack of well-annotated long video data, high training/inference cost, and flaws in model designs, current video generation models can only generate videos of 2-4 seconds, greatly limiting their applications and the creativity of users. We present ZoLA, a zero-shot method for creative long animation generation with short video diffusion models and even with short video consistency models (a new family of generative models known for fast generation with top-performing quality). In addition to extending generation to long animations (dozens of seconds), ZoLA, as a zero-shot method, can be easily combined with existing community adapters (developed only for image or short video models) for more innovative results, including control-guided animation generation/editing, motion customization/alteration, and multi-prompt conditioned animation generation. Importantly, all of this can be done with a commonly affordable GPU (12 GB for 32-second animations) and modest inference time (90 seconds for denoising 32-second animations with consistency models). Experiments validate the effectiveness of ZoLA, showing great potential for creative long animation generation.
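The sketch below is not ZoLA's algorithm (the abstract does not spell it out); it shows one generic zero-shot strategy for reusing a short-clip denoiser on a longer latent sequence: denoise overlapping temporal windows and average the overlaps. `denoise_window` is a placeholder for any short video model's denoising call.

```python
import torch

def denoise_long_latent(denoise_window, long_latent, t, window=16, stride=8):
    """long_latent: (B, C, F_long, H, W); denoise_window handles <= `window` frames."""
    out = torch.zeros_like(long_latent)
    count = torch.zeros_like(long_latent)
    f_total = long_latent.shape[2]
    starts = list(range(0, max(f_total - window, 0) + 1, stride))
    if starts[-1] + window < f_total:       # make sure the tail frames are covered
        starts.append(f_total - window)
    for start in starts:
        sl = slice(start, start + window)
        out[:, :, sl] += denoise_window(long_latent[:, :, sl], t)
        count[:, :, sl] += 1.0
    return out / count.clamp(min=1.0)       # average the overlapping predictions
```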
Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction
Lin Zhu · Yunlong Zheng · Yijun Zhang · Xiao Wang · Lizhi Wang · Hua Huang
Event-based video reconstruction has garnered increasing attention due to its advantages, such as high dynamic range and rapid motion capture capabilities. However, current methods often prioritize the extraction of temporal information from the continuous event flow, leading to an overemphasis on low-frequency texture features in the scene and, in turn, over-smoothing and blurry artifacts. Addressing this challenge requires integrating conditional information, encompassing temporal features, low-frequency texture, and high-frequency events, to guide the Denoising Diffusion Probabilistic Model (DDPM) toward accurate and natural outputs. To tackle this issue, we introduce the Temporal Residual Guided Diffusion Framework, a novel approach that effectively leverages both temporal and frequency-based event priors. Our framework incorporates three key conditioning modules: a pre-trained low-frequency intensity estimation module, a temporal recurrent encoder module, and an attention-based high-frequency prior enhancement module. To capture temporal scene variations from the events at the current moment, we employ a temporal-domain residual image as the target of the diffusion model. Through the combination of these three conditioning paths and the temporal residual framework, our approach excels at reconstructing high-quality videos from event flow, mitigating issues such as artifacts and over-smoothing commonly observed in previous approaches. Extensive experiments conducted on multiple benchmark datasets validate the superior performance of our framework compared to prior event-based reconstruction methods. Our code will be released upon acceptance.
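A hedged sketch of the temporal-residual idea: a diffusion sampler (placeholder `sample_residual`) generates the residual between consecutive frames conditioned on event features, and each frame is recovered by adding that residual to the previous reconstruction. The interface and clamping are assumptions for illustration.

```python
import torch

def reconstruct_video(sample_residual, init_frame, event_features_per_step):
    """sample_residual(prev_frame, event_feats) -> predicted temporal residual,
    i.e. the diffusion model's target in the temporal-residual formulation."""
    frames = [init_frame]
    for event_feats in event_features_per_step:
        residual = sample_residual(frames[-1], event_feats)
        frames.append((frames[-1] + residual).clamp(0.0, 1.0))
    return torch.stack(frames, dim=0)  # reconstructed frame sequence
```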