

Oral Session

Oral 6A: Generative Models II

Moderators: Niculae Sebe · Vicky Kalogeiton

Thu 3 Oct 4:30 a.m. PDT — 6:30 a.m. PDT

Thu 3 Oct. 4:30 - 4:40 PDT

Award Candidate
Controlling the World by Sleight of Hand

Sruthi Sudhakar · Ruoshi Liu · Basile Van Hoorick · Carl Vondrick · Richard Zemel

Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform fine-grained object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose learning to model interactions through a novel form of visual conditioning: hands. Hands are a natural way to specify control through actions such as grasping, pulling, pushing, etc. Given an input image and a representation of a hand interacting with the scene, our approach, CoSHAND, synthesizes a depiction of what the scene would look like after the interaction has occurred. We show that CoSHAND is able to recover the dynamics of manipulation by learning from large amounts of unlabeled videos of human hands interacting with objects, and leveraging internet-scale latent diffusion model priors. The model demonstrates strong capabilities on a variety of actions and object types beyond the dataset, and the ability to generate multiple possible futures depending on the actions performed. CoSHAND is also able to generalize zero-shot to tasks where the agent is a robot arm rather than a human hand. Our hand-conditioned model has several exciting applications in robotic planning and augmented or virtual reality.
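A minimal sketch of the kind of hand-conditioned denoiser the abstract describes, assuming the conditioning is a concatenation of the scene latent with a rendered hand signal; the class name, channel sizes, and backbone are illustrative placeholders, not the authors' architecture.

    import torch
    import torch.nn as nn

    class HandConditionedDenoiser(nn.Module):
        def __init__(self, latent_ch=4, cond_ch=4 + 3):  # scene latent + RGB hand rendering
            super().__init__()
            # stand-in for a pretrained latent-diffusion UNet backbone
            self.backbone = nn.Sequential(
                nn.Conv2d(latent_ch + cond_ch, 64, 3, padding=1), nn.SiLU(),
                nn.Conv2d(64, latent_ch, 3, padding=1),
            )

        def forward(self, noisy_latent, scene_latent, hand_render):
            # concatenate the noisy target latent with the visual conditioning
            # (timestep embedding omitted in this stub)
            x = torch.cat([noisy_latent, scene_latent, hand_render], dim=1)
            return self.backbone(x)  # predicted noise

    denoiser = HandConditionedDenoiser()
    noisy = torch.randn(1, 4, 32, 32)
    scene = torch.randn(1, 4, 32, 32)   # VAE latent of the input image
    hand = torch.randn(1, 3, 32, 32)    # rendered hand specifying the action
    eps = denoiser(noisy, scene, hand)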

Thu 3 Oct. 4:40 - 4:50 PDT

Pyramid Diffusion for Fine 3D Large Scene Generation

Yuheng Liu · Xinke Li · Xueting Li · Lu Qi · Chongshou Li · Ming-Hsuan Yang

Diffusion models have shown remarkable results in generating 2D images and small-scale 3D objects. However, their application to the synthesis of large-scale 3D scenes has been rarely explored. This is mainly due to the inherent complexity and bulky size of 3D scenery data, particularly outdoor scenes, and the limited availability of comprehensive real-world datasets, which makes training a stable scene diffusion model challenging. In this work, we explore how to effectively generate large-scale 3D scenes using the coarse-to-fine paradigm. We introduce a framework, the Pyramid Discrete Diffusion model (PDD), which employs scale-varied diffusion models to progressively generate high-quality outdoor scenes. Experimental results demonstrate that PDD can generate 3D scenes both unconditionally and conditionally. We further showcase the data compatibility of the PDD model, owing to its multi-scale architecture: a PDD model trained on one dataset can be easily fine-tuned on another dataset. The source code and trained models will be made available to the public.
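A rough sketch of the coarse-to-fine loop described above, assuming each scale's sampler is conditioned on an upsampled version of the previous, coarser scene; the resolutions, class count, and sampler below are placeholders rather than the paper's components.

    import torch
    import torch.nn.functional as F

    def sample_scale(shape, cond=None):
        # stand-in for one conditional discrete-diffusion sampling run at this
        # resolution; a real model would denoise a categorical voxel grid while
        # conditioning on the upsampled coarser scene `cond`
        return torch.randint(0, 20, shape)   # 20 hypothetical semantic classes

    scales = [(32, 32, 4), (64, 64, 8), (128, 128, 16)]   # coarse -> fine voxel grids
    scene = None
    for d, h, w in scales:
        cond = None
        if scene is not None:
            # nearest-neighbour upsample the coarser scene as conditioning
            cond = F.interpolate(scene.float()[None, None], size=(d, h, w),
                                 mode="nearest").long()[0, 0]
        scene = sample_scale((d, h, w), cond=cond)
    print(scene.shape)   # torch.Size([128, 128, 16]) -- the finest-scale grid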

Thu 3 Oct. 4:50 - 5:00 PDT

FMBoost: Boosting Latent Diffusion with Flow Matching

Johannes Schusterbauer-Fischer · Ming Gui · Pingchuan Ma · Nick Stracke · Stefan Andreas Baumann · Tao Hu · Bjorn Ommer

Visual synthesis has recently seen significant leaps in performance, in particular due to breakthroughs in generative models. Diffusion models have been a key enabler as they excel in image diversity. This, however, comes at the price of slow training and synthesis, which is only partially alleviated by latent diffusion. To this end, flow matching is an appealing approach due to its complementary characteristics of faster training and inference but less diverse synthesis. We demonstrate that introducing flow matching between a frozen diffusion model and a convolutional decoder enables high-resolution image synthesis at reduced computational cost and model size. A small diffusion model can then provide the necessary visual diversity effectively, while flow matching efficiently enhances resolution and details by mapping from the small latent space to a high-dimensional one. These latents are then projected to high-resolution images by the subsequent convolutional decoder of the diffusion approach. Combining the diversity of diffusion models, the efficiency of flow matching, and the effectiveness of convolutional decoders, our approach achieves state-of-the-art high-resolution image synthesis at $1024^2$ pixels with minimal computational cost. Cascading our model optionally boosts this further to $2048^2$ pixels. Importantly, our approach is orthogonal to recent approximation and speed-up strategies for the underlying model, making it easily integrable into the various diffusion model frameworks.
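A schematic of the pipeline described above, assuming the flow-matching stage integrates a learned velocity field with a few Euler steps to lift a small diffusion latent to a higher-resolution latent before VAE decoding; the module and the starting-point choice are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FlowMatchingUpsampler(nn.Module):
        def __init__(self, ch=4):
            super().__init__()
            # stand-in for the learned velocity field v(z, t); time conditioning omitted
            self.velocity = nn.Conv2d(ch, ch, 3, padding=1)

        def forward(self, z_lowres, steps=4):
            # start from the upsampled low-res latent and integrate dz/dt = v(z, t)
            z = F.interpolate(z_lowres, scale_factor=4, mode="bilinear",
                              align_corners=False)
            dt = 1.0 / steps
            for _ in range(steps):
                z = z + dt * self.velocity(z)   # Euler step of the flow-matching ODE
            return z

    z_small = torch.randn(1, 4, 32, 32)        # output of the small latent diffusion model
    z_big = FlowMatchingUpsampler()(z_small)   # 128x128 latent; pass to the VAE decoder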

Thu 3 Oct. 5:00 - 5:10 PDT

ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

Shaozhe Hao · Kai Han · Zhengyao Lv · Shihao Zhao · Kwan-Yee K. Wong

While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts. Given an image that contains multiple concepts, the task aims to extract and recreate individual concepts solely relying on the existing knowledge from pretrained diffusion models. To achieve this, we present ConceptExpress that tackles UCE by unleashing the inherent capabilities of pretrained diffusion models in two aspects. Specifically, a concept localization approach automatically locates and disentangles salient concepts by leveraging spatial correspondence from diffusion self-attention; and based on the lookup association between a concept and a conceptual token, a concept-wise optimization process learns discriminative tokens that represent each individual concept. Finally, we establish an evaluation protocol tailored for the UCE task. Extensive experiments demonstrate that ConceptExpress is a promising solution to the UCE task.
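A hedged sketch of the concept-wise token optimization described above: each discovered concept gets a learnable token, optimized with a masked denoising loss so it only has to explain its own spatial region. The masks would come from self-attention clustering; the loss and diffusion model below are stand-ins.

    import torch

    num_concepts, dim = 3, 768
    concept_tokens = torch.nn.Parameter(torch.randn(num_concepts, dim) * 0.02)
    optimizer = torch.optim.AdamW([concept_tokens], lr=1e-3)

    def masked_denoise_loss(token, mask):
        # placeholder for: run the frozen diffusion model conditioned on `token`
        # and compare predicted vs. true noise only inside the concept's mask
        pred_noise = token.sum() * torch.ones_like(mask)   # stand-in prediction
        true_noise = torch.zeros_like(mask)
        return ((pred_noise - true_noise) ** 2 * mask).mean()

    masks = torch.rand(num_concepts, 64, 64) > 0.5         # from self-attention clustering
    for step in range(100):
        loss = sum(masked_denoise_loss(concept_tokens[k], masks[k].float())
                   for k in range(num_concepts))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()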

Thu 3 Oct. 5:10 - 5:20 PDT

Exact Diffusion Inversion via Bidirectional Integration Approximation

Guoqiang Zhang · j.p. lewis · W. Bastiaan Kleijn

Recently, various methods have been proposed to address the inconsistency issue of DDIM inversion to enable image editing, such as EDICT [39] and Null-text inversion [23]. However, the above methods introduce considerable computational overhead. In this paper, we propose a new technique, named bi-directional integration approximation (BDIA), to perform exact diffusion inversion with negligible computational overhead. Suppose we would like to estimate the next diffusion state $z_{i-1}$ at timestep $t_i$ with the historical information $(i, z_i)$ and $(i+1, z_{i+1})$. We first obtain the estimated Gaussian noise $\epsilon(z_i, i)$, and then apply the DDIM update procedure twice to approximate the ODE integration over the next time-slot $[t_i, t_{i-1}]$ in the forward manner and the previous time-slot $[t_i, t_{i+1}]$ in the backward manner. The DDIM step for the previous time-slot is used to refine the integration approximation made earlier when computing $z_i$. A nice property of BDIA-DDIM is that the update expression for $z_{i-1}$ is a linear combination of $(z_{i+1}, z_i, \epsilon(z_i, i))$. This allows for exact backward computation of $z_{i+1}$ given $(z_i, z_{i-1})$, thus leading to exact diffusion inversion. We perform a convergence analysis for BDIA-DDIM that includes the analysis for DDIM as a special case. Experiments demonstrate that BDIA-DDIM is effective for (round-trip) image editing. Our experiments further show that BDIA-DDIM produces markedly better image sampling quality than DDIM and EDICT for text-to-image generation and conventional image sampling.
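A schematic of the linear-combination structure described above (a reading of the abstract, not the paper's exact equations): `ddim_delta(z, t_from, t_to)` is assumed to return the increment of a single DDIM step from timestep t_from to t_to, computed from the noise estimate $\epsilon(z, t_{from})$. Because the update is linear in the three quantities, it can be rearranged to invert exactly.

    def bdia_step(z_ip1, z_i, ddim_delta, i):
        """Forward update: z_{i-1} from (z_{i+1}, z_i)."""
        return z_ip1 + ddim_delta(z_i, i, i - 1) - ddim_delta(z_i, i, i + 1)

    def bdia_invert(z_im1, z_i, ddim_delta, i):
        """Exact inversion: recover z_{i+1} from (z_i, z_{i-1}) by rearranging
        the same linear combination -- no fixed-point iteration or auxiliary
        states are needed, hence negligible extra cost."""
        return z_im1 - ddim_delta(z_i, i, i - 1) + ddim_delta(z_i, i, i + 1)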

Thu 3 Oct. 5:20 - 5:30 PDT

Tackling Structural Hallucination in Image Translation with Local Diffusion

Seunghoi Kim · Chen Jin · Tom Diethe · Matteo Figini · Henry FJ Tregidgo · Asher Mullokandov · Philip A Teare · Daniel Alexander

Recent developments in diffusion models have advanced conditioned image generation, yet they struggle with reconstructing out-of-distribution (OOD) images, such as unseen tumors, causing 'image hallucination' and risking misdiagnosis. We hypothesize that such hallucinations result from local OOD regions in the conditional images. By partitioning the OOD region and conducting separate generations, hallucinations can be alleviated, and we verify this with motivational studies in several applications. From this, we propose a training-free diffusion framework that reduces hallucination by performing multiple Local Diffusion processes. Our approach involves OOD estimation followed by two diffusion modules: a 'branching' module for local image generations from OOD estimations, and a 'fusion' module to integrate these predictions into a full image cohesively. These modules adapt to each testing dataset by updating an auxiliary classifier. Our evaluation shows that our method improves on baseline models quantitatively and qualitatively across different datasets. It also works well with various pre-trained diffusion models as a plug-and-play option.
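A minimal sketch of the branching/fusion idea, assuming an OOD probability map and a generic region-restricted generation routine; the real method's modules, masking strategy, and auxiliary-classifier update are not reproduced here.

    import torch

    def run_diffusion(cond_image, region_mask):
        # placeholder for one conditional diffusion generation restricted to a region
        return cond_image * region_mask

    def local_diffusion(cond_image, ood_prob, thresh=0.5):
        ood_mask = (ood_prob > thresh).float()
        ind_mask = 1.0 - ood_mask
        # 'branching': generate in-distribution and OOD regions separately
        pred_ind = run_diffusion(cond_image, ind_mask)
        pred_ood = run_diffusion(cond_image, ood_mask)
        # 'fusion': composite the two local predictions into one full image
        return pred_ind * ind_mask + pred_ood * ood_mask

    img = torch.rand(1, 3, 64, 64)
    ood = torch.rand(1, 1, 64, 64)
    out = local_diffusion(img, ood)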

Thu 3 Oct. 5:30 - 5:40 PDT

Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

Sojin Lee · Dogyun Park · Inho Kong · Hyunwoo J. Kim

Recent studies on inverse problems have proposed posterior samplers that leverage pre-trained diffusion models as a powerful prior. These attempts have paved the way for using diffusion models in a wide range of inverse problems. However, existing methods entail computationally demanding iterative sampling procedures and optimize a separate solution for each measurement, which leads to limited scalability and a lack of generalization capability across unseen samples. To address these limitations, we propose a novel approach, Diffusion prior-based Amortized Variational Inference (DAVI), that solves inverse problems with a diffusion prior from an amortized variational inference perspective. Specifically, instead of separate measurement-wise optimization, our amortized inference learns a function that directly maps measurements to the implicit posterior distributions of the corresponding clean data, enabling single-step posterior sampling even for unseen measurements. The proposed method learns this function by minimizing the Kullback-Leibler divergence between the implicit distributions and the true posterior distributions over multiple measurements, using objectives derived from variational inference. Extensive experiments on three image restoration tasks, i.e., Gaussian deblur, 4x super-resolution, and box inpainting, with two benchmark datasets demonstrate our superior performance over strong diffusion model-based methods.
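A hedged sketch of the amortization pattern: a single feed-forward network maps a measurement (plus injected noise, so the output is a posterior sample rather than a point estimate) directly to a clean estimate, so unseen measurements need no per-sample optimization. The training objective below is only a schematic stand-in for the paper's KL-based objective with a diffusion prior.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    amortizer = nn.Sequential(nn.Conv2d(3 + 3, 32, 3, padding=1), nn.SiLU(),
                              nn.Conv2d(32, 3, 3, padding=1))
    opt = torch.optim.Adam(amortizer.parameters(), lr=1e-4)

    def degrade(x):                      # known forward operator, e.g. 2x downsampling
        return F.avg_pool2d(x, 2)

    for _ in range(10):                  # toy training loop over measurements
        y = torch.rand(4, 3, 32, 32)     # measurements
        z = torch.randn(4, 3, 64, 64)    # noise input making the posterior stochastic
        y_up = F.interpolate(y, scale_factor=2)
        x_hat = amortizer(torch.cat([y_up, z], dim=1))   # single-step posterior sample
        data_fit = ((degrade(x_hat) - y) ** 2).mean()
        prior_term = x_hat.abs().mean()  # placeholder for the diffusion-prior / KL term
        loss = data_fit + 0.1 * prior_term
        opt.zero_grad(); loss.backward(); opt.step()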

Thu 3 Oct. 5:40 - 5:50 PDT

Adversarial Diffusion Distillation

Axel Sauer · Dominik Lorenz · Andreas Blattmann · Robin Rombach

We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1--4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models.
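A schematic of the two-term student objective described above: an adversarial loss on the student's one-step samples plus a distillation term pulling them toward a frozen teacher diffusion model. All networks here are stubs, and the exact losses, re-noising schedule, and weighting follow the paper rather than this sketch.

    import torch
    import torch.nn as nn

    student = nn.Conv2d(3, 3, 3, padding=1)         # one-step generator (stub)
    discriminator = nn.Conv2d(3, 1, 3, padding=1)   # patch discriminator (stub)
    def teacher_denoise(x_noisy, t):                # frozen teacher diffusion model (stub)
        return x_noisy.detach() * 0.9

    noise = torch.randn(2, 3, 64, 64)
    x_student = student(noise)                                  # single sampling step
    adv_loss = -discriminator(x_student).mean()                 # adversarial generator loss
    t = torch.randint(0, 1000, (1,))
    x_noisy = x_student + 0.1 * torch.randn_like(x_student)     # re-noise the student sample
    distill_loss = ((x_student - teacher_denoise(x_noisy, t)) ** 2).mean()
    loss = adv_loss + 1.0 * distill_loss                        # weighting assumed here
    loss.backward()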

Thu 3 Oct. 5:50 - 6:00 PDT

Arc2Face: A Foundation Model for ID-Consistent Human Faces

Foivos Paraperas Papantoniou · Alexandros Lattas · Stylianos Moschoglou · Jiankang Deng · Bernhard Kainz · Stefanos Zafeiriou

This paper presents Arc2Face, an identity-conditioned face foundation model which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a degree of face similarity unmatched by existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.
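A hedged sketch of ID-conditioning: a 512-d ArcFace embedding is projected into a short sequence of pseudo-tokens in the text-encoder embedding space and used as the diffusion model's cross-attention context in place of a text prompt. The projection design and token count below are assumptions, not the paper's.

    import torch
    import torch.nn as nn

    class IDProjector(nn.Module):
        def __init__(self, id_dim=512, token_dim=768, num_tokens=4):
            super().__init__()
            self.proj = nn.Linear(id_dim, token_dim * num_tokens)
            self.num_tokens, self.token_dim = num_tokens, token_dim

        def forward(self, id_embedding):
            tokens = self.proj(id_embedding)
            return tokens.view(-1, self.num_tokens, self.token_dim)

    arcface_embedding = torch.randn(1, 512)          # from a frozen ArcFace network
    cond_tokens = IDProjector()(arcface_embedding)   # (1, 4, 768) cross-attention context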

Thu 3 Oct. 6:00 - 6:10 PDT

Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning

Jinglin Liang · Jin Zhong · Hanlin Gu · Zhongqi Lu · Xingxing Tang · Gang Dai · Shuangping Huang · Lixin Fan · Qiang Yang

Federated Class Continual Learning (FCCL) merges the challenges of distributed client learning with the need for seamless adaptation to new classes without forgetting old ones. The key challenge in FCCL is catastrophic forgetting, an issue that has been explored to some extent in Continual Learning (CL). However, due to privacy preservation requirements, some conventional methods, such as experience replay, are not directly applicable to FCCL. Existing FCCL methods mitigate forgetting by generating historical data through federated training of GANs or data-free knowledge distillation. However, these approaches often suffer from unstable generator training or low-quality generated data, limiting their guidance for the model. To address this challenge, we propose a novel data replay method based on diffusion models. Instead of training a diffusion model, we employ a pre-trained conditional diffusion model to reverse-engineer each category, searching for its corresponding input conditions within the model's input space, which significantly reduces computational resources and time consumption while ensuring effective generation. Furthermore, we enhance the classifier's domain generalization ability on generated and real data through contrastive learning, indirectly improving the representational capability of generated data for real data. Extensive experiments demonstrate that our method significantly outperforms existing baselines.
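A rough sketch of "searching the input condition" for one old class: a learnable condition is optimized so that images sampled from a frozen conditional diffusion model are recognized as that class by the current classifier. The sampler and classifier below are stubs; only the optimization pattern (optimize the condition, not the generator) is illustrated.

    import torch
    import torch.nn as nn

    classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # current model (stub)
    def frozen_diffusion_sample(cond):       # stub for a frozen conditional diffusion sampler
        return torch.tanh(cond).view(1, 3, 32, 32)

    target_class = 7
    cond = nn.Parameter(torch.randn(1, 3 * 32 * 32) * 0.01)   # condition to be searched
    opt = torch.optim.Adam([cond], lr=1e-2)
    for _ in range(200):
        x_replay = frozen_diffusion_sample(cond)
        logits = classifier(x_replay)
        loss = nn.functional.cross_entropy(logits, torch.tensor([target_class]))
        opt.zero_grad(); loss.backward(); opt.step()
    # x_replay can now be replayed alongside new-class data to mitigate forgetting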

Thu 3 Oct. 6:10 - 6:20 PDT

OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model

Runyi Li · Xuhan SHENG · Weiqi Li · Jian Zhang

Omnidirectional images (ODIs) are commonly used in real-world visual tasks, and high-resolution ODIs help improve the performance of related visual tasks. Most existing super-resolution methods for ODIs use end-to-end learning strategies, resulting in limited realism of the generated images and poor out-of-domain generalization. Image generation methods, exemplified by diffusion models, provide strong priors for visual tasks and have proven effective for image restoration. Leveraging the image priors of the Stable Diffusion (SD) model, we achieve omnidirectional image super-resolution with both fidelity and realness, dubbed OmniSSR. Firstly, we transform the equirectangular projection (ERP) images into tangent projection (TP) images, whose distribution approximates the planar image domain. Then, we use SD to iteratively sample initial high-resolution results. At each denoising iteration, we further correct and update the initial results using the proposed Octadecaplex Tangent Information Interaction (OTII) and Gradient Decomposition (GD) techniques to ensure better consistency. Finally, the TP images are transformed back to obtain the final high-resolution results. Our method is zero-shot, requiring no training or fine-tuning. Experiments on two benchmark datasets demonstrate the effectiveness of the proposed method.
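A high-level sketch of the zero-shot loop described above, with placeholder functions for the ERP-to-tangent projections, the Stable Diffusion denoising step, and the consistency correction (standing in for OTII/GD); the real operators are considerably more involved.

    import torch

    def erp_to_tangent(erp):   return erp            # placeholder projection
    def tangent_to_erp(tp):    return tp             # placeholder inverse projection
    def sd_denoise_step(x, t): return x * 0.99       # placeholder SD update
    def correct(x, lowres):                          # placeholder consistency correction
        residual = avg_pool(x) - lowres
        return x - 0.1 * residual.repeat_interleave(2, -1).repeat_interleave(2, -2)

    avg_pool = torch.nn.AvgPool2d(2)
    lowres_erp = torch.rand(1, 3, 64, 128)           # low-resolution ERP input
    x = torch.rand(1, 3, 128, 256)                   # initial high-resolution estimate
    for t in reversed(range(5)):
        x = sd_denoise_step(erp_to_tangent(x), t)    # sample in the tangent-plane domain
        x = correct(x, lowres_erp)                   # enforce consistency with the input
    sr_erp = tangent_to_erp(x)                       # back to the equirectangular domain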