Poster
SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis
Huan-ang Gao · Mingju Gao · Jiaju Li · Wenyi Li · Rong Zhi · Hao Tang · HAO ZHAO
# 294
Semantic image synthesis (SIS) shows promising potential for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its image-level control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the root of these problems as a mismatch between the training-noised data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has yielded exceptional results, achieving an FID of 10.53 on Cityscapes and 12.66 on ADE20K. The code and models will be made publicly available.