Poster
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
Sun Yanan · Yanchen Liu · Yinhao Tang · Wenjie Pei · Kai Chen
# 94
The field of text-to-image (T2I) generation has made significant progress in recent years, largely thanks to diffusion models. Linguistic control enables effective content creation but falls short in fine-grained control over image generation. This challenge has been addressed to a large extent by incorporating additional user-supplied spatial conditions, such as depth maps and edge maps, into pre-trained T2I models via extra encoding. However, multi-control image synthesis still struggles with input flexibility, with handling the relationships among spatial conditions, and with maintaining compatibility with text inputs. To address these challenges, we propose AnyControl, a controllable image synthesis framework that supports any combination of various forms of control signals. AnyControl develops a novel multi-control encoder that extracts a unified multi-modal embedding from diverse control signals to guide the generation process. We achieve this by employing an alternating multi-control encoding scheme and a multi-control alignment scheme, with learnable queries as a bridge that unites them seamlessly and gradually distills compatible information from the spatial conditions under the guidance of textual prompts. This approach enables a holistic understanding of user inputs and produces harmonious results of high quality and fidelity under versatile control signals, as demonstrated by extensive quantitative and qualitative results.
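The alternating scheme with learnable queries described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `encode`), the single-head attention without learned projections, the residual updates, and the token dimensions are all assumptions made for illustration. It only shows how a fixed set of queries could alternately gather information from any number of spatial conditions and then be aligned with text tokens, yielding an embedding whose size does not depend on how many conditions are supplied.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Single-head scaled dot-product attention, no learned projections (assumption)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)])
    return out

def add(a, b):
    """Element-wise residual sum of two token lists of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def encode(queries, condition_tokens, text_tokens, num_blocks=2):
    """Hypothetical multi-control encoder: alternate encoding and alignment steps."""
    # Flatten the tokens of all spatial conditions into one sequence.
    visual = [tok for cond in condition_tokens for tok in cond]
    for _ in range(num_blocks):
        # Encoding step: queries cross-attend to every condition token.
        queries = add(queries, attend(queries, visual))
        # Alignment step: queries attend over themselves and the text tokens;
        # the text tokens themselves are left unchanged in this sketch.
        queries = add(queries, attend(queries, queries + text_tokens))
    return queries

def rand_tokens(n, d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

random.seed(0)
D = 8
queries = rand_tokens(4, D)   # stand-in for learnable queries
text = rand_tokens(5, D)      # stand-in for text prompt tokens
depth = rand_tokens(6, D)     # stand-in for depth-map tokens
edge = rand_tokens(6, D)      # stand-in for edge-map tokens

out_one = encode(queries, [depth], text)
out_two = encode(queries, [depth, edge], text)
print(len(out_one), len(out_one[0]))
print(len(out_two), len(out_two[0]))
```

With one condition or two, the output keeps the shape of the query set (4 tokens of dimension 8 here), which is the property that lets a single embedding interface serve arbitrary combinations of control signals in this sketch.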