Shaojin Wu¹, Fei Ding¹*, Mengqi Huang¹,², Wei Liu¹, Qian He¹
¹ByteDance Inc. ²University of Science and Technology of China
We propose VMix, a plug-and-play aesthetics adapter that upgrades the quality of generated images while maintaining generality across visual concepts. VMix works by (1) disentangling the input text prompt into a content description and an aesthetic description via the initialization of an aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. VMix outperforms other state-of-the-art methods and is flexible enough to be applied to community modules (e.g., LoRA, ControlNet, and IPAdapter) for better visual performance without retraining.
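Since the official code is not yet released, the snippet below is only a minimal sketch of what value-mixed cross-attention could look like, based on the description above. The class name `ValueMixedCrossAttention`, the extra value projection `to_v_aes`, and the shared-attention-map mixing scheme are our assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- the official VMix code is not yet released.
# Assumes a standard scaled-dot-product cross-attention layer. The aesthetic
# branch mixes an extra value projection of the aesthetic context into the
# output, gated by a zero-initialized linear layer so that training starts
# from the unmodified base model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueMixedCrossAttention(nn.Module):
    def __init__(self, dim: int, ctx_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)      # content values
        self.to_v_aes = nn.Linear(ctx_dim, dim, bias=False)  # aesthetic values (assumed)
        self.to_out = nn.Linear(dim, dim)
        # Zero-initialized projection: the aesthetic branch contributes
        # nothing at the start of training, preserving the pretrained prior.
        self.aes_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.aes_proj.weight)
        nn.init.zeros_(self.aes_proj.bias)

    def forward(self, x, content_ctx, aes_ctx):
        # x: (B, N, dim); content_ctx, aes_ctx: (B, M, ctx_dim).
        # For simplicity this sketch assumes both contexts share length M.
        B, N, _ = x.shape

        def split(t):  # (B, L, dim) -> (B, heads, L, head_dim)
            return t.view(B, t.shape[1], self.heads, -1).transpose(1, 2)

        q = split(self.to_q(x))
        k = split(self.to_k(content_ctx))
        v = split(self.to_v(content_ctx))
        v_aes = split(self.to_v_aes(aes_ctx))

        # One shared attention map, two value streams: the same attention
        # weights gather both content values and aesthetic values.
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        out_aes = (attn @ v_aes).transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out) + self.aes_proj(out_aes)
```

In a setup like this, such a layer would stand in for the UNet's cross-attention blocks, with `aes_ctx` built from the aesthetic embedding; because `aes_proj` starts at zero, the module initially reproduces the base model exactly.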
Qualitative comparison between results with VMix (right) and without VMix (left)
Aesthetic fine-grained control. For more visual results, check out our Project Page.

We will open-source this project as soon as possible. Thank you for your patience and support! 🌟
- Release arXiv paper. Check the details here.
- Release inference code (coming soon).
- Release model checkpoints.
- Release ComfyUI node.
If you find VMix helpful, please ⭐ the repo.
If you find this project useful for your research, please consider citing our paper:
@misc{wu2024vmix,
  title={VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control},
  author={Shaojin Wu and Fei Ding and Mengqi Huang and Wei Liu and Qian He},
  year={2024},
  eprint={2412.20800},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}