This repository contains the code for the paper Diffusion-based Visual Anagram as Multi-task Learning.
Authors: Zhiyuan Xu*, Yinhe Chen*, Huan-ang Gao, Weiyan Zhao, Guiyu Zhang, Hao Zhao†
Institute for AI Industry Research (AIR), Tsinghua University
Visual anagrams are images that change appearance upon transformation, such as flipping or rotation. With the advent of diffusion models, such optical illusions can be generated by averaging noise across multiple views during the reverse denoising process. However, we observe two critical failure modes in this approach: (i) concept segregation, where the concepts in different views are generated independently and the result cannot be considered a true anagram, and (ii) concept domination, where certain concepts overpower others. In this work, we cast visual anagram generation as a multi-task learning problem, where different viewpoint prompts are analogous to different tasks, and derive denoising trajectories that align well across all tasks simultaneously. At the core of our framework are two newly introduced techniques: (i) an anti-segregation optimization strategy that promotes overlap between the cross-attention maps of different concepts, and (ii) a noise vector balancing method that adaptively adjusts the influence of different tasks. Additionally, we observe that directly averaging noise predictions yields suboptimal performance because statistical properties may not be preserved, prompting us to derive a noise variance rectification method. Extensive qualitative and quantitative experiments demonstrate our method's superior ability to generate visual anagrams spanning diverse concepts.
Algorithm overview (figure): at each denoising step, the intermediate image is denoised under each viewpoint prompt, and the resulting noise predictions are balanced and combined.
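As a quick illustration of the variance issue mentioned in the abstract (a minimal numpy sketch of the motivation, not the exact rectification rule used in the paper): averaging K independent noise predictions shrinks their standard deviation by a factor of sqrt(K), so the combined noise no longer has the statistics the diffusion sampler expects unless it is rescaled.

import numpy as np

# Simulate K independent unit-variance "noise predictions", one per view.
K, n = 4, 1_000_000
rng = np.random.default_rng(0)
eps = rng.standard_normal((K, n))

avg = eps.mean(axis=0)          # naive averaging across views
rescaled = avg * np.sqrt(K)     # rescale to restore unit variance

print(avg.std())       # ~0.5 = 1/sqrt(K): statistics are distorted
print(rescaled.std())  # ~1.0: unit variance restored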
The code has been tested on Ubuntu 20.04 with CUDA 12.2 and Python 3.9. A single RTX 3090 GPU with 24GB of memory is typically sufficient to run it. For systems with GPUs that have less memory, consider setting generate_1024 to False in conti_runner.py to reduce memory requirements.
conda env create -f environment.yml
conda activate anagram-mtl
Our method utilizes DeepFloyd as the backbone diffusion model. Follow the instructions on their Hugging Face page to download the pretrained weights. Please note that a Hugging Face account is required.
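For reference, below is a minimal sketch of authenticating with Hugging Face and caching the publicly released DeepFloyd IF checkpoints with diffusers; the token value is a placeholder, and the actual loading code in this repository may differ.

import torch
from huggingface_hub import login
from diffusers import DiffusionPipeline

# Log in with a token from an account that has accepted the DeepFloyd license.
login(token="hf_...")  # alternatively, run `huggingface-cli login` once in a shell

# Download and cache the stage-I (64x64) and stage-II (256x256) pipelines.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16
)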
Our proposed approach also uses a noise-aware CLIP model released by OpenAI. Open clip_guided.ipynb and execute the second, third, and sixth cells to download the model. After downloading, a folder named glide_model_cache will be created in the root directory of this repository, containing clip_image_enc.pt and clip_text_enc.pt.
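Those notebook cells roughly mirror OpenAI's GLIDE clip_guided example. As a sketch of what they boil down to (assuming the glide-text2im package; the notebook in this repository may cache the weights into glide_model_cache differently):

import torch
from glide_text2im.clip.model_creation import create_clip_model
from glide_text2im.download import load_checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build the noise-aware CLIP model and load the released encoder weights.
clip_model = create_clip_model(device=device)
clip_model.image_encoder.load_state_dict(load_checkpoint("clip/image-enc", device))
clip_model.text_encoder.load_state_dict(load_checkpoint("clip/text-enc", device))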
To generate visual anagrams using our method, please refer to the generate.ipynb notebook. The notebook primarily calls the generate_anagram() function in conti_runner.py, which accepts the following arguments:
- style: The style of the generated anagram.
- prompts: A list of prompts used to generate the anagram.
- views: A list of corresponding views for each prompt.
  - Possible views: identity, rotate_cw, rotate_ccw, flip, etc.
- save_dir: The directory to save the generated anagram.
- device: The device to run the code. Default is cuda.
- seed: The random seed for the sampling process.
- num_inference_steps: The number of inference steps for the diffusion model. Default is 30.
- guidance_scale: The strength of classifier-free guidance. Default is 10.0.
- noise_level: The noise level for the second stage of the diffusion model. Default is 50.

Altering arguments with default values should be done with caution, except for seed.
generate_anagram(style='a charcoal drawing of',
                 prompts=['a leopard', 'a savannah'],
                 views=['identity', 'rotate_cw'],
                 save_dir='output',
                 seed=9)
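If you want to set the optional arguments explicitly, the hypothetical call below spells out the default values listed above (only the second view is changed relative to the example):

generate_anagram(style='a charcoal drawing of',
                 prompts=['a leopard', 'a savannah'],
                 views=['identity', 'flip'],
                 save_dir='output',
                 device='cuda',
                 seed=9,
                 num_inference_steps=30,
                 guidance_scale=10.0,
                 noise_level=50)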
If you find this repository helpful, please consider citing our paper:
@article{xu2024diffusion,
title={Diffusion-based Visual Anagram as Multi-task Learning},
author={Xu, Zhiyuan and Chen, Yinhe and Gao, Huan-ang and Zhao, Weiyan and Zhang, Guiyu and Zhao, Hao},
journal={arXiv preprint arXiv:2412.02693},
year={2024}
}
We would like to thank the developers of DeepFloyd and GLIDE for releasing their pretrained models. We also thank the authors of Visual Anagrams and Prompt-to-Prompt for open-sourcing their codebases, upon which our work is built.