MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning [WACV 2025]
This repository contains the official PyTorch implementation of MegaFusion: https://arxiv.org/abs/2408.11001/
Project Page
- [2024.11] Code of MegaFusion has been released.
- [2024.10] MegaFusion has been accepted to WACV 2025.
- [2024.9] A new version of the paper has been updated.
- [2024.8] Our pre-print paper is released on arXiv.
- Python >= 3.8 (Recommend to use Anaconda or Miniconda)
- PyTorch >= 2.1.0
- xformers == 0.0.27
- diffusers == 0.31.0
- accelerate == 0.17.1
- transformers == 4.46.3
- triton == 2.3.1
A suitable conda environment named megafusion can be created and activated with:
conda env create -f environment.yaml
conda activate megafusion
Since MegaFusion is designed to extend existing diffusion-based text-to-image models towards higher-resolution generation, we provide official MegaFusion implementations on several representative models, including StableDiffusion, StableDiffusion-XL, StableDiffusion-3, DeepFloyd, ControlNet, and IP-Adapter.
First, please download pre-trained StableDiffusion-1.5 from SDM-1.5.
Then, place all the pre-trained checkpoints into the corresponding location under ./SDM-MegaFusion/ckpt/stable-diffusion-v1-5/.
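For reference, one way to fetch the weights into the expected folder is with huggingface_hub (a minimal sketch; the Hub repository id below is an assumption and should be replaced with wherever you obtain SDM-1.5 from):

```python
# Sketch: download SD-1.5 weights into the folder expected by SDM-MegaFusion.
# The repository id is an assumption; substitute the source you actually use.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="stable-diffusion-v1-5/stable-diffusion-v1-5",  # assumed Hub mirror of SDM-1.5
    local_dir="./SDM-MegaFusion/ckpt/stable-diffusion-v1-5",
)
```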
Run the inference demo with:
cd SDM-MegaFusion
CUDA_VISIBLE_DEVICES=0 accelerate launch inference.py --if_dilation False --if_reschedule False
Taking computational overhead into consideration, we only use SDXL-base and discard the SDXL-refiner in our project.
First, please download pre-trained StableDiffusion-XL from SDXL-base.
Then, place all the pre-trained checkpoints into the corresponding location under ./SDXL-MegaFusion/ckpt/stable-diffusion-xl-base-1.0/.
Note: Since the VAE of this checkpoint does not support float16 inference, we recommend downloading the fixed VAE checkpoint from sdxl-vae-fp16-fix to support generation under float16.
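For context on how the fixed VAE plugs in, a plain diffusers SDXL pipeline can swap it in as follows (a minimal sketch, not our inference.py; the local path of the fixed VAE is an assumption based on the folder layout above):

```python
# Sketch: load SDXL with the fp16-fixed VAE so float16 inference stays numerically stable.
# Local paths are assumptions matching the checkpoint layout described above.
import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

vae = AutoencoderKL.from_pretrained(
    "./SDXL-MegaFusion/ckpt/sdxl-vae-fp16-fix",  # assumed local copy of the fixed VAE
    torch_dtype=torch.float16,
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "./SDXL-MegaFusion/ckpt/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")
```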
Run the inference demo with:
cd SDXL-MegaFusion
CUDA_VISIBLE_DEVICES=0 accelerate launch inference.py --if_dilation False --if_reschedule False
First, please download pre-trained StableDiffusion-3-medium from SD3.
Then, place all the pre-trained checkpoints into the corresponding location under ./SD3-MegaFusion/ckpt/stable-diffusion-3-medium/.
Run the inference demo with:
cd SD3-MegaFusion
CUDA_VISIBLE_DEVICES=0 python inference.py
First, please download pre-trained StableDiffusion-1.5 from SDM-1.5.
Then, place all the pre-trained checkpoints into the corresponding location under ./ControlNet-MegaFusion/ckpt/stable-diffusion-v1-5/.
Moreover, download the pre-trained ControlNet from ControlNet and place it into ./ControlNet-MegaFusion/ckpt/sd-controlnet-canny/.
Run the inference demo with:
cd ControlNet-MegaFusion
CUDA_VISIBLE_DEVICES=0 python inference.py
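For context on how the canny ControlNet checkpoint is consumed, here is a plain diffusers sketch (not our inference.py; the condition-extraction step, prompt, and reference-image path are placeholders, and the local checkpoint paths follow the layout above):

```python
# Sketch: canny-conditioned generation with the checkpoints placed as described above.
# This illustrates standard diffusers usage, not MegaFusion's higher-resolution pipeline.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract a canny edge map from a reference image (path is a placeholder).
edges = cv2.Canny(cv2.imread("reference.jpg"), 100, 200)
condition = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "./ControlNet-MegaFusion/ckpt/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "./ControlNet-MegaFusion/ckpt/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
image = pipe("a photo of a deer in the forest", image=condition).images[0]
```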
First, please download pre-trained StableDiffusion-1.5 from SDM-1.5.
Then, place all the pre-trained checkpoints into the corresponding location under ./IP-Adapter-MegaFusion/ckpt/stable-diffusion-v1-5/.
Moreover, download the pre-trained IP-Adapter from IP-Adapter and place it into ./IP-Adapter-MegaFusion/ckpt/.
Run the inference demo with:
cd IPAdapter-MegaFusion
CUDA_VISIBLE_DEVICES=0 python inference.py
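For reference, diffusers can attach the IP-Adapter weights to a base SD-1.5 pipeline roughly as follows (a sketch only, not our inference.py; the weight file name and the image-encoder layout inside ./IP-Adapter-MegaFusion/ckpt/ are assumptions):

```python
# Sketch: image-prompted generation with IP-Adapter on top of SD-1.5.
# Paths and weight names are assumptions matching the folder layout described above.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./IP-Adapter-MegaFusion/ckpt/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "./IP-Adapter-MegaFusion/ckpt",        # assumed root of the IP-Adapter download
    subfolder="models",
    weight_name="ip-adapter_sd15.bin",     # assumed weight file name
)
pipe.set_ip_adapter_scale(0.6)

reference = Image.open("reference.jpg")    # placeholder image prompt
image = pipe("best quality, high resolution", ip_adapter_image=reference).images[0]
```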
Our main experiments are conducted on the commonly used MS-COCO dataset; you can download it from MS-COCO.
We provide the indices of the subset we used in ./evaluation/MS-COCO_10K.json.
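The exact schema of this JSON is not documented here; assuming it stores a list of COCO image ids, the subset can be resolved against the official annotations roughly like this (a sketch using pycocotools; adjust if the schema differs):

```python
# Sketch: resolve the MS-COCO subset indices to image files and captions.
# Assumes MS-COCO_10K.json holds a list of COCO image ids (an assumption).
import json
from pycocotools.coco import COCO

coco = COCO("annotations/captions_val2014.json")   # path to your COCO captions file
with open("./evaluation/MS-COCO_10K.json") as f:
    image_ids = json.load(f)

for info in coco.loadImgs(image_ids[:5]):
    captions = [a["caption"] for a in coco.loadAnns(coco.getAnnIds(imgIds=info["id"]))]
    print(info["file_name"], captions[0])
```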
We use the widely adopted FID and KID as our main evaluation metrics. In addition, to quantitatively evaluate the semantic correctness of the synthesized results, we also utilize several language-based scores in our work, including CLIP-T, CIDEr, Meteor, and ROUGE.
Run the evaluation demo with:
cd evaluation
CUDA_VISIBLE_DEVICES=0 python evaluation.py
Note: You should first extract the test set, placing the ground-truth images in one folder, the generated images in a second folder, and the synthesized textual captions in a third folder (one .txt file per image).
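Our evaluation.py implements the full protocol; for orientation, FID and KID between the two image folders can be computed with torchmetrics roughly as follows (a minimal sketch, assuming torchmetrics[image] is installed; the folder names are placeholders):

```python
# Sketch: FID / KID between ground-truth and generated image folders.
# This is an illustration with torchmetrics, not our evaluation.py.
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

to_tensor = transforms.Compose([transforms.Resize((299, 299)), transforms.PILToTensor()])

def load_folder(folder):
    # Stack all images in a folder into a uint8 tensor of shape (N, 3, 299, 299).
    return torch.stack([to_tensor(Image.open(p).convert("RGB")) for p in sorted(Path(folder).glob("*"))])

real, fake = load_folder("real_images"), load_folder("generated_images")
fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=100)
for metric in (fid, kid):
    metric.update(real, real=True)
    metric.update(fake, real=False)
print("FID:", fid.compute().item(), "KID:", kid.compute())
```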
In our project, we use a state-of-the-art open-source VLM, MiniGPT-v2, to generate a caption for each synthesized image, in order to further evaluate the semantic correctness of higher-resolution generation.
Please git clone the MiniGPT-v2 repository into ./MiniGPT-v2, download the required checkpoints, and use our script ./evaluation/MiniGPT-v2/caption.py to generate descriptions. Since its environment and dependencies may differ from or conflict with those of MegaFusion, we recommend installing a separate virtual environment for this function.
Run the inference demo with:
cd evaluation/MiniGPT-v2
CUDA_VISIBLE_DEVICES=0 python caption.py
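The generated .txt captions can then be scored with standard language metrics; below is a minimal sketch using Hugging Face's evaluate package, covering METEOR and ROUGE only (it assumes the captions are paired with the original text prompts by file name, which is an assumption about the evaluation setup; folder names are placeholders):

```python
# Sketch: score MiniGPT-v2 captions against the original prompts with METEOR and ROUGE.
# Assumes one caption .txt and one prompt .txt per image, matched by file name.
from pathlib import Path

import evaluate

def read_texts(folder):
    return [p.read_text().strip() for p in sorted(Path(folder).glob("*.txt"))]

captions = read_texts("captions")   # MiniGPT-v2 outputs (placeholder folder name)
prompts = read_texts("prompts")     # original text prompts (placeholder folder name)

meteor = evaluate.load("meteor").compute(predictions=captions, references=prompts)
rouge = evaluate.load("rouge").compute(predictions=captions, references=prompts)
print("METEOR:", meteor["meteor"], "ROUGE-L:", rouge["rougeL"])
```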
- Release Paper
- Code of SDM-MegaFusion
- Code of SDXL-MegaFusion
- Code of SD3-MegaFusion
- Code of IPAdapter-MegaFusion
- Code of ControlNet-MegaFusion
- Image Caption Code of MiniGPT-v2
- Evaluation Code
If you use this code for your research or project, please cite:
@InProceedings{wu2024megafusion,
author = {Wu, Haoning and Shen, Shaocheng and Hu, Qiang and Zhang, Xiaoyun and Zhang, Ya and Wang, Yanfeng},
title = {MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2025},
}
Many thanks to the code bases from diffusers, SimpleSDM, SimpleSDXL, and SimpleSDM-3.
If you have any questions, please feel free to contact haoningwu3639@gmail.com or shenshaocheng@sjtu.edu.cn.