SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior

[arXiv] | [Project Page]

This repository is an official implementation for:

SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior

Authors: Huan-ang Gao*, Mingju Gao*, Jiaju Li, Wenyi Li, Rong Zhi, Hao Tang, Hao Zhao

(Teaser figure)

Introduction

Semantic image synthesis (SIS) shows promising potential for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and an innovative spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has yielded exceptional results, achieving an FID of 10.53 on Cityscapes and 12.66 on ADE20K.
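To make the idea concrete, the joint prior replaces the standard normal initialization of the sampler with a Gaussian whose moments are indexed by both the semantic class and the spatial position. Below is a minimal PyTorch sketch; the tensor names, shapes, and storage layout are our illustration, not the repository's exact implementation:

import torch

def sample_joint_prior(seg, joint_mean, joint_std):
    # seg:        (H, W) long tensor of semantic class ids at latent resolution
    # joint_mean: (K, C, H, W) per-class, per-position mean of noised training latents
    # joint_std:  (K, C, H, W) matching standard deviation
    h, w = seg.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    mean = joint_mean[seg, :, ys, xs].permute(2, 0, 1)  # (C, H, W)
    std = joint_std[seg, :, ys, xs].permute(2, 0, 1)
    # Start denoising from x_T ~ N(mean, std^2) instead of N(0, I).
    return mean + std * torch.randn_like(mean)

The spatial and categorical priors are the special cases where the statistics depend only on position or only on class, respectively.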

Environment Setup

Use conda to create a new virtual environment. We use torch==2.2.0+cu121.

conda env create -f environment.yaml
conda activate scp-diff

Dataset Preparation

We have prepared dataloaders for Cityscapes, ADE20K, and COCO-Stuff datasets in the directory datasets/.

Cityscapes

You can download the Cityscapes dataset from https://www.cityscapes-dataset.com/.

The file organization of the Cityscapes dataset is as follows:

├── cityscapes
│   ├── leftImg8bit
│   │   ├── train
│   │   ├── val
│   │   ├── test
│   ├── gtFine
│   │   ├── train
│   │   ├── val
│   │   ├── test
...
class CityscapesDataset(Dataset):
    RESOLUTION = (512, 1024)  # (height, width) of generated images
    BASE = '/path/to/your/cityscapes/dataset'  # dataset root directory

    def __init__(self):
        super().__init__()
    ...

You need to set the root directory of the Cityscapes dataset and the resolution of the generated images by modifying BASE and RESOLUTION. We use 512×1024 by default.

ADE20K

You can download the ADE20K dataset from https://groups.csail.mit.edu/vision/datasets/ADE20K/.

The file organization of the ADE20K dataset is as follows:

├── ade20k
│   ├── annotations
│   │   ├── training # Gray-level annotations
│   │   ├── ...
│   ├── images
│   │   ├── training # RGB Images
│   │   ├── ...
...
class ADE20KDataset(Dataset):
    BASE = '/path/to/your/ade20k/dataset'  # dataset root directory
    RESOLUTION = (512, 512)  # (height, width) of generated images

    def __init__(self):
        super().__init__()

You need to set the root directory of the ADE20K dataset and the resolution of the generated images by modifying BASE and RESOLUTION. We use 512×512 by default.

COCO-Stuff Dataset

You can download the COCO-Stuff dataset from https://github.com/nightrome/cocostuff.

The file organization of the COCO-Stuff dataset is as follows:

├── coco-stuff
│   ├── annotations
│   │   ├── train2017 # Gray-scale annotations
│   │   ├── ...
│   ├── images
│   │   ├── train2017 # RGB Images
│   │   ├── ...
...
class CocostuffDataset(Dataset):
    BASE = '/path/to/your/coco-stuff/dataset/'  # dataset root directory
    RESOLUTION = (384, 512)  # (height, width) of generated images

    def __init__(self):
        super().__init__()

You need to set the root directory of the COCO-Stuff dataset and the resolution of the generated images by modifying BASE and RESOLUTION. We use 384×512 by default.

Customize Your Own Dataset

You can add your own dataset by writing a dataloader that follows the examples in datasets/*.py. You need the original images, the gray-scale semantic maps, and the color-coded semantic maps; see the three datasets above for examples. A rough template is sketched below.
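In this sketch, the directory layout, file naming, and returned dictionary keys are placeholders; match them to what the existing loaders in datasets/ actually return:

import os
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class MyDataset(Dataset):
    BASE = '/path/to/your/dataset'   # dataset root directory
    RESOLUTION = (512, 512)          # (height, width) of generated images

    def __init__(self, split='train'):
        super().__init__()
        self.split = split
        self.names = sorted(os.listdir(os.path.join(self.BASE, 'images', split)))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        # RGB image and gray-scale semantic map (per-pixel class ids);
        # a color-coded map can be derived from the ids with a palette.
        image = Image.open(os.path.join(self.BASE, 'images', self.split, name)).convert('RGB')
        label = Image.open(os.path.join(self.BASE, 'annotations', self.split, name))
        h, w = self.RESOLUTION
        image = image.resize((w, h))
        label = label.resize((w, h), Image.NEAREST)  # nearest keeps class ids intact
        return dict(image=np.array(image), label=np.array(label))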

Results

| Method        | Cityscapes mIoU↑ | Cityscapes Acc↑ | Cityscapes FID↓ | ADE20K mIoU↑ | ADE20K Acc↑ | ADE20K FID↓ |
|---------------|------------------|-----------------|-----------------|--------------|--------------|-------------|
| Normal        | 65.14            | 94.14           | 23.35           | 20.73        | 61.14        | 20.58       |
| Class Prior   | 66.86            | 94.54           | 11.63           | 21.86        | 66.63        | 16.56       |
| Spatial Prior | 66.77            | 94.29           | 12.83           | 20.86        | 64.46        | 16.03       |
| Joint Prior   | 67.92            | 94.65           | 10.53           | 25.61        | 71.79        | 12.66       |

We provide download links for our finetuned ControlNets and reduced statistics on our [huggingface] page.

Inference uses the categorical and spatial priors. You can run noise_prior_inference.py:

CUDA_VISIBLE_DEVICES=gpu_ids \
    python3 noise_prior_inference.py --dataset cityscapes/ade20k/coco-stuff --sample_size <inference samples> --diffusion_steps <ddpm steps> --seed 4 --save_dir /path/to/save/infer/results \
    --batch_size <batch size> \
    --resolution <h> <w> --ckpt /path/to/ckpt \
    --stat_dir /path/to/statistics --stat_name <save_name of statistics> --scale <guidance scale>

where --sample_size denotes the total number of generated samples, --diffusion_steps ranges from 0 to 999, and --resolution must match the resolution used when finetuning the ControlNet.

For the classifier-free guidance scale, we recommend setting --scale to 4.0 for Cityscapes and 7.5 for ADE20K and COCO-Stuff.
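For example, a Cityscapes run could look like the following (GPU id, paths, sample size, and statistics name are placeholders):

CUDA_VISIBLE_DEVICES=0 \
    python3 noise_prior_inference.py --dataset cityscapes --sample_size 500 \
    --diffusion_steps 999 --seed 4 --save_dir ./results/cityscapes \
    --batch_size 4 \
    --resolution 512 1024 --ckpt ./ckpts/cityscapes.ckpt \
    --stat_dir ./statistics --stat_name cityscapes_10000 --scale 4.0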

You can refer to noise_prior_inference.py for more details.

Finetuning ControlNets

Run tutorial_train.py for training:

python tutorial_train.py --batch_size <batch_size> --dataset <your dataset> --default_root_dir <log_root_dir> --gpus gpu_ids --resume_path <resume file path>

Reducing Statistics

Before inference, we gather and reduce the dataset's categorical and spatial statistics, which define the noise priors.

Run noise_prior.py:

CUDA_VISIBLE_DEVICES=gpu_ids \
    python3 noise_prior.py --dataset ade20k --sample_size 10000 --save_name ade20k_10000 \
    --ckpt /path/to/ckpt --resolution 512 512 --save_dir ./statistics

You can refer to noise_prior.py for more details.
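Conceptually, the script reduces per-position (spatial) and per-class (categorical) moments of the latents over the training set. A simplified sketch, with tensor shapes and names of our choosing rather than the script's actual internals:

import torch

def reduce_statistics(latents, segs, num_classes):
    # latents: (N, C, H, W) latent codes of N training images
    # segs:    (N, H, W) semantic class ids at latent resolution

    # Spatial prior: mean/std over the dataset axis at each position.
    spatial_mean = latents.mean(dim=0)   # (C, H, W)
    spatial_std = latents.std(dim=0)     # (C, H, W)

    # Categorical prior: mean/std over all pixels belonging to each class.
    flat = latents.permute(0, 2, 3, 1).reshape(-1, latents.shape[1])  # (N*H*W, C)
    ids = segs.reshape(-1)
    class_mean, class_std = [], []
    for k in range(num_classes):
        sel = flat[ids == k]
        if sel.shape[0] < 2:  # class absent (or near-absent): fall back to N(0, I)
            class_mean.append(torch.zeros(flat.shape[1]))
            class_std.append(torch.ones(flat.shape[1]))
        else:
            class_mean.append(sel.mean(dim=0))
            class_std.append(sel.std(dim=0))
    return spatial_mean, spatial_std, torch.stack(class_mean), torch.stack(class_std)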

Citation

If you find this work useful for your research, please cite our paper:

@article{gao2024scp,
  title={SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior},
  author={Gao, Huan-ang and Gao, Mingju and Li, Jiaju and Li, Wenyi and Zhi, Rong and Tang, Hao and Zhao, Hao},
  journal={arXiv preprint arXiv:2403.09638},
  year={2024}
}

Acknowledgement

We build our codebase on ControlNet, a neural network structure to control diffusion models by adding extra conditions.
