This is yet another Stable Diffusion compilation, aimed to be functional, clean & compact enough for various experiments. There's no GUI here, as the target audience is creative coders rather than post-Photoshop users. The latter may check out InvokeAI or AUTOMATIC1111 as convenient production tools, or Deforum for precisely controlled animations.
The code is based on the CompVis and Stability AI libraries and heavily borrows from this repo, with occasional additions from InvokeAI and Deforum, as well as the others mentioned below. The following codebases are partially included here (to ensure compatibility and ease of setup): k-diffusion, Taming Transformers, OpenCLIP, CLIPseg. There is also a similar repo, based on the diffusers library, which is more logical and up-to-date.
Current functions:
- Text to image
- Image re- and in-painting
- Latent interpolations (with text prompts and images)
Fine-tuning with your images:
- Add subject (new token) with textual inversion
- Add subject (prompt embedding + Unet delta) with custom diffusion
Other features:
- Memory-efficient with `xformers` (hi-res on a 6 GB VRAM GPU)
- Use of special depth/inpainting and v2 models
- Masking with text via CLIPseg
- Weighted multi-prompts
- to be continued..
More details and a Colab version will follow.
Install CUDA 11.6, then set up the Conda environment:
conda create -n SD python=3.10 numpy pillow
conda activate SD
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
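Before going further, it may be worth checking that the CUDA build of PyTorch is actually picked up (plain PyTorch calls, nothing repo-specific):

```python
# quick sanity check: the CUDA build of torch is installed and the GPU is visible
import torch

print(torch.__version__, torch.version.cuda)   # expect a +cu116 build / CUDA 11.6
print(torch.cuda.is_available())               # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))       # your GPU model
```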
Install the `xformers` library to increase performance. It makes it possible to run SD at any resolution on lower-grade hardware (e.g. video cards with 6 GB VRAM). If you're on Windows, first ensure that you have Visual Studio 2019 installed.
pip install git+https://github.com/facebookresearch/xformers.git
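To confirm that the build works on your GPU, here is a minimal check of the memory-efficient attention op (assuming a CUDA device is present):

```python
# run xformers memory-efficient attention on dummy half-precision tensors
import torch
import xformers.ops as xops

q = torch.randn(1, 64, 8, 40, device='cuda', dtype=torch.float16)  # (batch, tokens, heads, head_dim)
out = xops.memory_efficient_attention(q, q, q)                      # self-attention with q = k = v
print(out.shape)                                                     # same shape as q if the kernels work
```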
Download the models with the command below: Stable Diffusion (1.5, 1.5-inpaint, 2-inpaint, 2-depth, 2.1, 2.1-v), OpenCLIP, custom VAE, CLIPseg and MiDaS, mostly converted to float16 for faster loading. Licensing info is available on their webpages.
python download.py
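For reference, the float16 conversion mentioned above is conceptually just halving the precision of the stored weights; a rough sketch of the idea (not the actual download.py code, file names are placeholders):

```python
# illustrative only: shrink a full-precision checkpoint by casting float32 weights to float16
import torch

ckpt = torch.load('model-full.ckpt', map_location='cpu')   # placeholder name
sd = ckpt.get('state_dict', ckpt)                           # some checkpoints wrap weights in 'state_dict'
for k, v in sd.items():
    if isinstance(v, torch.Tensor) and v.dtype == torch.float32:
        sd[k] = v.half()
torch.save(ckpt, 'model-fp16.ckpt')
```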
Examples of usage:
- Generate an image from the text prompt:
python src/_sdrun.py -t "hello world" --size 1024-576
- Redraw an image with an existing style embedding:
python src/_sdrun.py -im _in/something.jpg -t "<line-art>"
- Redraw a directory of images, keeping the basic forms intact:
python src/_sdrun.py -im _in/pix -t "neon light glow" --model v2d
- Inpaint a directory of images with the RunwayML model, turning humans into robots:
python src/_sdrun.py -im _in/pix --mask "human, person" -t "steampunk robot" --model 15i
- Make a video, interpolating between the lines of the text file (see the interpolation sketch below):
python src/latwalk.py -t yourfile.txt --size 1024-576
- Same, with drawing over a masked image:
python src/latwalk.py -t yourfile.txt -im _in/pix/bench2.jpg --mask _in/pix/mask/bench2_mask.jpg
Check other options by running these scripts with the `--help` option; try various models, samplers, noisers, etc.
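The latent walks above typically use spherical interpolation (slerp) between the latent noise tensors rather than plain linear blending, which would shift their statistics; a minimal sketch of the technique (not this repo's exact code):

```python
# spherical interpolation (slerp) between two latent tensors, the usual choice for latent walks
import torch

def slerp(v0, v1, t, eps=1e-7):
    v0f, v1f = v0.flatten(), v1.flatten()
    cos = torch.dot(v0f / v0f.norm(), v1f / v1f.norm()).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

# two random initial latents (4 channels, 64x64 ~ a 512x512 image) and 25 in-between frames
a, b = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
frames = [slerp(a, b, t) for t in torch.linspace(0, 1, 25)]   # each frame then goes through the sampler
```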
Text prompts may include either special tokens (e.g. `<depthmap>`) or weights (like `good prompt :1 | also good prompt :1 | bad prompt :-0.5`). The latter may degrade overall accuracy though.
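Weighted prompts of this kind are commonly implemented by blending the per-prompt conditioning tensors; here is a rough sketch of that idea with dummy tensors standing in for the CLIP text-encoder outputs (the exact weighting and normalization used in this repo may differ):

```python
# illustrative only: combine per-prompt conditionings as a normalized weighted sum
import torch

def combine_conds(conds, weights):
    # conds: list of text-encoder outputs, e.g. (1, 77, 768) each; weights may be negative
    w = torch.tensor(weights, dtype=conds[0].dtype)
    mix = sum(wi * ci for wi, ci in zip(w, conds))
    return mix / w.abs().sum()                     # keep the overall magnitude in a sane range

# stand-ins for "good prompt :1 | also good prompt :1 | bad prompt :-0.5"
conds = [torch.randn(1, 77, 768) for _ in range(3)]
mixed = combine_conds(conds, [1.0, 1.0, -0.5])
```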
Interpolated videos may be further smoothed out with FILM.
There are also Windows bat-files that slightly simplify and automate the commands.
- Train a prompt embedding for a specific subject (e.g. a cat) with textual inversion:
python src/train.py --token mycat1 --term cat --data data/mycat1
- Do the same with custom diffusion:
python src/train.py --token mycat1 --term cat --data data/mycat1 --reg_data data/cat
Results of the trainings above will be saved under the `train` directory.
Custom diffusion trains faster and can achieve impressive reproduction quality with simple prompts similar to the training ones, but it can entirely miss the point if the prompt is too complex or too far from the original category. The resulting file is ~73 MB (it can be compressed to ~16 MB). Note that in this case you'll need both the target reference images (`data/mycat1`) and a more random set of images of similar subjects (`data/cat`). You can generate the latter with SD itself.
Textual inversion is more generic, yet stable. Its embeddings can also be easily combined without additional retraining. The resulting file is ~5 KB.
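If you're curious what such an embedding actually contains, it can be inspected with plain torch.load; the key layout suggested in the comments below follows the classic textual-inversion format and is an assumption (files produced by this repo may be organized differently):

```python
# peek inside a textual inversion embedding file (layout is an assumption, see above)
import torch

emb = torch.load('train/mycat1.pt', map_location='cpu')   # example path/name
if isinstance(emb, torch.Tensor):
    print(tuple(emb.shape))                                # a bare learned vector
elif isinstance(emb, dict):
    print(list(emb.keys()))                                # e.g. 'string_to_param' in the classic format
    for k, v in emb.items():
        if isinstance(v, dict):
            for kk, vv in v.items():
                if isinstance(vv, torch.Tensor):
                    print(k, kk, tuple(vv.shape))          # typically one small (n, 768)-ish tensor
```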
- Generate an image with an embedding from textual inversion. You'll need to rename the embedding file after your trained token (e.g. `mycat1.pt`) and point to its directory. Note that the token is hardcoded in the file, so you can't change it afterwards.
python src/_sdrun.py -t "cosmic <mycat1> beast" --embeds train
- Generate an image with an embedding from custom diffusion. You'll need to explicitly mention your new token (so you can name it differently here) and the path to the trained delta file (a short note on delta files follows below):
python src/_sdrun.py -t "cosmic <mycat1> beast" --token_mod mycat1 --delta_ckpt train/delta-xxx.ckpt
You can also run `python src/latwalk.py ...` with such arguments to make animations.
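A note on the delta file mentioned above: custom diffusion fine-tunes only a small part of the model (roughly, the cross-attention key/value projections plus the new token embedding), so the delta checkpoint is essentially a partial state dict, and applying it amounts to merging those entries back into the full model. Here is a rough sketch of that idea with an assumed key layout and placeholder file names (the repo's own loader handles this for you):

```python
# illustrative only: merge a partial "delta" state dict into a full model state dict
import torch

model_ckpt = torch.load('sd-v15.ckpt', map_location='cpu')        # placeholder name
model_sd = model_ckpt.get('state_dict', model_ckpt)
delta = torch.load('train/delta-xxx.ckpt', map_location='cpu')     # placeholder name
delta_sd = delta.get('state_dict', delta)

replaced = 0
for k, v in delta_sd.items():
    if k in model_sd and isinstance(v, torch.Tensor):
        model_sd[k] = v            # overwrite only the fine-tuned entries (e.g. attn2 to_k / to_v)
        replaced += 1
print(f'replaced {replaced} tensors')
```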
It's quite hard to credit everyone who made the current revolution in visual creativity possible. Check the inline links above for some of the sources. Huge respect to the people behind Stable Diffusion, InvokeAI, Deforum and the whole open-source movement.