Add SD3 Pipeline #329

ZachNagengast · 2024-06-12T15:27:02Z

SD3 on Core ML 🎉

Brought to Apple Silicon by your friends at @argmaxinc

Paper: https://stability.ai/news/stable-diffusion-3-research-paper

What's new:

StableDiffusion3Pipeline
- Main entry point with standard protocol usage
MultiModalDiffusionTransformer (MMDiT)
- The latest and greatest in diffusion technology from Stability AI. It uses a new architecture and several new supporting models
TextEncoderT5
- Optional text encoder for additional prompt understanding
- Includes T5Tokenizer code from huggingface/swift-transformers
DecoderSD3
- This new VAE has 16 channels, up from 4 with previous models
DiscreteFlowScheduler
- A new scheduler that shifts the timestep schedule to achieve better denoising at high resolutions
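
For intuition on the "shifting" used by DiscreteFlowScheduler, here is a minimal Python sketch of resolution-dependent timestep shifting in the form described in the SD3 paper. Whether the Swift scheduler in this PR computes it exactly this way is an assumption; the function names and the shift value of 3.0 are illustrative and not taken from the PR.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Remap a noise level sigma in [0, 1].

    With shift > 1, intermediate sigmas move toward 1.0, so more of the
    sampling steps are spent at high noise levels. shift == 1 is a no-op.
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps: int, shift: float) -> list[float]:
    """Linearly spaced sigmas from 1.0 down to 1/num_steps, then shifted.

    Hypothetical helper for illustration only.
    """
    linear = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in linear]


if __name__ == "__main__":
    print(shifted_schedule(4, 1.0))  # unshifted, linear schedule
    print(shifted_schedule(4, 3.0))  # illustrative shift value
```

The endpoints are fixed (0 maps to 0 and 1 maps to 1), so only the spacing of the intermediate steps changes.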

How to use it:

For the models that didn't change, the existing conversion pipelines should all work as is:

python -m python_coreml_stable_diffusion.torch2coreml --convert-text-encoder --xl-version --model-version stabilityai/stable-diffusion-xl-base-1.0 --bundle-resources-for-swift-cli --attention-implementation ORIGINAL -o <output-dir>

We also created a dedicated repo for the new models, DiffusionKit, which comes with conversion pipelines for the new VAE and MMDiT models.

To install:

git clone https://github.com/argmaxinc/DiffusionKit.git
cd DiffusionKit
pip install -e .

Convert MMDiT:

python -m tests.torch2coreml.test_mmdit --sd3-ckpt-path <path-to-sd3-mmdit.safetensors> --model-version {2b} -o <output-mlpackages-directory> --latent-size {64, 128}

Convert VAE:

python -m tests.torch2coreml.test_vae --sd3-ckpt-path <path-to-sd3-mmdit.safetensors> -o <output-mlpackages-directory> --latent-size {64, 128}
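
The --latent-size values are latent-space sizes, not pixel sizes. As a rough sketch, assuming the SD3 VAE downsamples by 8x in each spatial dimension and that latents use a channels-first layout (both assumptions, not stated in this PR), the two options map to 512 and 1024 pixel images:

```python
VAE_DOWNSAMPLE = 8    # assumption: 8x spatial downsampling in the VAE
LATENT_CHANNELS = 16  # from the description above: the new VAE has 16 channels


def latent_shape(latent_size: int, batch: int = 1) -> tuple[int, int, int, int]:
    """Latent tensor shape, assuming a (batch, channels, height, width) layout."""
    return (batch, LATENT_CHANNELS, latent_size, latent_size)


def image_size(latent_size: int) -> int:
    """Edge length in pixels of the decoded square image."""
    return latent_size * VAE_DOWNSAMPLE


if __name__ == "__main__":
    for size in (64, 128):
        print(size, latent_shape(size), image_size(size))
```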

Finally, place all of the converted models in the same folder, then point the CLI at that folder and pass the new --sd3 flag to test it out:

swift run StableDiffusionSample <prompt> --resource-path <output-mlmodelc-directory/Resources> --output-path <output-dir> --sd3

You should see a new image in your output-dir.

Try it out today via this PR into Hugging Face's excellent swift-coreml-diffusers app (includes pre-converted models and a pipeline usage example).

Co-authored-by: atiorh <atiorh@users.noreply.github.com>
Co-authored-by: arda-argmax <arda-argmax@users.noreply.github.com>

msiracusa · 2024-06-27T04:44:08Z

Thank you for opening this PR and adding support for Stable Diffusion 3!

Two high-level topics I think will be important to cover here are:

- Keeping a unified interface / entry point for model conversion
- Reducing Swift code duplication across the pipelines and model interfaces

Reviewers have been assigned and will provide more detailed feedback.

alejandro-isaza

My main concerns are:

- T5Tokenizer.swift is too long; please break it up.
- Too many new public types; let's try to keep the public interface small.
- Some duplication. I know some of it needs a larger refactor, but there are some easy wins here.

alejandro-isaza · 2024-06-27T15:48:19Z

swift/StableDiffusion/pipeline/DecoderSD3.swift

+    public func decode(
+        _ latents: [MLShapedArray<Float32>],
+        scaleFactor: Float32,
+        shiftFactor: Float32


As far as I can tell, shiftFactor is the only difference between Decoder and DecoderSD3. Let's add the shift to Decoder and default it to 0.

This comment still stands: let's reuse Decoder instead of introducing DecoderSD3.
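
The suggestion above can be sketched as follows, in Python for brevity rather than the PR's Swift. It assumes the decoder un-scales latents as `latent / scaleFactor + shiftFactor` before running the VAE; that formula, the function name, and the numbers are illustrative assumptions rather than code from this PR. With the shift defaulting to 0, existing callers keep their current behavior and only the SD3 pipeline passes a non-zero value.

```python
def unscale_latents(
    latents: list[float],
    scale_factor: float,
    shift_factor: float = 0.0,  # default keeps the pre-SD3 behavior
) -> list[float]:
    """Undo the latent scaling (and optional shift) before VAE decoding.

    Hypothetical stand-in for the preprocessing inside Decoder.decode.
    """
    return [x / scale_factor + shift_factor for x in latents]


if __name__ == "__main__":
    latents = [1.0, -3.0]
    # Existing pipelines: no shift argument needed.
    print(unscale_latents(latents, scale_factor=2.0))
    # SD3-style call: same function, explicit shift.
    print(unscale_latents(latents, scale_factor=2.0, shift_factor=0.5))
```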

alejandro-isaza · 2024-06-27T15:50:31Z