
[REQUEST] Is it possible and a lot of trouble to support flux? #631

Open
3 tasks done
Ph0rk0z opened this issue Sep 22, 2024 · 4 comments

@Ph0rk0z

Ph0rk0z commented Sep 22, 2024

Problem

Flux is a transformer-based image model. It's rather large and fills a whole 24 GB card. People have made GGUF, bitsandbytes, and NF4 loaders for ComfyUI, all of which reuse those LLM quantizations, seemingly with little modification. I recently found a Marlin implementation too: https://github.com/MinusZoneAI/ComfyUI-Flux1Quantize-MZ
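For context, those ComfyUI loaders mostly boil down to walking the Flux transformer and swapping its nn.Linear layers for quantized replacements. A minimal sketch of that pattern, assuming bitsandbytes NF4 (not the actual code from any of those repos):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

def swap_linear_for_nf4(module: nn.Module) -> nn.Module:
    """Recursively replace nn.Linear layers with bitsandbytes NF4 layers."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            qlinear = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=torch.bfloat16,
                quant_type="nf4",
            )
            # Wrap the original weight; the actual 4-bit quantization happens
            # when the module is moved to CUDA.
            qlinear.weight = bnb.nn.Params4bit(
                child.weight.data, requires_grad=False, quant_type="nf4"
            )
            if child.bias is not None:
                qlinear.bias = child.bias
            setattr(module, name, qlinear)
        else:
            swap_linear_for_nf4(child)
    return module

# Usage (hypothetical): flux_transformer = swap_linear_for_nf4(flux_transformer).to("cuda")
```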

Solution

Even though it's not an LLM, the model has been shoehorned into several LLM-only quant formats. The ComfyUI nodes that load them don't seem very complicated, but I'm not familiar enough with the codebase to know whether this is a big ask architecture-wise, or even something you're interested in.

Alternatives

The other quants leave much to be desired: they either quantize too aggressively or don't perform very well. GGUF is slower than native torch FP8 quantization. Using EXL2 has been suggested before, but nobody has actually asked for it.

Explanation

It would make Flux fast, and it's something new.

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@Downtown-Case
Contributor

+1

To add to this, it could potentially make exllama and tabbyAPI the "production" backend for Flux, right? There's no vLLM analogue for Flux.

@Downtown-Case
Contributor

Downtown-Case commented Sep 23, 2024

...But another thing to note is that half of Flux's VRAM usage is the T5 encoder, which also quantizes fairly poorly, and I think that alone would be a large endeavor for exllama to support. Most backends just swap it in and out of VRAM, and supporting easy swapping in exllama may also be tricky.
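For reference, the swap-in/swap-out approach most backends take looks roughly like this (a minimal sketch using plain transformers/torch, nothing exllama-specific):

```python
import torch
from transformers import T5EncoderModel, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl", torch_dtype=torch.bfloat16)

def encode_prompt(prompt: str) -> torch.Tensor:
    encoder.to("cuda")  # swap the text encoder in
    tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        embeds = encoder(**tokens).last_hidden_state
    encoder.to("cpu")   # swap it back out so the transformer gets the VRAM
    torch.cuda.empty_cache()
    return embeds
```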

@turboderp
Owner

Everything is possible, but it's definitely "a lot of trouble", yes. It would be a completely different pipeline and very much outside of what ExLlama currently does, which is language modeling.

You could possibly do something with the transformer component, but even then it'd be a different quantization objective than next-token prediction, so this would really make more sense as a standalone project.
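To illustrate the difference in objective: EXL2 calibrates against next-token prediction on a text corpus, whereas an image transformer would more naturally be calibrated against something like per-layer reconstruction error on captured activations. A hypothetical sketch of the latter:

```python
import torch

def layer_quant_error(layer_fp: torch.nn.Module,
                      layer_q: torch.nn.Module,
                      calib_acts: torch.Tensor) -> float:
    """MSE between full-precision and quantized layer outputs on calibration activations."""
    with torch.no_grad():
        return torch.mean((layer_fp(calib_acts) - layer_q(calib_acts)) ** 2).item()
```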

@Ph0rk0z
Author

Ph0rk0z commented Sep 23, 2024

The AWQ author used the Marlin kernel's matmul: https://github.com/MinusZoneAI/ComfyUI-Flux1Quantize-MZ/blob/main/q_awq_marlin_loader.py It would be a separate project in the sense that it would be a ComfyUI node, not something in tabby or exui. The latter would be a huge ask.

I think the first hurdle is how to convert the model into the format in the first place, since the calibration dataset is text-based. The model is basically all transformer layers, though.
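One hypothetical way around the text-only calibration set would be to capture the transformer's activations during a few ordinary denoising runs and calibrate against those instead of a text corpus. A sketch, where `flux_transformer` is a stand-in for the loaded Flux transformer module (not an exllamav2 or diffusers API guarantee):

```python
import torch

calibration_inputs = []

def capture_hook(module, args, kwargs, output):
    # Blocks may receive the hidden states positionally or as a keyword argument.
    hidden = args[0] if args else kwargs.get("hidden_states")
    if torch.is_tensor(hidden):
        calibration_inputs.append(hidden.detach().to("cpu"))

# flux_transformer is assumed to expose its blocks as `transformer_blocks`.
handles = [
    block.register_forward_hook(capture_hook, with_kwargs=True)
    for block in flux_transformer.transformer_blocks
]

# ... run a handful of ordinary image generations here ...

for h in handles:
    h.remove()
torch.save(calibration_inputs, "flux_calibration.pt")
```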
