Adds support for 4bit (nf4) and 8bit bitsandbytes quantization (3/3) #151
base: main
Conversation
Prevents slow CPU initialization of model weights on load by using accelerate `init_empty_weights`. Fully compatible with `from_pretrained`, since the weights are always overwritten by the state_dict. Fixes VectorSpaceLab#72
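For illustration, here is a minimal, self-contained sketch of the `init_empty_weights` pattern described above (not the PR's actual code; the `nn.Linear` is a stand-in for the OmniGen transformer):

```python
# Minimal sketch of the init_empty_weights pattern (not the PR's code).
# The nn.Linear below stands in for the OmniGen transformer.
import torch
import torch.nn as nn
from accelerate import init_empty_weights

with init_empty_weights():
    model = nn.Linear(4096, 4096)   # parameters created on the "meta" device, no CPU allocation

assert model.weight.device.type == "meta"

# Real checkpoint weights later replace the meta tensors, so no time is wasted
# on random CPU initialization that would be thrown away anyway.
state_dict = {"weight": torch.randn(4096, 4096), "bias": torch.zeros(4096)}
model.load_state_dict(state_dict, assign=True)
```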
I find that removing …
@Rypo
@staoxiao Absolutely, happy to lend a hand! I dug a little deeper into the …
I assume that isn't the behavior you're after here, but at worst it would probably just cause a small performance hit. Not exactly sure why …
As for the …
You'll also need to take care to always synchronize before accessing any tensors moved via …
@Pevernow I uploaded the weights to the hub (4bit, 8bit). Note: these links may change depending on the specifics of the final implementation; I'll update this message if so. Update: they now work out of the box with this PR. See the updated "Usage" section above.
Add a quantization utility for HFQuantizers. Modify pipelines to accept a quantization_config. Sets the groundwork for allowing a bf16 VAE. Update requirements to include bitsandbytes. Closes VectorSpaceLab#45, closes VectorSpaceLab#64
Force-pushed from accf137 to 8ea2d6d
Looking forward to the integration of quantized weights. Thank you!
…ion for bnb quant dict
…ocessing, skip quant norm layers
…rom_pretrained Removes quantization_config from the main pipeline. Instead, use Diffusers-style syntax where the config is passed to the transformer (model), which is then passed to the pipeline.
Alright, I think it's in an acceptable state at this point, barring any glaring issues I missed.
New Changes Recap
Final Remarks
If anyone finds an issue, let me know; otherwise, enjoy!
Update: I couldn't quite call it a wrap after all. Colab was painfully slow -> …
Appreciate your efforts.
Adds a small utility to the scheduler to find the minimum clip bound that prevents NaNs from popping out of the decoder layers. Searches over a hardcoded buffer to discard as little information as possible. Phi3Transformer now raises an OverflowError when NaNs are encountered. Initializes the model dtype based on an actual weight value to avoid bad casts when quantized.
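As a rough illustration of the NaN check described in this commit (a hedged sketch; the function name and placement are assumptions, not the PR's exact code):

```python
# Sketch only: raise OverflowError if a decoder layer emits NaNs (e.g. from
# fp16 overflow), so the caller can retry with a tighter clip bound.
import torch

def check_overflow(hidden_states: torch.Tensor, layer_idx: int) -> torch.Tensor:
    if torch.isnan(hidden_states).any():
        raise OverflowError(f"NaNs detected in output of decoder layer {layer_idx}")
    return hidden_states
```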
Update: Some good news for the GPU poor! My most recent commit (d75af76) appears to have patched the float16 issue #108. It turns out the decoder layers were operating on values outside the bounds of what fp16 can handle, causing numerical overflow. Luckily, clipping the values into an operable range doesn't seem to degrade the quality too much.
Why does float16 matter?
The Goods
I uploaded fp16-compatible 4-bit weights to the hub: gryan/OmniGen-v1-fp16-bnb-4bit. With these weights, you can comfortably run double 1024x1024 images on free-tier Colab, without model offloading.

    model = OmniGen.from_pretrained('gryan/OmniGen-v1-fp16-bnb-4bit', dtype=torch.float16)
    pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1", model=model)

For maximum comfort
For maximum speed
It's still not fast, but it's a decent step up from the 25-60 minutes a double 768x768 took previously. Enjoy!
Starts the search with the minimal clipping value found through testing (2^16 - 3*32). This value was sufficient for all tested inputs, but further analysis is still required to guarantee it will always be enough.
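To make the idea concrete, here is a hedged sketch of the clip-bound search under stated assumptions (the retry loop, step size, and function names are illustrative, not the PR's implementation):

```python
# Sketch only: start from the empirically safe bound (2**16 - 3*32 = 65440,
# just under float16's max of 65504) and clip harder only when NaNs appear,
# so as little information as possible is discarded.
import torch

FP16_START_BOUND = 2**16 - 3 * 32   # 65440, value from the commit message
STEP = 32                            # hypothetical back-off step

def run_with_clipping(forward_fn, x: torch.Tensor, max_tries: int = 8) -> torch.Tensor:
    bound = FP16_START_BOUND
    for _ in range(max_tries):
        try:
            return forward_fn(torch.clamp(x, -bound, bound))
        except OverflowError:        # raised when NaNs are detected downstream
            bound -= STEP            # clip a little harder and retry
    raise RuntimeError("no safe clip bound found")
```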
Changes
- Adds a `quantization_config` arg to `from_pretrained` for `OmniGen`. Adds `model`, `vae`, `processor` kwargs to `OmniGenPipeline`. `OmniGen` expects a `transformers.BitsAndBytesConfig`, which can then be passed to `OmniGenPipeline`.
- Adds `bitsandbytes==0.44.1` to requirements.
Usage
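The original usage snippet did not survive this capture; the sketch below reconstructs the intended flow from the Changes list above. Treat the exact arguments as assumptions rather than the PR's final API:

```python
# Reconstructed sketch (argument layout inferred from the Changes list above).
import torch
from transformers import BitsAndBytesConfig
from OmniGen import OmniGen, OmniGenPipeline

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

model = OmniGen.from_pretrained("Shitao/OmniGen-v1", quantization_config=quant_config)
pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1", model=model)
```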
or to use pre-quantized weights:
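Again a reconstruction, using the fp16-compatible 4-bit repo mentioned in the thread above (the bfloat16 4-bit/8-bit repos linked earlier load the same way); the generation call mirrors the project's README-style interface and its settings are illustrative:

```python
# Reconstructed sketch for pre-quantized weights (repo name from the author's
# comment in the thread; generation settings are illustrative).
import torch
from OmniGen import OmniGen, OmniGenPipeline

model = OmniGen.from_pretrained("gryan/OmniGen-v1-fp16-bnb-4bit", dtype=torch.float16)
pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1", model=model)

images = pipe(prompt="a photo of a corgi surfing a wave", height=1024, width=1024, guidance_scale=2.5)
```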
Important
If you are using free-tier Google Colab or have an older GPU, use the float16 weights. You'll go OOM or get errors with the default bfloat16 weights.
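One hedged way to follow this advice programmatically (not part of the PR; the bf16 repo placeholder refers to the 4-bit links earlier in the thread):

```python
# Sketch: choose weights/dtype based on bfloat16 support. The fp16 repo name is
# from the thread above; the bf16 4-bit repo is linked earlier (placeholder here).
import torch

BF16_4BIT_REPO = "<bf16 4-bit repo linked above>"   # placeholder, not a real repo id
FP16_4BIT_REPO = "gryan/OmniGen-v1-fp16-bnb-4bit"

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
repo = BF16_4BIT_REPO if use_bf16 else FP16_4BIT_REPO
dtype = torch.bfloat16 if use_bf16 else torch.float16
```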
For use with `app.py`, you can pass the CLI arg `--nbits` or `-b`. Ideally, this would be a Gradio radio button component or something, but that's a task for another day.
Results
Following a similar format to the "Different inference settings" table.
For the 4-bit (nf4) quantized model on an RTX 3090 GPU (24G):
Testing setup
Memory figures are from `max_memory_allocated()`; `max_memory_reserved()` was typically ~3GB higher.
Image Comparisons
8-bit
I didn't spend much time testing 8-bit, but without cache offloading it goes OOM. Here are a couple of samples otherwise.
For the bnb 8-bit quantized model on an RTX 3090 GPU (24G):
Images
Same prompts + settings as above, all with 8-bit quantization.
Additional Considerations
- `bnb_4bit_quant_type='fp4'` - same vRAM, same timings, worse quality images.
- `bnb_4bit_compute_dtype=torch.bfloat16` - very poor quality images.

This is the third of 3 PRs I'm issuing to improve performance/fix errors. I've tried to keep each incremental change as small in scope as possible. PRs: 1. #149, 2. #150, 3. This
Update (2024-12-02):
Update (2024-12-05):
Update (2024-12-12):