Layer by Layer Sequential GPTQ Updates #47

Merged · 22 commits · Aug 12, 2024

Conversation

Contributor

@Satrat Satrat commented Aug 1, 2024

SUMMARY:
Our previous sequential implementation of GPTQ ran calibration forward passes over the whole model when compressing each layer. We now instead calibrate (and compress) one transformer layer at a time. This requires us to cache the intermediate outputs between each layer, which amounts to hidden_size * calibration_samples * max_calibration_sequence_length elements. These intermediate outputs are stored on CPU and moved to the GPU one at a time during calibration, so no extra GPU memory is required.

  • Added an EarlyStopException for capturing the intermediate output of the model at the start of the decoder layers
  • Instead of running a full forward calibration pass in GPTQModifier, we run all of the calibration data through one layer at a time (implemented in LayerCompressor.calibrate_layer()), caching the intermediate outputs; see the sketch below
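
A simplified sketch of this sequential flow, assuming PyTorch forward pre-hooks. EarlyStopException and calibrate_layer are names taken from this PR, but the signatures and helper logic below are illustrative, not the PR's actual implementation:

```python
import torch

class EarlyStopException(Exception):
    """Raised from a forward pre-hook on the first decoder layer to capture
    its inputs and stop the forward pass before the rest of the model runs."""
    def __init__(self, args, kwargs):
        self.args = args
        self.kwargs = kwargs

def capture_first_layer_inputs(model, first_layer, dataloader, device):
    """Run the calibration set through the model front-end only, caching the
    inputs that reach the first decoder layer on CPU."""
    cached = []

    def _raise(module, args, kwargs):
        raise EarlyStopException(args, kwargs)

    handle = first_layer.register_forward_pre_hook(_raise, with_kwargs=True)
    try:
        for batch in dataloader:
            try:
                model(**{k: v.to(device) for k, v in batch.items()})
            except EarlyStopException as stop:
                # move captured activations to CPU so no extra GPU memory is held
                cached.append(tuple(t.cpu() for t in stop.args if torch.is_tensor(t)))
    finally:
        handle.remove()
    return cached

def calibrate_layer(layer, cached_inputs, device):
    """Run one layer over all cached calibration inputs; its CPU-cached
    outputs become the inputs for the next layer."""
    outputs = []
    with torch.no_grad():
        for args in cached_inputs:
            out = layer(*(t.to(device) for t in args))
            hidden = out[0] if isinstance(out, tuple) else out
            outputs.append((hidden.cpu(),))
    return outputs
```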

This update supports model offloading for sequential runs. It also has multi-GPU support: set it with either device_map="auto" or with calculate_offload_device_map(num_gpus=...). The latter option is recommended, as it accounts for the memory needed to store the hessians and quantization information when assigning devices and CPU offloading.
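
For reference, a minimal sketch of the two setups. calculate_offload_device_map is the helper referenced above, but the import path, the reserve_for_hessians keyword, and the model stub below are assumptions and may not match the example script exactly:

```python
import torch
from llmcompressor.transformers import SparseAutoModelForCausalLM
# NOTE: import path and keyword names are assumptions, not verified against this revision
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"  # hypothetical model stub

# Option 1: let accelerate place layers across GPUs/CPU automatically
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16
)

# Option 2 (recommended): plan the device map while reserving memory for
# the GPTQ hessians and quantization state
device_map = calculate_offload_device_map(
    MODEL_ID, reserve_for_hessians=True, num_gpus=2, torch_dtype=torch.bfloat16
)
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16
)
```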

TEST PLAN:
Updated the w8a8 big model example to use the new sequential flow; tested with 8b and 70b.

python examples/big_model_offloading/big_model_w8a8_calibrate.py

Sequential now takes 21 min for 8b W8A8 on an A100 (for reference, non-sequential took 19 min). vLLM eval results on gsm8k for the 8b example are equivalent to the non-sequential run:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.744|±  |0.0277|
|     |       |strict-match    |     5|exact_match|↑  |0.744|±  |0.0277|
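
The table above is in lm-evaluation-harness output format. An eval along these lines could be reproduced with something like the following; the model path is a placeholder for the compressed 8b checkpoint and the call details are assumptions rather than the exact command used here:

```python
import lm_eval

# 5-shot gsm8k through the vLLM backend of lm-evaluation-harness;
# the pretrained path below is a hypothetical placeholder
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=./Meta-Llama-3-8B-Instruct-W8A8",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```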

NOTE: requires neuralmagic/compressed-tensors#120 to be merged first (DONE)

@Satrat Satrat changed the title [WIP] Layer by Layer Sequential GPTQ Updates Layer by Layer Sequential GPTQ Updates Aug 2, 2024
@Satrat Satrat marked this pull request as ready for review August 2, 2024 21:12
@Satrat Satrat requested a review from dsikka August 5, 2024 20:11
@robertgshaw2-redhat
Collaborator

Wow! This is an awesome result!

Contributor

@abhinavnmagic abhinavnmagic left a comment


The proposed changes worked well for 70b- and 450b-scale models.

@Satrat Satrat merged commit d74c2cf into main Aug 12, 2024
8 of 12 checks passed
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024