Layer by Layer Sequential GPTQ Updates #47
Merged
Conversation
…-compressor into sequential_gptq_updates
bfineran approved these changes Aug 5, 2024
…-compressor into sequential_gptq_updates
Wow! This is an awesome result!
abhinavnmagic approved these changes Aug 12, 2024
The proposed changes worked well for 70b and 450b scale models.
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024
* group size
* add logic in base observer
* Compressed lifecycle implementation (INT8 only)
* group size full lifecycle run
* Apply suggestions from code review
* before vectorize the for loop
* comments, todo add channelwise
* chan wise impl
* comments
* fix channel wise
* comments, validators
* fix typo
* small fixes for runtime
* add classes
* tensor return error fix
* WIP
* moving around classes
* fix sparseml-side of code and add per channel
* pyndatic defaults
* token wise quant
* Update src/compressed_tensors/quantization/quant_args.py Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
* comments'
* code complete
* tests passing
* unit test bugs
* fill out int decompression
* docstrings
* allow repeat frozens
* update dim
* int compressor unit tests
* move helper
* shape consistency
* initial commit
* first unit test passing
* Update src/compressed_tensors/quantization/lifecycle/forward.py Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
* comments
* tests passing
* one more test
* cleanup
* pass test_quant_args
* Quantization Compressor Support (vllm-project#45)
* add classes
* WIP
* moving around classes
* code complete
* tests passing
* unit test bugs
* fill out int decompression
* docstrings
* allow repeat frozens
* int compressor unit tests
* PR comments
* fix device issue
* fixing leaf checker
* updating tests
* docstrings
* updating examples
* update examples
* fix channelwise
* new tests, some fail
* WIP
* new helper fn
* actually just a warning
* group size speedups + fixes
* group compression
* fix output type on decompress
* fix channelwise
* revert
* more tests
* move tests
* example notebook
* add example notebook
* update README
* cleanup
---------
Co-authored-by: George Ohashi <george@neuralmagic.com>
Co-authored-by: Benjamin <ben@neuralmagic.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
SUMMARY:
Our previous sequential implementation of GPTQ ran calibration forward passes over the whole model when compressing each layer. We now instead calibrate (and compress) one transformer layer at a time. This requires us to cache the intermediate outputs between each layer, which equates to `hidden_size * calibration_samples * max_calibration_sequence_length` values. These intermediate outputs are stored on CPU and moved to GPU one by one during calibration, so there is no extra GPU memory required.

* `EarlyStopException` for capturing the intermediate output of the model at the start of the decoder layers
* In `GPTQModifier`, we run all the calibration data through one layer at a time (implemented in `LayerCompressor.calibrate_layer()`), caching the intermediate outputs (see the sketch below)
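To make the flow concrete, here is a minimal sketch of the layer-by-layer caching loop. It illustrates the idea only and is not the actual `LayerCompressor`/`GPTQModifier` code; `layers`, `calib_inputs`, and `quantize_layer` are hypothetical names, and real decoder layers would also need attention masks and position ids.

```python
import torch

def sequential_calibrate(layers, calib_inputs, quantize_layer, device="cuda"):
    """Sketch: calibrate and compress one layer at a time.

    `calib_inputs` holds the hidden states captured at the input of the first
    decoder layer (e.g. via an early-stop hook), one tensor per calibration sample.
    """
    # Keep every intermediate activation on CPU between layers.
    cached = [x.cpu() for x in calib_inputs]

    for layer in layers:
        layer.to(device)  # only one layer occupies the GPU at a time

        outputs = []
        for hidden in cached:
            with torch.no_grad():
                # Move a single sample to GPU, run just this layer, move the result back.
                out = layer(hidden.to(device))
            outputs.append(out.cpu())

        quantize_layer(layer)  # hypothetical hook: apply GPTQ to this layer
        layer.to("cpu")        # free GPU memory before the next layer
        cached = outputs       # this layer's outputs feed the next layer
    return cached
```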
This update supports model offloading for sequential runs. It also has multi-GPU support: set with either `device_map="auto"` or with `calculate_offload_device_map(num_gpus=....)`. The latter option is recommended as it takes into account the memory required to store the Hessians and quantization information when assigning devices and CPU offloading.
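As a usage illustration of the recommended option, loading a large model for a multi-GPU sequential run might look like the sketch below. The import path and keyword arguments of `calculate_offload_device_map` (for example `reserve_for_hessians`) are assumptions here; the maintained version is in the repo's big-model examples.

```python
import torch
from llmcompressor.transformers import SparseAutoModelForCausalLM
# Import path is an assumption; check the big-model examples in this repo.
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder model

# Plan the device map so that GPTQ Hessians and quantization state fit in memory.
# Keyword names below are assumptions based on the description above.
device_map = calculate_offload_device_map(
    MODEL_ID, reserve_for_hessians=True, num_gpus=2, torch_dtype=torch.bfloat16
)

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16
)
```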
TEST PLAN:
Updated the w8a8 big model example to use the new sequential flow, tested with 8b and 70b.
Sequential now takes 21min for 8b W8A8 on an A100 (for reference, non-sequential was 19min). vLLM eval results on gsm8k for the 8b example are equivalent to the non-sequential run.
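For context, the updated example still drives compression through the same user-facing `oneshot` call; a condensed sketch is shown below. The dataset name and calibration parameters are placeholders rather than the exact values from the updated w8a8 example.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# W8A8 GPTQ recipe; the sequential layer-by-layer calibration in this PR happens
# under the hood, so the user-facing API does not change.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

oneshot(
    model=model,                  # model loaded as in the snippet above
    dataset="open_platypus",      # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,          # placeholder calibration settings
    num_calibration_samples=512,
)
```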
NOTE: requires neuralmagic/compressed-tensors#120 be merged first (DONE)