Layer by Layer Sequential GPTQ Updates #47
Merged
Conversation
…-compressor into sequential_gptq_updates
bfineran approved these changes Aug 5, 2024
…-compressor into sequential_gptq_updates
Wow! This is an awesome result!
abhinavnmagic approved these changes Aug 12, 2024
The proposed changes worked well for 70b and 450b scale models.
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024
* group size
* add logic in base observer
* Compressed lifecycle implementation (INT8 only)
* group size full lifecycle run
* Apply suggestions from code review
* before vectorize the for loop
* comments, todo add channelwise
* chan wise impl
* comments
* fix channel wise
* comments, validators
* fix typo
* small fixes for runtime
* add classes
* tensor return error fix
* WIP
* moving around classes
* fix sparseml-side of code and add per channel
* pyndatic defaults
* token wise quant
* Update src/compressed_tensors/quantization/quant_args.py Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
* comments'
* code complete
* tests passing
* unit test bugs
* fill out int decompression
* docstrings
* allow repeat frozens
* update dim
* int compressor unit tests
* move helper
* shape consistency
* initial commit
* first unit test passing
* Update src/compressed_tensors/quantization/lifecycle/forward.py Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
* comments
* tests passing
* one more test
* cleanup
* pass test_quant_args
* Quantization Compressor Support (vllm-project#45)
* add classes
* WIP
* moving around classes
* code complete
* tests passing
* unit test bugs
* fill out int decompression
* docstrings
* allow repeat frozens
* int compressor unit tests
* PR comments
* fix device issue
* fixing leaf checker
* updating tests
* docstrings
* updating examples
* update examples
* fix channelwise
* new tests, some fail
* WIP
* new helper fn
* actually just a warning
* group size speedups + fixes
* group compression
* fix output type on decompress
* fix channelwise
* revert
* more tests
* move tests
* example notebook
* add example notebook
* update README
* cleanup
---------
Co-authored-by: George Ohashi <george@neuralmagic.com>
Co-authored-by: Benjamin <ben@neuralmagic.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
SUMMARY:
Our previous sequential implementation of GPTQ ran calibration forward passes over the whole model when compressing each layer. We now instead calibrate (and compress) one transformer layer at a time. This requires us to cache the intermediate outputs between each layer, which equates to `hidden_size * calibration_samples * max_calibration_sequence_length` values. These intermediate outputs are stored on CPU and moved to GPU one by one during calibration, so there is no extra GPU memory required.

* `EarlyStopException` for capturing the intermediate output of the model at the start of the decoder layers
* In `GPTQModifier`, we run all the calibration data through one layer at a time (implemented in `LayerCompressor.calibrate_layer()`), caching the intermediate outputs (see the sketch below)
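To make the flow concrete, here is a minimal sketch of the layer-by-layer caching loop. It illustrates the idea only and is not the actual `LayerCompressor`/`GPTQModifier` code; `layers`, `calib_inputs`, and `quantize_layer` are hypothetical names, and real decoder layers would also need attention masks and position ids.

```python
import torch

def sequential_calibrate(layers, calib_inputs, quantize_layer, device="cuda"):
    """Sketch: calibrate and compress one layer at a time.

    `calib_inputs` holds the hidden states captured at the input of the first
    decoder layer (e.g. via an early-stop hook), one tensor per calibration sample.
    """
    # Keep every intermediate activation on CPU between layers.
    cached = [x.cpu() for x in calib_inputs]

    for layer in layers:
        layer.to(device)  # only one layer occupies the GPU at a time

        outputs = []
        for hidden in cached:
            with torch.no_grad():
                # Move a single sample to GPU, run just this layer, move the result back.
                out = layer(hidden.to(device))
            outputs.append(out.cpu())

        quantize_layer(layer)  # hypothetical hook: apply GPTQ to this layer
        layer.to("cpu")        # free GPU memory before the next layer
        cached = outputs       # this layer's outputs feed the next layer
    return cached
```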
This update supports model offloading for sequential runs. It also has multi-GPU support: set with either `device_map="auto"` or with `calculate_offload_device_map(num_gpus=....)`. The latter option is recommended as it takes into account the memory required to store the Hessians and quantization information when assigning devices and CPU offloading.
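As a usage illustration of the recommended option, loading a large model for a multi-GPU sequential run might look like the sketch below. The import path and keyword arguments of `calculate_offload_device_map` (for example `reserve_for_hessians`) are assumptions here; the maintained version is in the repo's big-model examples.

```python
import torch
from llmcompressor.transformers import SparseAutoModelForCausalLM
# Import path is an assumption; check the big-model examples in this repo.
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder model

# Plan the device map so that GPTQ Hessians and quantization state fit in memory.
# Keyword names below are assumptions based on the description above.
device_map = calculate_offload_device_map(
    MODEL_ID, reserve_for_hessians=True, num_gpus=2, torch_dtype=torch.bfloat16
)

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16
)
```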
TEST PLAN:
Updated the w8a8 big model example to use the new sequential flow, tested with 8b and 70b.
Sequential now takes 21min for 8b W8A8 on an A100 (for reference, non-sequential was 19min). vLLM eval results on gsm8k for the 8b example are equivalent to the non-sequential run.
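For context, the updated example still drives compression through the same user-facing `oneshot` call; a condensed sketch is shown below. The dataset name and calibration parameters are placeholders rather than the exact values from the updated w8a8 example.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# W8A8 GPTQ recipe; the sequential layer-by-layer calibration in this PR happens
# under the hood, so the user-facing API does not change.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

oneshot(
    model=model,                  # model loaded as in the snippet above
    dataset="open_platypus",      # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,          # placeholder calibration settings
    num_calibration_samples=512,
)
```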
NOTE: requires neuralmagic/compressed-tensors#120 be merged first (DONE)