
[MOE Quantization] Warn against "undercalibrated" modules #20

Merged · 4 commits merged into main · Jul 11, 2024

Conversation

@dbogunowicz (Contributor) commented on Jul 10, 2024

Note: this branch requires neuralmagic/compressed-tensors#46 to land in compressed-tensors first.

Example Use:

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Sets parameters for the GPTQ algorithm - quantize Linear layer weights to FP8
recipe = GPTQModifier(scheme="FP8", targets="Linear", ignore=["lm_head"])

# Apply the GPTQ algorithm, using the open_platypus dataset for calibration.
oneshot(
    model="Isotonic/TinyMixtral-4x248M-MoE",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="llama-compressed-quickstart",
    overwrite_output_dir=True,
    max_seq_length=128,
    num_calibration_samples=2,
)
...
2024-07-10T21:33:18.273943+0200 | _build_quant_modifier | INFO - Building quantization modifier with args: {'targets': 'Linear', 'scheme': 'FP8', 'ignore': ['lm_head']}
2024-07-10T21:33:18.299325+0200 | _calibrate | INFO - Running QuantizationModifier calibration with 2 samples...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.53s/it]
2024-07-10T21:33:27.368876+0200 | _check_token_distribution | WARNING - The module_name: model.layers.0.block_sparse_moe.experts.1.w1 received less than 20% of calibration batch tokens (35/256 tokens). This may harm the quantization quality.
2024-07-10T21:33:27.369534+0200 | _check_token_distribution | WARNING - The module_name: model.layers.0.block_sparse_moe.experts.1.w2 received less than 20% of calibration batch tokens (35/256 tokens). This may harm the quantization quality.
2024-07-10T21:33:27.369577+0200 | _check_token_distribution | WARNING - The module_name: model.layers.0.block_sparse_moe.experts.1.w3 received less than 20% of calibration batch tokens (35/256 tokens). This may harm the quantization quality.
...
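
For context on the numbers in the warnings: with num_calibration_samples=2 and max_seq_length=128, the calibration batch holds 2 x 128 = 256 tokens, so an expert that only received 35 of them (~14%) falls below the 20% threshold. The snippet below is only a minimal sketch of the idea behind _check_token_distribution, assuming per-module token counts collected from the MoE router during calibration; the function name, arguments, and threshold constant here are illustrative, and the actual implementation lives in this PR's diff.

# Hypothetical sketch, not the PR's actual code: warn when an MoE expert saw
# fewer calibration tokens than a configurable fraction of the batch.
import logging

logger = logging.getLogger(__name__)

CALIBRATION_TOKEN_THRESHOLD = 0.2  # warn below 20% of calibration batch tokens


def check_token_distribution(
    module_token_counts: dict[str, int],
    total_tokens: int,
    threshold: float = CALIBRATION_TOKEN_THRESHOLD,
):
    """
    module_token_counts: module name -> number of calibration tokens routed
        through that module (e.g. collected from the MoE router during calibration).
    total_tokens: total tokens in the calibration batch
        (e.g. num_calibration_samples * max_seq_length = 2 * 128 = 256 above).
    """
    if total_tokens == 0:
        return
    for module_name, token_count in module_token_counts.items():
        if token_count / total_tokens < threshold:
            logger.warning(
                f"The module_name: {module_name} received less than "
                f"{int(threshold * 100)}% of calibration batch tokens "
                f"({token_count}/{total_tokens} tokens). "
                "This may harm the quantization quality."
            )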

@dbogunowicz dbogunowicz changed the title Update base.py [MOE Quantization] Warn against "undercalibrated" modules Jul 10, 2024
@dbogunowicz dbogunowicz requested review from Satrat and bfineran July 10, 2024 19:35
@dbogunowicz dbogunowicz merged commit 7d9c643 into main Jul 11, 2024
8 of 12 checks passed
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024