
[MOE Quantization] Warn against "undercalibrated" modules #20

Merged · 4 commits merged into main · Jul 11, 2024

Conversation

@dbogunowicz (Contributor) commented on Jul 10, 2024

Note: this branch requires neuralmagic/compressed-tensors#46 to land in compressed-tensors first.

Example Use:

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Sets parameters for the GPTQ algorithm - quantize Linear layer weights to FP8
recipe = GPTQModifier(scheme="FP8", targets="Linear", ignore=["lm_head"])

# Apply the GPTQ algorithm, using the open_platypus dataset for calibration.
oneshot(
    model="Isotonic/TinyMixtral-4x248M-MoE",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="llama-compressed-quickstart",
    overwrite_output_dir=True,
    max_seq_length=128,
    num_calibration_samples=2,
)
...
2024-07-10T21:33:18.273943+0200 | _build_quant_modifier | INFO - Building quantization modifier with args: {'targets': 'Linear', 'scheme': 'FP8', 'ignore': ['lm_head']}
2024-07-10T21:33:18.299325+0200 | _calibrate | INFO - Running QuantizationModifier calibration with 2 samples...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.53s/it]
2024-07-10T21:33:27.368876+0200 | _check_token_distribution | WARNING - The module_name: model.layers.0.block_sparse_moe.experts.1.w1 received less than 20% of calibration batch tokens (35/256 tokens). This may harm the quantization quality.
2024-07-10T21:33:27.369534+0200 | _check_token_distribution | WARNING - The module_name: model.layers.0.block_sparse_moe.experts.1.w2 received less than 20% of calibration batch tokens (35/256 tokens). This may harm the quantization quality.
2024-07-10T21:33:27.369577+0200 | _check_token_distribution | WARNING - The module_name: model.layers.0.block_sparse_moe.experts.1.w3 received less than 20% of calibration batch tokens (35/256 tokens). This may harm the quantization quality.
...
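
For context on the numbers in the warnings: with num_calibration_samples=2 and max_seq_length=128, the calibration batch holds 2 x 128 = 256 tokens, so an expert that only received 35 of them (~14%) falls below the 20% threshold. The snippet below is only a minimal sketch of the idea behind _check_token_distribution, assuming per-module token counts collected from the MoE router during calibration; the function name, arguments, and threshold constant here are illustrative, and the actual implementation lives in this PR's diff.

# Hypothetical sketch, not the PR's actual code: warn when an MoE expert saw
# fewer calibration tokens than a configurable fraction of the batch.
import logging

logger = logging.getLogger(__name__)

CALIBRATION_TOKEN_THRESHOLD = 0.2  # warn below 20% of calibration batch tokens


def check_token_distribution(
    module_token_counts: dict[str, int],
    total_tokens: int,
    threshold: float = CALIBRATION_TOKEN_THRESHOLD,
):
    """
    module_token_counts: module name -> number of calibration tokens routed
        through that module (e.g. collected from the MoE router during calibration).
    total_tokens: total tokens in the calibration batch
        (e.g. num_calibration_samples * max_seq_length = 2 * 128 = 256 above).
    """
    if total_tokens == 0:
        return
    for module_name, token_count in module_token_counts.items():
        if token_count / total_tokens < threshold:
            logger.warning(
                f"The module_name: {module_name} received less than "
                f"{int(threshold * 100)}% of calibration batch tokens "
                f"({token_count}/{total_tokens} tokens). "
                "This may harm the quantization quality."
            )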

@dbogunowicz dbogunowicz changed the title Update base.py [MOE Quantization] Warn against "undercalibrated" modules Jul 10, 2024
@dbogunowicz dbogunowicz requested review from Satrat and bfineran July 10, 2024 19:35
@dbogunowicz dbogunowicz merged commit 7d9c643 into main Jul 11, 2024
8 of 12 checks passed
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024