[AWQ] Generalize AWQ quantization #1961
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
So long as you feel confident that _compute_layer_means is going to work as expected for all the supported strategies, this looks good to me!
kylesayrs left a comment
Approved from my side.
fynnsu left a comment
Looks good, added a couple comments below!
QuantizationMixin.initialize_quantization(self, state.model)

# Validate that duo_scaling is only used with per-channel quantization
if self.duo_scaling != False:
@kylesayrs added a check for duo_scaling + per-tensor strategy
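For illustration, a minimal sketch of what that kind of guard could look like, assuming a pydantic-style validator; the field and strategy names mirror the diff, everything else (class name, message) is hypothetical:

```python
from typing import Union

from pydantic import BaseModel, model_validator


class AWQSettingsSketch(BaseModel):
    """Toy stand-in for the modifier's config, not the actual AWQModifier."""

    duo_scaling: Union[str, bool] = True
    strategy: str = "group"

    @model_validator(mode="after")
    def validate_duo_scaling(self):
        # duo_scaling relies on per-channel weight statistics, so reject it when
        # the quantization strategy collapses everything into one tensor scale.
        if self.duo_scaling is not False and self.strategy == "tensor":
            raise ValueError(
                "duo_scaling is not supported with the 'tensor' quantization strategy"
            )
        return self
```

With this sketch, `AWQSettingsSketch(duo_scaling="both", strategy="tensor")` would raise at construction time.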
continue

balance_layer.weight.mul_(_scalesview)
call_observer(balance_layer, "weight", balance_layer.weight)
If you pass should_calculate_gparam=(scheme.strategy == "tensor_group"), then I think this should be able to support nvfp4.
will add that
Actually, added a TODO; will do it in another PR.
(Added TODO) Going to move this and the simplified logic to a new PR.
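For reference, a toy, self-contained sketch of the reviewer's suggestion above; `scheme.strategy` and `call_observer` only mirror the names in the diff, and the stub below is not the real observer call, whose full signature isn't shown here:

```python
from dataclasses import dataclass


@dataclass
class Scheme:
    strategy: str


def call_observer(module, base_name, value, should_calculate_gparam=False):
    # Stub standing in for the real observer call.
    print(f"observing {base_name}, gparam={should_calculate_gparam}")


# Only compute the global scale parameter for tensor_group schemes,
# which is what NVFP4-style quantization needs.
scheme = Scheme(strategy="tensor_group")
should_calculate_gparam = scheme.strategy == "tensor_group"
call_observer(None, "weight", None, should_calculate_gparam=should_calculate_gparam)
```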
kylesayrs left a comment
LGTM, approved from my side
sequential_targets: Union[str, List[str], None] = None
mappings: Optional[List[AWQMapping]] = None
offload_device: Optional[torch.device] = None
duo_scaling: str | bool = True
Literal["both"] doesn't work anymore?
# avoid scaling values that overflow
scales[torch.isinf(scales)] = 1
scales[torch.isnan(scales)] = 1
Definitely annoying that we still have to have this line.
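A small standalone illustration of why the guard exists (toy tensors, not the modifier's actual statistics): the scales are derived from ratios of activation and weight statistics, so zero denominators or fp16 overflow produce inf/nan entries that have to be reset to a neutral scale.

```python
import torch

# A zero denominator yields nan/inf scales, which would corrupt the weights
# when multiplied in, so they are reset to the neutral scale of 1.
x_mean = torch.tensor([0.5, 0.0, 2.0])
w_mean = torch.tensor([1.0, 0.0, 0.0])
scales = x_mean / w_mean  # tensor([0.5000, nan, inf])

scales[torch.isinf(scales)] = 1
scales[torch.isnan(scales)] = 1
print(scales)  # tensor([0.5000, 1.0000, 1.0000])
```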
    .item()
)
batch_loss = torch.nn.functional.mse_loss(
    fp16_batch.to(device), int_w_batch.to(device)
This algorithm has a lot of device movement
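One possible mitigation, sketched under the assumption that both batches are reused across loss computations (variable names follow the diff; this is not the PR's implementation):

```python
import torch

# Hoist the transfers: move both batches to the compute device once, then reuse
# the on-device tensors for every loss computation instead of calling .to(device)
# inside the inner loop.
device = "cuda" if torch.cuda.is_available() else "cpu"
fp16_batch = torch.randn(4, 128)   # placeholder data
int_w_batch = torch.randn(4, 128)  # placeholder data

fp16_batch = fp16_batch.to(device, non_blocking=True)
int_w_batch = int_w_batch.to(device, non_blocking=True)

batch_loss = torch.nn.functional.mse_loss(fp16_batch, int_w_batch).item()
```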
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Summary
To allow arbitrary heterogeneous quantization schemes, this PR switches several helpers from AutoAWQ to the observer and QDQ logic. AWQ no longer requires the quantization config to use the same group_size, symmetric, and num_bits settings for every config_group.
Resolves #1657
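For illustration, a hedged sketch of the kind of heterogeneous recipe this enables; the group names, target regexes, and specific bit-width choices below are invented for the example, while the class and field names follow compressed-tensors and llm-compressor:

```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor.modifiers.awq import AWQModifier

# Two config groups with different num_bits / strategy / symmetry settings,
# which previously would have been rejected by AWQ's validation.
recipe = AWQModifier(
    ignore=["lm_head"],
    config_groups={
        "group_0": QuantizationScheme(
            targets=["re:.*self_attn.*"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=128,
                symmetric=True,
            ),
        ),
        "group_1": QuantizationScheme(
            targets=["re:.*mlp.*"],
            weights=QuantizationArgs(
                num_bits=8,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.CHANNEL,
                symmetric=False,
            ),
        ),
    },
)
```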
Prerequisites:
Test plan
- Running `llm-compressor/examples/awq/llama_example.py` with this (with `duo_scaling="both"`) and logging the best configuration of `(ratio, duo_scaling)`, I see a good mix of Falses and Trues, i.e. a good percentage of best_scales were found with `duo_scaling=False` and a good percentage were found with `duo_scaling=True`. Generated model output looks good.
- With `awq_one_shot.py` (pasted below), Wikitext PPL is consistent for w4a16 and w4a16_asym on this branch when compared to main, and better than what was reported in a previous AWQ PR, but those might have been configured differently. For W4A16_ASYM, the results are 13.41 for both main and this branch. This is what we've historically been using to test regressions.
- Running `CADENCE=weekly TEST_DATA_FILE=~/projects/llm-compressor/tests/lmeval/configs/w4a16_awq_sym.yaml pytest -s ~/projects/llm-compressor/tests/lmeval/test_lmeval.py` on this branch causes the test to fail. This persists even when using `pseudo_quantize_tensor` instead of `call_observer`/`forward_quantize`, as shown in this diff. I get the same result in this diff, so at least that means the quantization logic in CT is consistent with AutoAWQ. Output:
This is already a pretty high drop in recovery; should we revisit this test?
Further regression testing against main was done in this commit (see run.sh as of that commit, which was removed in the final PR). Results look reasonable comparing this branch and main: some up, some down, within margin of error.
- Test Group Quantization (w4a16_awq_sym)
- Test Tensor Quantization (int8_tensor)
- Test Channel Quantization (fp8_dynamic)
- Test Block Quantization (fp8_block)
awq_oneshot.py script
```python
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from llmcompressor import oneshot, active_session
from llmcompressor.utils import dispatch_for_generation
from llmcompressor.modifiers.awq import AWQModifier, AWQMapping
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
QuantizationArgs,
QuantizationScheme,
QuantizationStrategy,
QuantizationType,
)
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"
# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(
        ignore=[
            "lm_head",
            "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$",
            "re:visual.*",
        ],
        scheme="W4A16_ASYM",
        duo_scaling="both",
        targets=["Linear"],
        # offload_device=torch.device("cpu"),
    ),
]
# Select calibration dataset.
DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
def get_calib_dataset(tokenizer):
    from datasets import load_dataset


if __name__ == "__main__":
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)