
[UX] Allow quantization of weights without calibration data #28

Closed
mgoin opened this issue Jul 19, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

mgoin (Member) commented Jul 19, 2024

When using round-to-nearest or static scaling for quantization formats like FP8 weights (i.e., applying QuantizationModifier to weights only), there is no need for calibration data. Ideally, no forward pass should be required at all.

Proposed UX:

from transformers import AutoModelForCausalLM
from compressed_tensors.quantization import QuantizationArgs, QuantizationType, QuantizationScheme, QuantizationStrategy
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

FP8_W8 = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.TENSOR,
        symmetric=True,
        dynamic=False,
    ),
)

recipe = QuantizationModifier(
    config_groups={"group_0": FP8_W8},
    ignore=["lm_head"],
)

oneshot(
    model=model,
    recipe=recipe,
)
mgoin added the enhancement label Jul 19, 2024
Satrat (Contributor) commented Jul 19, 2024

Hey @mgoin, we do have a PR up for this in compressed-tensors: neuralmagic/compressed-tensors#108, where the weight scale gets initialized when quantization is applied. There is still more work to do here, as it doesn't yet account for the per-token use case, which also doesn't require calibration.
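For context on why the per-token case also needs no calibration: each token's scale is computed at runtime from that token's own activations. A minimal sketch follows, using an int8 grid for simplicity; all names here are illustrative, not the compressed-tensors API:

```python
INT8_MAX = 127  # int8 grid used for illustration; FP8 works analogously

def quantize_per_token(activations):
    """Dynamic per-token quantization: each row (token) gets its own
    scale from its runtime values, so no calibration pass is needed.
    Returns (quantized rows, per-token scales)."""
    q_rows, scales = [], []
    for row in activations:
        abs_max = max(abs(x) for x in row)
        scale = abs_max / INT8_MAX if abs_max else 1.0
        q_rows.append([round(x / scale) for x in row])
        scales.append(scale)
    return q_rows, scales
```

The scales exist only transiently at inference time, which is why neither the weight-only nor the per-token path should require a dataset.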

mgoin (Member, Author) commented Jul 19, 2024

Very nice, thanks @Satrat. Could you share an example of what the UX would look like? I'm not sure how quantization is applied outside of oneshot, or whether it will run without the dataset argument.

Satrat (Contributor) commented Jul 24, 2024

Satrat (Contributor) commented Aug 6, 2024

Closing this out as the PRs mentioned above got merged; this feature is now in main.

Satrat closed this as completed Aug 6, 2024