
[ Examples ] E2E Examples #5

Merged: 30 commits merged into main from rs/examples on Jul 2, 2024
Conversation

robertgshaw2-neuralmagic (Collaborator) commented on Jun 25, 2024:

SUMMARY:

  • Added examples where user controls dataset preprocessing
  • Added examples with leading models
  • W8A8 Channelwise Weights, Dynamic Per Token Example (Llama-3-8B-Instruct) - GPTQ and SmoothQuant
  • W4A16 G=128 Weights Example (Llama-3-8B-Instruct) - GPTQ
  • READMEs for the full user flow

FOLLOW UP PRs:

  • Updated W4A16 example to use act_order=True (once supported)
  • Migrate W8A8 to code based modifiers

@robertgshaw2-neuralmagic changed the title from [ Examples ] W8A8 and W4A16 Examples to [ Examples ] E2E Examples on Jun 27, 2024
Review thread on examples/quantization/example-w4a16.py (outdated, resolved)
Satrat (Contributor) left a comment:

There are 143 files changed here, could you break out the removal of the copyright headers into a separate PR?

Review thread on examples/quantization/llama7b_fp8_quantization.py (outdated, resolved)
Comment on lines 95 to 100
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
Contributor:

Note that this will save an uncompressed copy of the model. If that isn't desired, we should set output_dir=SAVE_DIR here; then lines 103-105 will overwrite the uncompressed model.

robertgshaw2-neuralmagic (Collaborator, Author):

Is there a way to not save via output_dir?

Contributor:

You can set output_dir to None explicitly; by default it saves to "./output".
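For context, a minimal sketch of the two options discussed in this thread (the oneshot import path is an assumption; model, ds, recipe, and the constants come from the example script under review):

```python
from llmcompressor.transformers import oneshot  # assumed import path

# Option 1: pass output_dir so oneshot saves the compressed model directly,
# avoiding the extra uncompressed copy noted above.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir=SAVE_DIR,
)

# Option 2: suppress saving inside oneshot entirely; per the comment above,
# omitting output_dir saves to "./output" by default.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir=None,
)
```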

robertgshaw2-neuralmagic (Collaborator, Author):

> There are 143 files changed here, could you break out the removal of the copyright headers into a separate PR?

I'm not seeing the 143 files?

Satrat (Contributor) commented on Jul 2, 2024:

> > There are 143 files changed here, could you break out the removal of the copyright headers into a separate PR?
>
> I'm not seeing the 143 files?

Never mind, when I reviewed it was showing the removal of all the copyright headers as changes; I just checked again and it's fixed.

robertgshaw2-neuralmagic (Collaborator, Author):

@bfineran @Satrat

Examples are ready to go. Once we clean up the defaults for W8A8 INT8, I can update the recipe accordingly.

We first select the quantization algorithm.

In our case, we will apply the default GPTQ recipe for `int4` (which uses static group size 128 scales) to all linear layers.
> See the `Recipes` documentation for more information on making complex recipes
Contributor:

Can we add a link here, or does this documentation not exist yet?

robertgshaw2-neuralmagic (Collaborator, Author):

It doesn't exist yet.
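A minimal sketch of what this default int4 GPTQ recipe could look like in modifier form (the GPTQModifier import path and the W4A16 scheme name are assumptions, mirroring the FP8 syntax quoted later in this review):

```python
from llmcompressor.modifiers.quantization import GPTQModifier  # assumed import path

# Assumed sketch: GPTQ int4 with static group-size-128 scales applied to all
# Linear layers; lm_head is excluded, as in the FP8 example in this PR.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
```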

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm. In our case, we will apply the default recipe for `fp8` (which uses static-per-tensor weights and static-per-tensor activations) to all linear layers.
> See the `Recipes` documentation for more information on making complex recipes
Contributor:

Same here.

robertgshaw2-neuralmagic (Collaborator, Author):

It doesn't exist yet.
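For reference, the fp8 recipe syntax from this PR (quoted verbatim in mgoin's review below; only the import path is an assumption):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier  # assumed import path

# Default fp8 scheme: static per-tensor weights and static per-tensor
# activations applied to all Linear layers, with lm_head ignored.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])
```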

* Quantize the weights to 8 bits with channelwise scales using GPTQ
* Quantize the activations with a dynamic per-token strategy

> See the `Recipes` documentation for more information on recipes
Contributor:

And here.

robertgshaw2-neuralmagic (Collaborator, Author):

It doesn't exist yet.
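A hedged sketch of the W8A8 recipe described above, combining SmoothQuant and GPTQ as the PR summary mentions (the modifier import paths, the W8A8 scheme name, and the smoothing_strength value are assumptions):

```python
from llmcompressor.modifiers.quantization import GPTQModifier        # assumed import path
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier  # assumed import path

# Assumed sketch: SmoothQuant migrates activation outliers into the weights,
# then GPTQ quantizes the weights to 8 bits with channelwise scales; the
# activations use a dynamic per-token strategy under the W8A8 scheme.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # 0.8 is an illustrative value
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
```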

Satrat (Contributor) left a comment:

LGTM, will we need to re-run the evals after the preset schemes land?

robertgshaw2-neuralmagic (Collaborator, Author):

> LGTM, will we need to re-run the evals after the preset schemes land?

The evals take ~10 seconds, so they are easy to re-run. The whole flow end-to-end takes ~10 minutes on an H100.

robertgshaw2-neuralmagic (Collaborator, Author):

@Satrat how do I run the linting?

mgoin (Member) left a comment:

LGTM! Just not a fan of the current modifier scheme syntax, i.e. recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"]). I would like to import and use Python objects directly for the schemes, for easy lookup in source.
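To illustrate the suggestion, a purely hypothetical sketch of scheme objects that can be imported and looked up in source (neither the module path nor the FP8 object exists in this PR):

```python
# Hypothetical API sketch, not part of this PR: the scheme is an importable
# Python object rather than the string "FP8", so its definition can be
# jumped to directly in source.
from llmcompressor.quantization.schemes import FP8  # hypothetical module

recipe = QuantizationModifier(targets="Linear", scheme=FP8, ignore=["lm_head"])
```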

Review thread on examples/quantization_w4a16/README.md (outdated, resolved)
robertgshaw2-neuralmagic and others added 2 commits on July 2, 2024 at 18:13
Co-authored-by: Michael Goin <michael@neuralmagic.com>

@robertgshaw2-neuralmagic merged commit d746398 into main on Jul 2, 2024 (5 of 8 checks passed)
@robertgshaw2-neuralmagic deleted the rs/examples branch on July 2, 2024 at 18:14
markmc pushed a commit to markmc/llm-compressor that referenced this pull request on Nov 13, 2024:

* draft
* add memoryless
* run bin.quant
* before tests, correctness verified
* specify sparsezoo version
* remove sparsezoo

Co-authored-by: Benjamin Fineran <benjaminfineran@gmail.com>