[Bugfix] Rename files to remove colons (#846)
* rename files to remove colons

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* [Bugfix] Workaround tied tensors bug (#659)

* load offload state dict

* add test

* remove merge duplication

* prepare to fix tie_word_embeddings

* add full tests

* patch second bug

* comment out failing tests, point to next pr

* link to issue

* accommodate offloaded models in test

* add back passing test

* WIP

* add error if not in expected list

* apply style

* update passing failing list

* add shared tensors tests

* clean up

* add comment with link

* make failing tests a todo

* Remove failing tests

* explicitly set safe_serialization

* separate out gpu tests, apply style

---------

Co-authored-by: Kyle Sayers <kyle@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* only untie word embeddings (#839)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* check for config hidden size (#840)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* Use float32 for Hessian dtype (#847)

* use float32 for hessian dtype

* explicitly set inp dtype as well

* float precision for obcq hessian

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* GPTQ: Deprecate non-sequential update option (#762)

* remove from gptq, apply style

* remove instances of sequential_update argument in GPTQ tests

* update examples

* update example tests

* documentation, remove from example

* apply style

* revert back to auto type

* apply style

---------

Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* Typehint nits (#826)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* [ DOC ] Remove version restrictions in W8A8 example (#849)

The latest compressed-tensors release, 0.8.0, removed some APIs
(https://github.com/neuralmagic/compressed-tensors/pull/156/files).
With an older llmcompressor installed from pip, it throws an
error like:
```
ImportError: cannot import name 'update_layer_weight_quant_params' from 'compressed_tensors.quantization'
```

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* Fix inconsistency (#80)

Use group strategy with 128 group size instead of channel

Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* 2of4

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* revert change to unrelated example

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* rename test file

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* fix fwd func call (#845)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Kyle Sayers <kylesayers@sophon-3.mynetworksettings.com>
Co-authored-by: Kyle Sayers <kyle@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Jincheng Miao <jincheng.miao@intel.com>
Co-authored-by: 黄石 <yzlnew@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
6 people committed Nov 21, 2024
1 parent 36d50f2 commit fa328db
Showing 5 changed files with 13 additions and 13 deletions.
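
The rename at the heart of this commit can be sketched as a small script. This is a minimal illustration, not the procedure the authors used: the helper name `rename_without_colons` and its defaults are hypothetical, and in a real repository `git mv` would be the usual way to rename tracked files. The commit message does not state a motivation; one well-known reason to avoid colons is that Windows does not allow them in file names.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def rename_without_colons(
    root: Path, old: str = "2:4", new: str = "2of4"
) -> list[tuple[Path, Path]]:
    """Rename every file under `root` whose name contains `old`.

    Hypothetical helper for illustration only; returns (source, target) pairs.
    """
    renamed = []
    # sorted() materializes the listing first, so renames do not disturb the walk.
    for path in sorted(root.rglob("*")):
        if path.is_file() and old in path.name:
            target = path.with_name(path.name.replace(old, new))
            if target.exists():
                # Refuse to overwrite an existing file silently.
                raise FileExistsError(target)
            path.rename(target)
            renamed.append((path, target))
    return renamed


if __name__ == "__main__":
    # Demo in a throwaway directory. Creating the colon-named file only
    # works on filesystems that permit colons (e.g. typical Linux ones).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "2:4_w4a16_recipe.yaml").write_text("# placeholder\n")
        for src, dst in rename_without_colons(root):
            print(f"{src.name} -> {dst.name}")
```

Renaming alone is not enough: every reference to the old names in the README, the example script, and the tests has to be updated as well, which is what the 13 changed lines in this commit do.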
@@ -29,7 +29,7 @@ This example uses LLMCompressor and Compressed-Tensors to create a 2:4 sparse an
The model is calibrated and trained with the ultachat200k dataset.
At least 75GB of GPU memory is required to run this example.

-Follow the steps below, or to run the example as `python examples/quantization_24_sparse_w4a16/llama7b_sparse_w4a16.py`
+Follow the steps below, or to run the example as `python examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py`

## Step 1: Select a model, dataset, and recipe
In this step, we select which model to use as a baseline for sparsification, a dataset to
@@ -40,7 +40,7 @@ Models can reference a local directory, or a model in the huggingface hub.
Datasets can be from a local compatible directory or the huggingface hub.

Recipes are YAML files that describe how a model should be optimized during or after training.
-The recipe used for this flow is located in [2:4_w4a16_recipe.yaml](./2:4_w4a16_recipe.yaml).
+The recipe used for this flow is located in [2of4_w4a16_recipe.yaml](./2of4_w4a16_recipe.yaml).
It contains instructions to prune the model to 2:4 sparsity, run one epoch of recovery finetuning,
and quantize to 4 bits in one show using GPTQ.

@@ -56,18 +56,18 @@ model = SparseAutoModelForCausalLM.from_pretrained(
dataset = "ultrachat-200k"
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}

-recipe = "2:4_w4a16_recipe.yaml"
+recipe = "2of4_w4a16_recipe.yaml"
```

## Step 2: Run sparsification using `apply`
The `apply` function applies the given recipe to our model and dataset.
The hardcoded kwargs may be altered based on each model's needs.
-After running, the sparsified model will be saved to `output_llama7b_2:4_w4a16_channel`.
+After running, the sparsified model will be saved to `output_llama7b_2of4_w4a16_channel`.

```python
from llmcompressor.transformers import apply

-output_dir = "output_llama7b_2:4_w4a16_channel"
+output_dir = "output_llama7b_2of4_w4a16_channel"

apply(
    model=model,
@@ -98,12 +98,12 @@ run the following:
import torch
from llmcompressor.transformers import SparseAutoModelForCausalLM

-compressed_output_dir = "output_llama7b_2:4_w4a16_channel_compressed"
+compressed_output_dir = "output_llama7b_2of4_w4a16_channel_compressed"
model = SparseAutoModelForCausalLM.from_pretrained(output_dir, torch_dtype=torch.bfloat16)
model.save_pretrained(compressed_output_dir, save_compressed=True)
```

### Custom Quantization
The current repo supports multiple quantization techniques configured using a recipe. Supported strategies are `tensor`, `group` and `channel`.
-The above recipe (`2:4_w4a16_recipe.yaml`) uses channel-wise quantization specified by `strategy: "channel"` in its config group.
-To use quantize per tensor, change strategy from `channel` to `tensor`. To use group size quantization, change from `channel` to `group` and specify its value, say 128, by including `group_size: 128`. A group size quantization example is shown in `2:4_w4a16_group-128_recipe.yaml`.
+The above recipe (`2of4_w4a16_recipe.yaml`) uses channel-wise quantization specified by `strategy: "channel"` in its config group.
+To use quantize per tensor, change strategy from `channel` to `tensor`. To use group size quantization, change from `channel` to `group` and specify its value, say 128, by including `group_size: 128`. A group size quantization example is shown in `2of4_w4a16_group-128_recipe.yaml`.
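
To make the difference between the three strategies named above concrete, the sketch below counts how many quantization scales each one would need for a single linear-layer weight. It assumes the common convention that the weight has shape `(out_features, in_features)`, that `channel` means one scale per output channel, and that `group` splits each row into blocks of `group_size` input values. The function `num_scales` is hypothetical and is not part of LLMCompressor or Compressed-Tensors.

```python
from typing import Optional


def num_scales(
    out_features: int,
    in_features: int,
    strategy: str,
    group_size: Optional[int] = None,
) -> int:
    """Count the scales needed for one (out_features, in_features) weight.

    Hypothetical helper for illustration; the conventions are assumptions.
    """
    if strategy == "tensor":
        # One scale shared by the whole weight matrix.
        return 1
    if strategy == "channel":
        # One scale per output channel (per row of the weight).
        return out_features
    if strategy == "group":
        if group_size is None or group_size <= 0:
            raise ValueError("group strategy needs a positive group_size")
        if in_features % group_size != 0:
            raise ValueError("in_features must be divisible by group_size")
        # One scale per block of group_size input values in each row.
        return out_features * (in_features // group_size)
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    # A 4096 x 4096 projection, chosen only as an example size.
    for strategy, group_size in [("tensor", None), ("channel", None), ("group", 128)]:
        print(strategy, num_scales(4096, 4096, strategy, group_size))
```

Under these assumptions, finer granularity stores more scales (and so slightly more metadata) in exchange for scales that fit smaller regions of the weight.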
@@ -3,7 +3,7 @@
from llmcompressor.transformers import SparseAutoModelForCausalLM, apply

# define a recipe to handle sparsity, finetuning and quantization
-recipe = "2:4_w4a16_recipe.yaml"
+recipe = "2of4_w4a16_recipe.yaml"

# load the model in as bfloat16 to save on memory and compute
model_stub = "neuralmagic/Llama-2-7b-ultrachat200k"
@@ -15,7 +15,7 @@
dataset = "ultrachat-200k"

# save location of quantized model
-output_dir = "output_llama7b_2:4_w4a16_channel"
+output_dir = "output_llama7b_2of4_w4a16_channel"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}
@@ -16,14 +16,14 @@

 @pytest.fixture
 def example_dir() -> str:
-    return "examples/quantization_24_sparse_w4a16"
+    return "examples/quantization_2of4_sparse_w4a16"


 @pytest.mark.example
 @requires_gpu_count(1)
 class TestQuantization24SparseW4A16:
     """
-    Tests for examples in the "quantization_24_sparse_w4a16" example folder.
+    Tests for examples in the "quantization_2of4_sparse_w4a16" example folder.
     """

     def test_doc_example_command(self, example_dir: str, tmp_path: Path):
@@ -52,7 +52,7 @@ def test_alternative_recipe(self, example_dir: str, tmp_path: Path):
         script_path = tmp_path / example_dir / script_filename
         content = script_path.read_text(encoding="utf-8")
         content = content.replace(
-            "2:4_w4a16_recipe.yaml", "2:4_w4a16_group-128_recipe.yaml"
+            "2of4_w4a16_recipe.yaml", "2of4_w4a16_group-128_recipe.yaml"
         )
         script_path.write_text(content, encoding="utf-8")
