[ Examples ] E2E Examples #5

Merged: 30 commits (Jul 2, 2024)

Commits
- a8c3ad8 added examples (robertgshaw2-neuralmagic, Jun 24, 2024)
- 539d31a updated examples (robertgshaw2-neuralmagic, Jun 24, 2024)
- 6f298a7 set to 32 samples for testing (robertgshaw2-neuralmagic, Jun 24, 2024)
- cfc1ec0 fix (robertgshaw2-neuralmagic, Jun 25, 2024)
- 82e8910 Update llama7b_quantize_sparse_cnn.py (robertgshaw2-neuralmagic, Jun 25, 2024)
- 62f8011 Merge branch 'main' into rs/examples (robertgshaw2-neuralmagic, Jun 25, 2024)
- af0be23 tweak W8A8 (robertgshaw2-neuralmagic, Jun 25, 2024)
- 931c504 firx w4a16 (robertgshaw2-neuralmagic, Jun 26, 2024)
- e12b65e added example (robertgshaw2-neuralmagic, Jun 27, 2024)
- 982e3ee tweak fp8 example (Jun 27, 2024)
- 5971dce remove changes (Jun 27, 2024)
- 438b01e fix (Jun 27, 2024)
- 8822f3c update examples to use tokenized data (Jun 27, 2024)
- a6bcb90 save (Jun 27, 2024)
- 466cdb6 Merge branch 'main' into rs/examples (robertgshaw2-neuralmagic, Jul 2, 2024)
- f430e43 fp8 example end to end (robertgshaw2-neuralmagic, Jul 2, 2024)
- b0eaf12 tweak README (robertgshaw2-neuralmagic, Jul 2, 2024)
- a020ebe rename title (robertgshaw2-neuralmagic, Jul 2, 2024)
- 7c58ff4 update title (robertgshaw2-neuralmagic, Jul 2, 2024)
- 556eca2 finished example (robertgshaw2-neuralmagic, Jul 2, 2024)
- 39f2ef0 refactored directory structure (robertgshaw2-neuralmagic, Jul 2, 2024)
- 284a0f0 nits (robertgshaw2-neuralmagic, Jul 2, 2024)
- 2da06f9 restructure w4a16 (robertgshaw2-neuralmagic, Jul 2, 2024)
- 367fb0f fixed w4a16 (robertgshaw2-neuralmagic, Jul 2, 2024)
- 956e1a4 added w8a8-int8 example (robertgshaw2-neuralmagic, Jul 2, 2024)
- 5911c45 finalized example (robertgshaw2-neuralmagic, Jul 2, 2024)
- 3d4d03b added back example (robertgshaw2-neuralmagic, Jul 2, 2024)
- d600009 stash (robertgshaw2-neuralmagic, Jul 2, 2024)
- 708f288 format (robertgshaw2-neuralmagic, Jul 2, 2024)
- 59ea79e Update examples/quantization_w4a16/README.md (robertgshaw2-neuralmagic, Jul 2, 2024)
The diff below shows changes from 2 of the 30 commits.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -15,7 +15,7 @@ A clear and concise description of what you expected to happen.
Include all relevant environment information:
1. OS [e.g. Ubuntu 18.04]:
2. Python version [e.g. 3.7]:
-3. SparseML version or commit hash [e.g. 0.1.0, `f7245c8`]:
+3. LLM Compressor version or commit hash [e.g. 0.1.0, `f7245c8`]:
4. ML framework version(s) [e.g. torch 1.7.1]:
5. Other Python package versions [e.g. SparseZoo, DeepSparse, numpy, ONNX]:
6. Other relevant environment information [e.g. hardware, CUDA version]:
6 changes: 6 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,6 @@
SUMMARY:
"please provide a brief summary"


TEST PLAN:
"please outline how the changes were tested"
1 change: 1 addition & 0 deletions .gitignore
@@ -801,3 +801,4 @@ nm_temp_test_logs/*
sparse_logs/*
wandb/
output_finetune/
+env_log.json
103 changes: 35 additions & 68 deletions CONTRIBUTING.md
@@ -1,86 +1,53 @@
-<!--
-Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->
-
-TODO: update for upstream push
-
-# Contributing to SparseML
-
-If you’re reading this, hopefully we have piqued your interest to take the next step. Join us and help make SparseML even better! As a contributor, here are some community guidelines we would like you to follow:
-
-- [Code of Conduct](#code-of-conduct)
-- [Ways to Contribute](#ways-to-contribute)
-- [Bugs and Feature Requests](#bugs-and-feature-requests)
-- [Question or Problem](#question-or-problem)
-- [Developing SparseML](DEVELOPING.md)
-
-## Code of Conduct
-
-Help us keep the software inclusive. Please read and follow our [Code of Conduct](https://github.com/neuralmagic/sparseml/blob/main/CODE_OF_CONDUCT.md) in order to promote an environment that is friendly, fair, respectful, and safe. We want to inspire collaboration, innovation, and fun!
-
-## Ways to Contribute
-
-Whether you’re a newbie, dabbler, or expert, we appreciate you jumping in.
-
-### Contributing Code
-
-- Make pull requests for addressing bugs, open issues, and documentation
-- Neural Magic as the maintainer will do reviews and final merge
-
-### Reporting In
-
-- See something, say something: bugs, documentation
-- Propose new feature requests to Neural Magic
-
-### Helping Others
-
-- Answer open discussion topics
-- Spread the word about SparseML
-- Teach and empower others. This is the way!
-
-## Bugs and Feature Requests
-
-Please search through existing issues and requests first to avoid duplicates. Neural Magic will work with you further to take next steps.
-
-- Go to: [GitHub Issues](https://github.com/vllm-project/llm-compressor/issues)
-
-For bugs, include:
-
-- brief summary
-- OS/Environment details
-- steps to reproduce (s.t.r.)
-- code snippets, screenshots/casts, log content, sample models
-- add the GitHub label "bug" to your post
-
-For feature requests, include:
-
-- problem you’re trying to solve
-- community benefits
-- other relevant details to support your proposal
-- add the GitHub label "enhancement" to your post
-
-For documentation edits, include:
-
-- current state, proposed state
-- if applicable, screenshots/casts
-- add the GitHub label "documentation" to your post
-
-## Question or Problem
-
-Sign up or log in to our [**Neural Magic Community Slack**](https://neuralmagic.com/community/). We are growing the community member by member and happy to see you there. Don’t forget to search through existing discussions to avoid duplication! Thanks!
-
-## Developing SparseML
-
-Made it this far? Review [Developing SparseML](DEVELOPING.md) to get started.
+# Contributing to LLM Compressor
+
+Thank you for your interest in contributing to LLM Compressor!
+Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large.
+There are several ways you can contribute to the project:
+
+- Identify and report any issues or bugs.
+- Request or add new compression methods or research.
+- Suggest or implement new features.
+
+However, remember that contributions aren't just about code.
+We believe in the power of community support; thus, answering queries, assisting others, and enhancing the documentation are highly regarded and beneficial contributions.
+
+Finally, one of the most impactful ways to support us is by raising awareness about LLM Compressor and the vLLM community.
+Talk about it in your blog posts, highlighting how it's driving your incredible projects.
+Express your support on Twitter if vLLM aids you, or simply offer your appreciation by starring our repository.
+
+## Setup for development
+
+### Install from source
+
+```bash
+pip install -e ./[dev]
+```
+
+### Code Styling and Formatting checks
+
+```bash
+make style
+make quality
+```
+
+### Testing
+
+```bash
+make test
+```
+
+## Contributing Guidelines
+
+### Issue Reporting
+
+If you encounter a bug or have a feature request, please check our issues page first to see if someone else has already reported it.
+If not, please file a new issue, providing as much relevant information as possible.
+
+### Pull Requests & Code Reviews
+
+Please check the PR checklist in the [PR template](.github/PULL_REQUEST_TEMPLATE.md) for a detailed contribution guide.
+
+### Thank You
+
+Finally, thank you for taking the time to read these guidelines and for your interest in contributing to LLM Compressor.
+Your contributions make LLM Compressor a great tool for everyone!
50 changes: 8 additions & 42 deletions DEVELOPING.md
@@ -1,25 +1,7 @@
-<!--
-Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->
-
-TODO: update for upstream push
-
-# Developing SparseML
-
-SparseML is developed and tested using Python 3.8-3.11.
-To develop SparseML, you will also need the development dependencies and to follow the styling guidelines.
+# Developing LLM Compressor
+
+LLM Compressor is developed and tested using Python 3.8-3.11.
+To develop LLM Compressor, you will also need the development dependencies and to follow the styling guidelines.

Here are some details to get started.

@@ -33,17 +15,7 @@ cd llm-compressor
python3 -m pip install -e "./[dev]"
```

-This will clone the SparseML repo, install it, and install the development dependencies.
-
-To develop framework specific features, you will also need the relevant framework packages.
-Those can be installed by adding the framework name to the install extras. Frameworks include
-`torch`, `keras`, and `tensorflow_v1`. For example:
-```bash
-python3 -m pip install -e "./[dev,torch]"
-```
-
-Note: Running all pytorch tests using `make test TARGETS=torch` also requires `torchvision`
-and `onnxruntime`; install all these dependencies using `python3 -m pip install -e "./[dev, torch, torchvision, onnxruntime]"`
+This will clone the LLM Compressor repo, install it, and install the development dependencies.

**Code Styling and Formatting checks**

@@ -52,22 +24,16 @@ make style
make quality
```

-This will run automatic code styling using `black` and `isort` and test that the
+This will run automatic code styling using `ruff`, `flake8`, `black`, and `isort` to test that the
repository's code matches its standards.

**EXAMPLE: test changes locally**

```bash
-make test TARGETS=<CSV of frameworks to run>
+make test
```

-This will run the targeted SparseML unit tests for the frameworks specified.
-The targets should be specified, because not all framework dependencies can be installed to run all tests.
-
-To run just PyTorch tests, run
-```bash
-make test TARGETS=pytorch
-```
+This will run the targeted LLM Compressor unit tests.

File any error found before changes as an Issue and fix any errors found after making changes before submitting a Pull Request.

@@ -92,7 +58,7 @@ File any error found before changes as an Issue and fix any errors found after m
3. Add a remote to keep up with upstream changes.

```bash
-git remote add upstream https://github.com/neuralmagic/sparseml.git
+git remote add upstream https://github.com/vllm-project/llm-compressor.git
```

If you already have a copy, fetch upstream changes.
22 changes: 3 additions & 19 deletions examples/finetuning/configure_fsdp.md
@@ -1,31 +1,15 @@
-<!--
-Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->
-
# Configuring FSDP for Sparse Finetuning

An example FSDP configuration file, `example_fsdp_config.yaml`, is provided in this
-folder. It can be used out of the box by editting the `num_processes` parameter to
+folder. It can be used out of the box by editing the `num_processes` parameter to
fit the number of GPUs on your machine.

You can also customize your own config file by running the following prompt
```
accelerate config
```

-An FSDP config file can be passed to the SparseML finetuning script like this:
+An FSDP config file can be passed to the LLM Compressor finetuning script like this:
```
-accelerate launch --config_file example_fsdp_config.yaml --no_python sparseml.transformers.text_generation.finetune
+accelerate launch --config_file example_fsdp_config.yaml --no_python llmcompressor.transformers.text_generation.finetune
```
38 changes: 38 additions & 0 deletions examples/quantization/llama7b_fp8_quantization.py
@@ -0,0 +1,38 @@
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

model_stub = "meta-llama/Meta-Llama-3-8B-Instruct"
output_dir = "Meta-Llama-3-8B-Instruct-FP8-Compressed"
num_calibration_samples = 512

tokenizer = AutoTokenizer.from_pretrained(model_stub, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token


def preprocess(batch):
    # apply the model's chat template, then tokenize to at most 2048 tokens
    text = tokenizer.apply_chat_template(batch["messages"], tokenize=False)
    tokenized = tokenizer(text, padding=True, truncation=True, max_length=2048)
    return tokenized


ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
examples = ds.map(preprocess, remove_columns=ds.column_names)

# recipe: one-shot FP8 quantization of all Linear layers
recipe = QuantizationModifier(targets="Linear", scheme="FP8")

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.bfloat16, device_map="auto"
)

oneshot(
    model=model,
    dataset=examples,
    recipe=recipe,
    output_dir=output_dir,
    num_calibration_samples=num_calibration_samples,
    save_compressed=True,
)
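
A checkpoint saved with `save_compressed=True` is written in the compressed-tensors format that vLLM consumes. As a rough usage sketch (not part of this PR, and assuming a vLLM install with FP8 compressed-tensors support), serving the directory produced above might look like:

```python
# Minimal sketch: load the compressed checkpoint with vLLM and generate.
from vllm import LLM, SamplingParams

llm = LLM(model="Meta-Llama-3-8B-Instruct-FP8-Compressed")  # output_dir from the script above
sampling = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What does FP8 quantization change at inference time?"], sampling)
print(outputs[0].outputs[0].text)
```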
2 changes: 1 addition & 1 deletion examples/quantization/llama7b_one_shot_quantization.md
@@ -1,7 +1,7 @@
# Creating a Quantized Llama Model in One Shot

Quantizing a model to a lower precision can save on both memory and speed at inference time.
-This example demonstrates how to use the SparseML API to quantize a Llama model from 16 bits
+This example demonstrates how to use the LLM Compressor API to quantize a Llama model from 16 bits
to 4 bits and save it to a compressed-tensors format for inference with vLLM.

## Step 1: Select a model and dataset
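
The remainder of this document is collapsed in the diff view. For orientation only, here is a minimal sketch of the 16-bit-to-4-bit flow it describes, mirroring the FP8 script above and assuming a `"W4A16"` preset scheme exists; the file's actual steps may differ:

```python
# Hypothetical W4A16 (4-bit weight, 16-bit activation) one-shot sketch; not the file's contents.
import torch

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

model = SparseAutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

# assumed preset: 4-bit weights, activations left in 16 bits; lm_head kept unquantized
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="ultrachat-200k",  # built-in dataset alias used elsewhere in this PR
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-W4A16",
    num_calibration_samples=512,
    save_compressed=True,
)
```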
@@ -11,7 +11,7 @@
    model_stub, torch_dtype=torch.bfloat16, device_map="auto"
)

-# uses SparseML's built-in preprocessing for ultra chat
+# uses LLM Compressor's built-in preprocessing for ultra chat
dataset = "ultrachat-200k"

# save location of quantized model
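
The rest of this script is also collapsed. To illustrate what the built-in preprocessing replaces, here is a hedged comparison of the two ways calibration data is supplied in this PR's examples; `model`, `recipe`, and `output_dir` stand in for the script's collapsed variables:

```python
# Option 1: pass the built-in alias; LLM Compressor applies its ultrachat preprocessing.
oneshot(model=model, dataset="ultrachat-200k", recipe=recipe, output_dir=output_dir)

# Option 2: pass a pre-tokenized dataset, as in the FP8 example earlier in this PR.
oneshot(model=model, dataset=examples, recipe=recipe, output_dir=output_dir)
```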