Add vLLM e2e tests #117
Merged
Changes from 22 commits:
599afa2 add first test (dsikka)
1661f7c update tests (dsikka)
d9d687b update to use config files (dsikka)
c8dc2a9 update test (dsikka)
79281f5 update to add int8 tests (dsikka)
f49f9b4 update (dsikka)
ff01253 fix condition (dsikka)
b55d5e7 fix typo (dsikka)
7e247b5 add w8a16 (dsikka)
09bd1ed update (dsikka)
2c6beb0 update to clear session and delete dirs (dsikka)
6a94670 conditional import for vllm (dsikka)
21fc505 update (dsikka)
5820a4b update num samples (dsikka)
2357493 add more test cases; add custom recipe support (dsikka)
98346eb update model (dsikka)
1a078b6 updat recipe modifier (dsikka)
af946f2 Update fp8_weight_only.yaml (dsikka)
0766c1a add more test cases (dsikka)
45f99c2 try a larger model (dsikka)
0abaf11 revert (dsikka)
d6625dd add description; save model to hub post testing (dsikka)
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
scheme: FP8_DYNAMIC
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
scheme: FP8
dataset_id: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
recipe: tests/e2e/vLLM/recipes/FP8/recipe_fp8_weight_only_channel.yaml
scheme: FP8A16_channel
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
recipe: tests/e2e/vLLM/recipes/FP8/recipe_fp8_weight_only_per_tensor.yaml
scheme: FP8A16_tensor
New file tests/e2e/vLLM/configs/INT8/int8_channel_weight_static_per_tensor_act.yaml (7 additions):
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
recipe: tests/e2e/vLLM/recipes/INT8/recipe_int8_channel_weight_static_per_tensor_act.yaml
dataset_id: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
scheme: W8A8_channel_weight_static_per_tensor
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
scheme: W8A8
dataset_id: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
New file tests/e2e/vLLM/configs/INT8/int8_tensor_weight_static_per_tensor_act.yaml (7 additions):
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
recipe: tests/e2e/vLLM/recipes/INT8/recipe_int8_tensor_weight_static_per_tensor_act.yaml
dataset_id: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
scheme: W8A8_tensor_weight_static_per_tensor_act
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
scheme: W4A16_channel
dataset_id: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
recipe: tests/e2e/vLLM/recipes/WNA16/recipe_w4a16_channel_quant.yaml
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
scheme: W4A16
dataset_id: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
scheme: W8A16_channel
dataset_id: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
recipe: tests/e2e/vLLM/recipes/WNA16/recipe_w8a16_channel_quant.yaml
New config file:
cadence: "nightly"
test_type: "regression"
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
scheme: W8A16
dataset_id: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
New file tests/e2e/vLLM/recipes/FP8/recipe_fp8_weight_only_channel.yaml (9 additions):
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      sequential_update: false
      ignore: [lm_head]
      config_groups:
        group_0:
          weights: {num_bits: 8, type: float, symmetric: true, strategy: channel, dynamic: false}
          targets: [Linear]
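The difference between this recipe and its per-tensor counterpart is only the strategy field: channel keeps one scale per output channel, while tensor shares a single scale across the whole weight. A toy, dependency-free sketch of what that means (pure Python; the E4M3 maximum magnitude of 448 is a property of the FP8 format itself, not something defined in this PR):

```python
# Illustrative only: per-channel vs. per-tensor symmetric scale selection.
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def channel_scales(weight):
    """One scale per output channel (row), as `strategy: channel` implies."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight]

def tensor_scale(weight):
    """A single scale for the whole tensor, as `strategy: tensor` implies."""
    return max(abs(w) for row in weight for w in row) / FP8_E4M3_MAX

# A small-magnitude channel next to a large one: per-tensor scaling forces the
# small channel to share the large channel's scale, wasting its dynamic range.
weight = [
    [0.02, -0.03, 0.01],
    [4.0, -3.5, 2.2],
]
print(channel_scales(weight))
print(tensor_scale(weight))
```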
New file tests/e2e/vLLM/recipes/FP8/recipe_fp8_weight_only_per_tensor.yaml (9 additions):
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      sequential_update: false
      ignore: [lm_head]
      config_groups:
        group_0:
          weights: {num_bits: 8, type: float, symmetric: true, strategy: tensor, dynamic: false}
          targets: [Linear]
New file tests/e2e/vLLM/recipes/INT8/recipe_int8_channel_weight_static_per_tensor_act.yaml (10 additions):
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      sequential_update: false
      ignore: [lm_head]
      config_groups:
        group_0:
          weights: {num_bits: 8, type: int, symmetric: true, strategy: channel}
          input_activations: {num_bits: 8, type: int, symmetric: true, strategy: tensor}
          targets: [Linear]
New file tests/e2e/vLLM/recipes/INT8/recipe_int8_tensor_weight_static_per_tensor_act.yaml (10 additions):
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      sequential_update: false
      ignore: [lm_head]
      config_groups:
        group_0:
          weights: {num_bits: 8, type: int, symmetric: true, strategy: tensor}
          input_activations: {num_bits: 8, type: int, symmetric: true, strategy: tensor}
          targets: [Linear]
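Both INT8 recipes quantize input activations with a static per-tensor scale, meaning the scale is fixed ahead of time from calibration data rather than recomputed per batch. A toy sketch of symmetric INT8 quantization with a calibration-derived scale (illustrative only; this is not llmcompressor's implementation):

```python
# Illustrative only: symmetric INT8 quantization, as described by the recipes'
# {num_bits: 8, type: int, symmetric: true} entries.
INT8_MAX = 127

def quantize(values, scale):
    """Map floats to the int8 grid with a symmetric (zero-point-free) scale."""
    return [max(-INT8_MAX - 1, min(INT8_MAX, round(v / scale))) for v in values]

def dequantize(qvalues, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in qvalues]

# Stand-in calibration batch: the scale is chosen so the largest observed
# magnitude maps to INT8_MAX ("static" = fixed after calibration).
calibration = [0.5, -1.27, 0.9]
scale = max(abs(v) for v in calibration) / INT8_MAX
q = quantize(calibration, scale)
print(q)
print(dequantize(q, scale))
```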
New recipe file:
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      sequential_update: false
      ignore: [lm_head]
      config_groups:
        group_0:
          weights: {num_bits: 4, type: int, symmetric: true, strategy: channel, dynamic: false}
          targets: [Linear]
New recipe file:
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      sequential_update: false
      ignore: [lm_head]
      config_groups:
        group_0:
          weights: {num_bits: 8, type: int, symmetric: true, strategy: channel, dynamic: false}
          targets: [Linear]
New test file:
import shutil
import unittest

import pytest
from datasets import load_dataset
from parameterized import parameterized_class
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from tests.testing_utils import parse_params, requires_gpu, requires_torch

try:
    from vllm import LLM, SamplingParams

    vllm_installed = True
except ImportError:
    vllm_installed = False

# File paths to the directories containing the test configs
# for each of the quantization schemes
WNA16 = "tests/e2e/vLLM/configs/WNA16"
FP8 = "tests/e2e/vLLM/configs/FP8"
INT8 = "tests/e2e/vLLM/configs/INT8"


@requires_gpu
@requires_torch
@pytest.mark.skipif(not vllm_installed, reason="vLLM is not installed, skipping test")
@parameterized_class(parse_params([WNA16, FP8, INT8]))
class TestvLLM(unittest.TestCase):
    model = None
    scheme = None
    dataset_id = None
    dataset_split = None
    recipe = None

    def setUp(self):
        print("========== RUNNING ==============")
        print(self.scheme)

        self.save_dir = None
        self.device = "cuda:0"
        self.oneshot_kwargs = {}
        self.num_calibration_samples = 256
        self.max_seq_length = 1048
        self.prompts = [
            "The capital of France is",
            "The president of the US is",
            "My name is",
        ]

    def test_vllm(self):
        # Load model.
        loaded_model = SparseAutoModelForCausalLM.from_pretrained(
            self.model, device_map=self.device, torch_dtype="auto"
        )
        tokenizer = AutoTokenizer.from_pretrained(self.model)

        def preprocess(example):
            return {
                "text": tokenizer.apply_chat_template(
                    example["messages"],
                    tokenize=False,
                )
            }

        def tokenize(sample):
            return tokenizer(
                sample["text"],
                padding=False,
                max_length=self.max_seq_length,
                truncation=True,
                add_special_tokens=False,
            )

        if self.dataset_id:
            ds = load_dataset(self.dataset_id, split=self.dataset_split)
            ds = ds.shuffle(seed=42).select(range(self.num_calibration_samples))
            ds = ds.map(preprocess)
            ds = ds.map(tokenize, remove_columns=ds.column_names)
            self.oneshot_kwargs["dataset"] = ds
            self.oneshot_kwargs["max_seq_length"] = self.max_seq_length
            self.oneshot_kwargs["num_calibration_samples"] = (
                self.num_calibration_samples
            )

        self.save_dir = self.model.split("/")[1] + f"-{self.scheme}"
        self.oneshot_kwargs["model"] = loaded_model
        if self.recipe:
            self.oneshot_kwargs["recipe"] = self.recipe
        else:
            # If no recipe is provided, the test assumes the scheme is a
            # compatible preset from:
            # https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py
            self.oneshot_kwargs["recipe"] = QuantizationModifier(
                targets="Linear", scheme=self.scheme, ignore=["lm_head"]
            )

        # Apply quantization.
        print("ONESHOT KWARGS", self.oneshot_kwargs)
        oneshot(
            **self.oneshot_kwargs,
            output_dir=self.save_dir,
            clear_sparse_session=True,
            oneshot_device=self.device,
        )
        tokenizer.save_pretrained(self.save_dir)

        # Run vLLM with the saved model.
        print("================= RUNNING vLLM =========================")
        sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
        llm = LLM(model=self.save_dir)
        outputs = llm.generate(self.prompts, sampling_params)
        print("================= vLLM GENERATION ======================")
        for output in outputs:
            assert output
            prompt = output.prompt
            generated_text = output.outputs[0].text
            print("PROMPT", prompt)
            print("GENERATED TEXT", generated_text)

    def tearDown(self):
        shutil.rmtree(self.save_dir)
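One detail worth noting in the test above: the output directory is derived from the model id via model.split("/")[1], which assumes a namespaced id like org/name. Isolated as a standalone helper for illustration (the helper name is hypothetical; the expression mirrors the test):

```python
def save_dir_name(model_id: str, scheme: str) -> str:
    """Mirror the test's naming: 'org/name' plus a scheme becomes 'name-<scheme>'.

    Note this raises IndexError for a bare, un-namespaced model id.
    """
    return model_id.split("/")[1] + f"-{scheme}"

print(save_dir_name("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "FP8_DYNAMIC"))
```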
Review discussion:
"having a test for tp>1 is also a good idea if we can"
"Yah I think that'll be a follow-up test since the structure will change a bit to deal with tp>1 with the same process"
"I do think that's more of a vLLM test. If anything, we could extend this to publish test models which are then pulled down for all vLLM tests."
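If tp>1 coverage is added later, the config-driven structure extends naturally: a new config key could carry the parallelism degree and be forwarded to vLLM's LLM constructor, which accepts a tensor_parallel_size argument. A hedged sketch of the config-side plumbing only (the save_dir and tensor_parallel_size config keys are assumptions, not part of this PR):

```python
def vllm_kwargs(config):
    """Build keyword arguments for LLM(...) from a parsed test config.

    Defaulting tensor_parallel_size to 1 keeps existing single-GPU configs
    working unchanged. Both keys used here are hypothetical.
    """
    return {
        "model": config["save_dir"],
        "tensor_parallel_size": int(config.get("tensor_parallel_size", 1)),
    }
```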