[GPTQ] Fix test #28018

Merged: 3 commits, Jan 15, 2024
27 changes: 13 additions & 14 deletions in tests/quantization/gptq/test_gptq.py
@@ -217,7 +217,9 @@ def test_serialization(self):
with tempfile.TemporaryDirectory() as tmpdirname:
self.quantized_model.save_pretrained(tmpdirname)
if not self.use_exllama:
quantized_model_from_saved = AutoModelForCausalLM.from_pretrained(tmpdirname).to(0)
quantized_model_from_saved = AutoModelForCausalLM.from_pretrained(
tmpdirname, quantization_config=GPTQConfig(use_exllama=False, bits=4)
).to(0)
self.check_quantized_layers_type(quantized_model_from_saved, "cuda-old")
else:
# we need to put it directly to the gpu. Otherwise, we won't be able to initialize the exllama kernel
@@ -242,12 +244,11 @@ def test_change_loading_attributes(self):
with tempfile.TemporaryDirectory() as tmpdirname:
self.quantized_model.save_pretrained(tmpdirname)
if not self.use_exllama:
self.assertEqual(self.quantized_model.config.quantization_config.use_exllama, False)
Collaborator:

Why remove this line?

Member Author:

See above.

self.check_quantized_layers_type(self.quantized_model, "cuda-old")
# we need to put it directly to the gpu. Otherwise, we won't be able to initialize the exllama kernel
quantized_model_from_saved = AutoModelForCausalLM.from_pretrained(
tmpdirname, quantization_config=GPTQConfig(use_exllama=True, bits=4), device_map={"": 0}
)
self.assertEqual(quantized_model_from_saved.config.quantization_config.use_exllama, True)
Collaborator:

Why remove this line?

Member Author:

With this PR, we don't save all the arguments anymore (only those in self.serialization_keys), by modifying to_dict(). The issue with that is that we then update the quantization_config based on the one from optimum: config.quantization_config = GPTQConfig.from_dict_optimum(quantizer.to_dict()). This line was needed because some args, such as use_exllama, could change in optimum.

I was thinking of opening a PR to remove this line, and maybe stop saving inference-related args (use_exllama, ...), or revert the PR on optimum. What are your thoughts?
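
As a rough illustration of the dropped-argument behaviour being described (ToyQuantizer and its defaults are made up; only serialization_keys, to_dict() and use_exllama mirror names from the comment above, and this is not the actual optimum or transformers implementation):

class ToyQuantizer:
    """Hypothetical stand-in for the optimum GPTQ quantizer, not the real class."""

    # Only these attributes are written out by to_dict() after the change.
    serialization_keys = ["bits", "group_size", "dataset"]

    def __init__(self, bits=4, group_size=128, dataset=None, use_exllama=True):
        self.bits = bits
        self.group_size = group_size
        self.dataset = dataset
        # Inference-time knob, deliberately absent from serialization_keys.
        self.use_exllama = use_exllama

    def to_dict(self):
        # Only the whitelisted keys survive; use_exllama is dropped here.
        return {key: getattr(self, key) for key in self.serialization_keys}


print(ToyQuantizer(use_exllama=False).to_dict())
# -> {'bits': 4, 'group_size': 128, 'dataset': None}, with no 'use_exllama' entry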

Collaborator (@amyeroberts, Dec 15, 2023):

> This line was needed because some args, such as use_exllama, could change in optimum.

Sorry, I don't completely follow. Does this mean that use_exllama will no longer change and the test check is no longer required?

> I was thinking of opening a PR to remove this line, and maybe stop saving inference-related args (use_exllama, ...)

It depends. This can be considered a breaking change, as users might now expect these values in their configs. The most important thing is for old configs to still be loadable and produce the same result.

Member Author:

> Sorry, I don't completely follow. Does this mean that use_exllama will no longer change and the test check is no longer required?

Basically, the user can set use_exllama=True in transformers and this value can be changed in optimum (to use_exllama=False). However, since we no longer serialize it in the optimum GPTQ config, use_exllama will be set to its default value through config.quantization_config = GPTQConfig.from_dict_optimum(quantizer.to_dict()).

> It depends. This can be considered a breaking change, as users might now expect these values in their configs. The most important thing is for old configs to still be loadable and produce the same result.

Yes, old configs will still work; new users, however, will have to pass these args each time. I will probably work on the second option then: from the start, the user should not have to select the kernel, since we can switch from one to another.
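
A rough sketch of the round trip described above, assuming GPTQConfig.from_dict_optimum falls back to defaults for keys missing from the dict (the optimum_dict literal is illustrative, not optimum's exact output):

from transformers import GPTQConfig

# The user asks for the old CUDA kernel at load time.
user_config = GPTQConfig(bits=4, use_exllama=False)

# Illustrative result of optimum's to_dict() after the change:
# only the serialized keys, with no use_exllama entry.
optimum_dict = {"bits": 4, "group_size": 128}

# Rebuilding the config from that dict means use_exllama falls back to the
# library default rather than the user's False, per the discussion above.
rebuilt = GPTQConfig.from_dict_optimum(optimum_dict)
print(user_config.use_exllama, rebuilt.use_exllama)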

self.assertEqual(quantized_model_from_saved.config.quantization_config.bits, self.bits)
self.check_quantized_layers_type(quantized_model_from_saved, "exllama")
self.check_inference_correctness(quantized_model_from_saved)
@@ -279,10 +280,10 @@ class GPTQTestActOrderExllama(unittest.TestCase):
"""

EXPECTED_OUTPUTS = set()
EXPECTED_OUTPUTS.add("Hello my name is Katie and I am a 20 year")
model_name = "hf-internal-testing/Llama-2-7B-GPTQ"
revision = "gptq-4bit-128g-actorder_True"
input_text = "Hello my name is"
EXPECTED_OUTPUTS.add("Hello, how are you ? I'm doing good, thanks for asking.")
# 4bit + act_order + 128g
model_name = "hf-internal-testing/TinyLlama-1.1B-Chat-v0.3-GPTQ"
input_text = "Hello, how are you ?"

@classmethod
def setUpClass(cls):
@@ -292,7 +293,6 @@ def setUpClass(cls):
cls.quantization_config = GPTQConfig(bits=4, max_input_length=4028)
cls.quantized_model = AutoModelForCausalLM.from_pretrained(
cls.model_name,
revision=cls.revision,
torch_dtype=torch.float16,
device_map={"": 0},
quantization_config=cls.quantization_config,
@@ -336,7 +336,7 @@ def test_max_input_length(self):
self.quantized_model.generate(**inp, num_beams=1, min_new_tokens=3, max_new_tokens=3)
self.assertTrue("temp_state buffer is too small" in str(cm.exception))

prompt = "I am in Paris and" * 500
prompt = "I am in Paris and"
inp = self.tokenizer(prompt, return_tensors="pt").to(0)
self.assertTrue(inp["input_ids"].shape[1] < 4028)
self.quantized_model.generate(**inp, num_beams=1, min_new_tokens=3, max_new_tokens=3)
@@ -355,10 +355,10 @@ class GPTQTestExllamaV2(unittest.TestCase):
"""

EXPECTED_OUTPUTS = set()
EXPECTED_OUTPUTS.add("Hello my name is Katie and I am a 20 year")
model_name = "hf-internal-testing/Llama-2-7B-GPTQ"
revision = "gptq-4bit-128g-actorder_True"
input_text = "Hello my name is"
EXPECTED_OUTPUTS.add("Hello, how are you ? I'm doing good, thanks for asking.")
# 4bit + act_order + 128g
model_name = "hf-internal-testing/TinyLlama-1.1B-Chat-v0.3-GPTQ"
input_text = "Hello, how are you ?"

@classmethod
def setUpClass(cls):
@@ -368,7 +368,6 @@ def setUpClass(cls):
cls.quantization_config = GPTQConfig(bits=4, exllama_config={"version": 2})
cls.quantized_model = AutoModelForCausalLM.from_pretrained(
cls.model_name,
revision=cls.revision,
torch_dtype=torch.float16,
device_map={"": 0},
quantization_config=cls.quantization_config,