
[Fix] fix quantization arg when using marlin #3319

Merged (6 commits) on Mar 13, 2024

Conversation

@DreamTeamWangbowen (Contributor) commented Mar 11, 2024

Fixes #3331: fix the quantization argument when using a Marlin-format model.

DreamTeamWangbowen changed the title from "when using marlin, fix quantization argument" to "[fix]when using marlin, fix quantization argument" on Mar 11, 2024
DreamTeamWangbowen changed the title from "[fix]when using marlin, fix quantization argument" to "[Fix] when using marlin, fix quantization argument" on Mar 11, 2024
@DreamTeamWangbowen (Contributor Author) commented Mar 11, 2024

@zhuohan123 @WoosukKwon Could you please help merge this?

@DreamTeamWangbowen (Contributor Author)

After the fix is merged, the model can run normally.

@esmeetu (Collaborator) commented Mar 11, 2024

Thanks for your contribution, @DreamTeamWangbowen! Could you fix the code style following CONTRIBUTING.md?

DreamTeamWangbowen changed the title from "[Fix] when using marlin, fix quantization argument" to "[Fix] fix quantization arg when using marlin" on Mar 12, 2024
@DreamTeamWangbowen (Contributor Author)

Thanks for your contribution, @DreamTeamWangbowen! Could you fix the code style following CONTRIBUTING.md?

Okay, I will submit a new commit and fix it.

@esmeetu (Collaborator) commented Mar 12, 2024

Please use sh format.sh to format your code. Then I can merge this.

@WoosukKwon (Collaborator)

@DreamTeamWangbowen Do we need this btw? IIUC, the Marlin kernel is automatically used for GPTQ models when the condition is met (act_order=False, etc.).

@DreamTeamWangbowen (Contributor Author)

Please use sh format.sh to format your code. Then I can merge this.

I've finished formatting my code.

@DreamTeamWangbowen (Contributor Author) commented Mar 12, 2024

@DreamTeamWangbowen Do we need this btw? IIUC, the Marlin kernel is automatically used for GPTQ models when the condition is met (act_order=False, etc.).

Yes, we need it. I did not find the act_order parameter in the code or the model configuration file.

The model address I use is https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin

@WoosukKwon (Collaborator)

Yes, we need it. I did not find the act_order parameter in the code or the model configuration file.
The model address I use is https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin

@DreamTeamWangbowen IIUC, the Marlin kernel should be automatically used (without specifying quantization=marlin):

vllm/vllm/config.py

Lines 174 to 176 in 654865e

if (hf_quant_method == "gptq"
        and "is_marlin_format" in hf_quant_config
        and hf_quant_config["is_marlin_format"]):

While the condition is not actually about act_order (sorry for the wrong information), the model configuration file meets the above condition.
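
For reference, a sketch of the kind of quantization config this check matches, written as a Python dict; the is_marlin_format key comes from the snippet above, while the other keys and all values are illustrative assumptions, not copied from the actual model:

# Illustrative only: the rough shape of hf_quant_config for a Marlin-serialized GPTQ checkpoint.
hf_quant_config = {
    "quant_method": "gptq",    # underlying quantization method (assumed key name)
    "is_marlin_format": True,  # set when the checkpoint is serialized for the Marlin kernel
    "group_size": 128,         # Marlin's current requirement, mentioned later in this thread
    "desc_act": False,         # i.e. act_order=False
}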

@DreamTeamWangbowen (Contributor Author)

Yes, we need it. I did not find the act_order parameter in the code or the model configuration file.
The model address I use is https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin

@DreamTeamWangbowen IIUC, the Marlin kernel should be automatically used (without specifying quantization=marlin):

vllm/vllm/config.py

Lines 174 to 176 in 654865e

if (hf_quant_method == "gptq"
        and "is_marlin_format" in hf_quant_config
        and hf_quant_config["is_marlin_format"]):

While the condition is not actually about act_order (sorry for the wrong information), the model configuration file meets the above condition.

If I do not specify marlin but specify gptq, the following error occurs:

File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 130, in init
self._verify_quantization()
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 204, in _verify_quantization
raise ValueError(
ValueError: Quantization method specified in the model config (marlin) does not match the quantization method specified in the quantization argument (gptq).
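
For context, a minimal reproduction sketch using the vLLM Python offline API (an assumption about how the engine is invoked here; the server path hits the same check):

# Reproduction sketch: explicitly passing quantization="gptq" for a checkpoint
# whose config is detected as Marlin format trips the mismatch check above.
from vllm import LLM

llm = LLM(model="neuralmagic/Nous-Hermes-2-Yi-34B-marlin",
          quantization="gptq")  # raises the ValueError shown above (before this PR)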

@DreamTeamWangbowen (Contributor Author) commented Mar 12, 2024

Yes, we need it. I did not find the act_order parameter in the code or the model configuration file.
The model address I use is https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin

@DreamTeamWangbowen IIUC, the Marlin kernel should be automatically used (without specifying quantization=marlin):

vllm/vllm/config.py

Lines 174 to 176 in 654865e

if (hf_quant_method == "gptq"
        and "is_marlin_format" in hf_quant_config
        and hf_quant_config["is_marlin_format"]):

While the condition is not actually about act_order (sorry for the wrong information), the model configuration file meets the above condition.

Look at the code here, lines 178 to 185; there is a check here:

vllm/vllm/config.py

Lines 178 to 185 in 654865e

if self.quantization is None:
    self.quantization = hf_quant_method
elif self.quantization != hf_quant_method:
    raise ValueError(
        "Quantization method specified in the model config "
        f"({hf_quant_method}) does not match the quantization "
        f"method specified in the `quantization` argument "
        f"({self.quantization}).")

Therefore we need to add marlin to the allowed values of the quantization argument.

There may be another way to modify it: at line 177, where hf_quant_method is set to "marlin", also change a user-specified self.quantization="gptq" to self.quantization="marlin":

hf_quant_method = "marlin"
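
A minimal sketch of that alternative (not the exact change that was merged; the surrounding logic is paraphrased from the config.py snippets quoted above):

# Sketch: when a GPTQ checkpoint is serialized in Marlin format, also accept a
# user-supplied quantization="gptq" instead of raising the mismatch error below.
if (hf_quant_method == "gptq"
        and hf_quant_config.get("is_marlin_format", False)):
    hf_quant_method = "marlin"
    if self.quantization == "gptq":
        self.quantization = "marlin"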

@esmeetu (Collaborator) commented Mar 12, 2024

@DreamTeamWangbowen I think you can run your server without passing the -q or --quantization argument, since vLLM will detect the quantization method from config.json.

@DreamTeamWangbowen (Contributor Author)

@DreamTeamWangbowen I think you can run your server without passing the -q or --quantization argument, since vLLM will detect the quantization method from config.json.

Yes, but if the quantization parameter is explicitly specified as gptq or marlin, an error occurs when running the server.

@robertgshaw2-redhat (Collaborator) commented Mar 12, 2024

@DreamTeamWangbowen

Marlin kernels use a special serialization format that is different from exllama's, so the model must be saved on disk in Marlin format to be loaded by vLLM. vLLM currently does not support converting formats on the fly (though this is something we are working on).

I added the functionality to save models in Marlin format to AutoGPTQ. Here's an example:

Also here's a model I saved in this format:

I will add a doc to vLLM about this

Note: Marlin currently requires group_size=128 and act_order=False. We are working on expanding this

@WoosukKwon FYI

Passing marlin to this argument will not work
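
As a rough illustration of the AutoGPTQ export path mentioned above, a hedged sketch; the use_marlin flag and the source repo name are assumptions and may differ between AutoGPTQ versions:

# Hedged sketch: load an existing GPTQ checkpoint with the Marlin kernel enabled
# (which repacks the weights), then save it so the config carries the Marlin format.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "some-org/some-model-gptq",  # hypothetical repo quantized with group_size=128, act_order=False
    device="cuda:0",
    use_marlin=True,             # assumed flag enabling the Marlin kernel/serialization
)
model.save_quantized("./some-model-marlin")  # write the Marlin-format checkpoint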

@DreamTeamWangbowen (Contributor Author)

@DreamTeamWangbowen

Marlin kernels use a special serialization format that is different from exllama's, so the model must be saved on disk in Marlin format to be loaded by vLLM. vLLM currently does not support converting formats on the fly (though this is something we are working on).

I added the functionality to save models in Marlin format to AutoGPTQ. Here's an example:

Also here's a model I saved in this format:

I will add a doc to vLLM about this

Note: Marlin currently requires group_size=128 and act_order=False. We are working on expanding this

@WoosukKwon FYI

Passing marlin to this argument will not work

Yeah, the model I use is in Marlin format: https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin

@robertgshaw2-redhat (Collaborator)

@DreamTeamWangbowen
Marlin kernels use a special serialization format that is different from exllama's, so the model must be saved on disk in Marlin format to be loaded by vLLM. vLLM currently does not support converting formats on the fly (though this is something we are working on).
I added the functionality to save models in Marlin format to AutoGPTQ. Here's an example:

Also here's a model I saved in this format:

I will add a doc to vLLM about this
Note: Marlin currently requires group_size=128 and act_order=False. We are working on expanding this
@WoosukKwon FYI
Passing marlin to this argument will not work

Yeah, the model I use is in Marlin format: https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin

Nice - vLLM will use Marlin by default if you pass this model. You do not need to pass the --quantization argument explicitly.
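
A minimal usage sketch of that, using the offline Python API as a stand-in for however the server is launched:

# vLLM detects the Marlin serialization from the checkpoint's config, so no
# quantization argument is needed.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Nous-Hermes-2-Yi-34B-marlin")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)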

@DreamTeamWangbowen (Contributor Author) commented Mar 13, 2024

@DreamTeamWangbowen
Marlin kernels use a special serialization format that is different from exllama's, so the model must be saved on disk in Marlin format to be loaded by vLLM. vLLM currently does not support converting formats on the fly (though this is something we are working on).
I added the functionality to save models in Marlin format to AutoGPTQ. Here's an example:

Also here's a model I saved in this format:

I will add a doc to vLLM about this
Note: Marlin currently requires group_size=128 and act_order=False. We are working on expanding this
@WoosukKwon FYI
Passing marlin to this argument will not work

Yeah, the model I use is in Marlin format: https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin

Nice - vLLM will use Marlin by default if you pass this model. You do not need to pass the --quantization argument explicitly.

Yes, the Marlin model I am using has act_order=False and group_size=128.

But my understanding is that the -q argument is how one tells vLLM which quantization method to use, such as awq or gptq, so I added marlin as an option, for example to tell vLLM to use MarlinLinearMethod.

@WoosukKwon (Collaborator) commented Mar 13, 2024

@DreamTeamWangbowen IIUC, Marlin is not a quantization method. It's a fast kernel implementation for GPTQ.

I've updated the PR so that it fixes the bug where quantization="gptq" raises an error when Marlin is enabled.

WoosukKwon merged commit b167109 into vllm-project:main on Mar 13, 2024
3 checks passed
@DreamTeamWangbowen (Contributor Author) commented Mar 13, 2024

@DreamTeamWangbowen IIUC, Marlin is not a quantization method. It's a fast kernel implementation for GPTQ.

I've updated the PR so that it fixes the bug where quantization="gptq" raises an error when Marlin is enabled.

Yeah, you are right, thank you very much. :)

starmpcc pushed a commit to starmpcc/vllm that referenced this pull request Mar 14, 2024
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>