[Fix] fix quantization arg when using marlin #3319
Conversation
@zhuohan123 @WoosukKwon please, could you help me merge it?
After the fix is merged, it runs normally.
Thanks for your contribution! @DreamTeamWangbowen Could you fix the code style following CONTRIBUTING.md?
Okay, I'll submit a fix for it.
Please use the formatting script described in CONTRIBUTING.md.
@DreamTeamWangbowen Do we need this, btw? IIUC, the Marlin kernel is automatically used for GPTQ models when the condition in vllm/config.py is met.
I've finished formatting my code.
Yes, we need it. I did not find the act_order parameter in the code or in the model configuration file. The model I am using is https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin
@DreamTeamWangbowen IIUC, the Marlin kernel should be automatically used (without specifying --quantization), because of the check at lines 174 to 176 in 654865e.
While the condition was not actually about ...
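As a rough illustration, this is the kind of auto-detection check being referred to. A simplified sketch only, with assumed function and field names; not the actual vllm/config.py source at 654865e:

```python
# Simplified, illustrative sketch of quantization auto-detection;
# NOT the actual vllm/config.py code. Field names are assumptions.
def resolve_quantization(user_quantization, hf_quantization_config):
    # Quantization method recorded in the model's HF config, if any.
    hf_method = None
    if hf_quantization_config is not None:
        hf_method = str(hf_quantization_config.get("quant_method", "")).lower()

    # If the user did not pass --quantization, fall back to the method the
    # checkpoint itself declares (e.g. "marlin" for a Marlin-serialized model).
    if user_quantization is None:
        return hf_method

    # If the user-supplied method disagrees with the checkpoint, verification
    # fails -- this is the kind of error reported later in this thread.
    if hf_method and user_quantization.lower() != hf_method:
        raise ValueError(
            f"Quantization method specified in the model config ({hf_method}) "
            f"does not match the --quantization argument ({user_quantization}).")
    return user_quantization.lower()
```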
If I do not specify marlin but specify gptq, the following error occurs:
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 130, in __init__
Look at the code at lines 178 to 185 in 654865e; there is a check there.
Therefore we need to add marlin to the accepted values of the quantization argument.
There may be another way to fix it: change self.quantization = "gptq" to self.quantization = "marlin" at line 177 in 654865e.
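Roughly, the two options look like this. A hypothetical sketch, not the actual patch; the supported-method list and the `checkpoint_is_marlin_format` parameter are assumptions used only for illustration:

```python
# Illustrative sketch of the two options above; not the actual vLLM patch.
SUPPORTED_QUANTIZATION = ["awq", "gptq", "squeezellm", "marlin"]  # option 1: also accept "marlin"


def verify_quantization(quantization, checkpoint_is_marlin_format):
    if quantization is not None and quantization not in SUPPORTED_QUANTIZATION:
        raise ValueError(
            f"Unknown quantization method: {quantization}. "
            f"Must be one of {SUPPORTED_QUANTIZATION}.")
    # Option 2: if the checkpoint on disk is serialized in Marlin format,
    # treat a user-supplied "gptq" as "marlin" instead of raising a mismatch.
    if quantization == "gptq" and checkpoint_is_marlin_format:
        quantization = "marlin"
    return quantization
```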
@DreamTeamWangbowen I think you can run your server without passing the quantization argument.
Yes, but if the quantization parameter is specified as gptq or marlin, an error occurs when the server runs.
Marlin kernels use a special serialization method that is different from exllama, so the model must be saved on disk in Marlin format to be loaded by vLLM. vLLM currently does not support converting formats on the fly (though this is something we are working on).
I added the functionality to save models in Marlin format to AutoGPTQ. Here's an example: ...
Also, here's a model I saved in this format: ...
I will add a doc to vLLM about this.
Note: Marlin currently requires ...
@WoosukKwon FYI, passing ...
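For reference, the conversion workflow described here might look roughly like this. A sketch only: it assumes an AutoGPTQ build with Marlin support, and the `use_marlin` flag, placeholder model id, and save behavior are assumptions rather than something confirmed in this thread (see the AutoGPTQ example referenced above for the authoritative version):

```python
# Rough sketch of converting a GPTQ checkpoint to Marlin format on disk.
# Assumes an AutoGPTQ build with Marlin support; flag names are assumptions.
from auto_gptq import AutoGPTQForCausalLM

# Load an existing GPTQ checkpoint and repack its weights for the Marlin kernel.
model = AutoGPTQForCausalLM.from_quantized(
    "some-org/some-gptq-model",  # placeholder model id
    use_marlin=True,
    device="cuda:0",
)

# Write the repacked weights back to disk so vLLM can load them directly,
# since vLLM does not convert formats on the fly.
model.save_quantized("./some-model-marlin")
```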
Yeah, the model I am using is in Marlin format: https://huggingface.co/neuralmagic/Nous-Hermes-2-Yi-34B-marlin
Nice - vLLM will use Marlin by default if you pass this model. You do not need to pass the quantization argument.
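For example, a minimal sketch using vLLM's Python LLM API and the model linked above; the prompt and sampling settings are arbitrary:

```python
# Minimal sketch: running the Marlin-serialized checkpoint with vLLM's
# Python API, without passing any quantization argument.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Nous-Hermes-2-Yi-34B-marlin")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```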
Yes, the Marlin model I am using is the one linked above. But I understand that the ...
@DreamTeamWangbowen IIUC, Marlin is not a quantization method. It's a fast kernel implementation for GPTQ. I've updated the PR so that it fixes the bug described above.
Yeah, you are right. Thank you very much. :)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
#3331 fix when using marlin model