[Doc]: BNB 8 bit quantization is undocumented #10723
Comments
Indeed, please feel free to contribute this. Thank you very much!
@jeejeelee I am actually unsure about the usage myself. I was hoping someone could help me out with that. I've seen the PR where 8-bit was introduced, but wasn't able to figure out which arguments I must change when calling LLM().
I did ask the author of the PR for clarification: #7445 (comment)
IIUC, you don't need to set a specific argument (see: https://github.com/vllm-project/vllm/blob/main/tests/quantization/test_bitsandbytes.py#L24), like:

```python
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    load_format="bitsandbytes",
    quantization="bitsandbytes",
)
```
@jeejeelee the code you shared works to give an 8-bit quantized BNB model when the model ID already points to a pre-quantized 8-bit BNB checkpoint. But, as described in the docs, vLLM also supports in-flight quantization, which takes the base full-precision model ID and returns a 4-bit BNB quantized model. To achieve this you run the same code from your comment but give a full-precision model path; though you never mention the precision in this call, it always returns a 4-bit quantized version. In-flight quantization is also supported in HuggingFace, which, on the other hand, lets you choose between 4-bit and 8-bit. vLLM does the in-flight BNB quantization using its own loader, whose definition does not seem to expose a way to request 8-bit.
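To make the distinction concrete, here is a minimal sketch of the two call patterns discussed above. The model IDs are placeholders, and it assumes a vLLM version where `load_format="bitsandbytes"` is still passed alongside `quantization="bitsandbytes"`, as in the snippet above:

```python
from vllm import LLM

# Case 1: the checkpoint is ALREADY quantized with BNB in 8-bit.
# vLLM picks the precision up from the checkpoint's quantization config,
# so this call yields an 8-bit model.
llm_8bit = LLM(
    model="some-org/llama-3-8b-bnb-8bit",  # hypothetical pre-quantized repo
    load_format="bitsandbytes",
    quantization="bitsandbytes",
    trust_remote_code=True,
)

# Case 2: the checkpoint is full precision, so vLLM quantizes in-flight.
# The call is identical, but the result is always 4-bit; there is no
# argument to request 8-bit here.
llm_4bit = LLM(
    model="meta-llama/Meta-Llama-3-8B",  # full-precision checkpoint
    load_format="bitsandbytes",
    quantization="bitsandbytes",
    trust_remote_code=True,
)
```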
Currently, vLLM only supports 4-bit for in-flight quantization, see: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/model_loader/loader.py#L997.
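For contrast, a sketch of the HuggingFace side mentioned above, where the in-flight precision is an explicit choice. This uses the standard `transformers` `BitsAndBytesConfig` API; the model ID is a placeholder:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# In Transformers, 8-bit vs 4-bit in-flight quantization is selected
# explicitly on the quantization config.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
# bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # 4-bit alternative

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # full-precision checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```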
Should I close this then?
Could you please submit a PR to clarify in the documentation that in-flight quantization only supports 4-bit? Thanks very much!
The documentation does say that: "There is currently no support for Inflight 8bit quantization."
📚 The doc issue
BNB 8-bit quantization is apparently supported as of #7445, but there is no detail on how to load in 8-bit on the BNB documentation page.
Suggest a potential alternative/fix
Give an example of using `load_in_4bit`/`load_in_8bit` on the documentation page.