[Kernel] Triton Configs for Fp8 Block Quantization #11589
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
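For background on what the config files in this PR do: vLLM ships tuned Triton kernel parameters as JSON files and picks the entry closest to the actual batch size at runtime. The sketch below illustrates that lookup pattern; the file schema, key names, and field names here are illustrative assumptions, not the exact vLLM layout.

```python
import json

# Illustrative tuned-config table: JSON keyed by batch size (M), with the
# Triton launch parameters as values. Keys and fields are assumptions,
# not the exact schema used by vLLM's fp8 block-quant kernels.
configs = json.loads("""
{
  "16":   {"BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "num_warps": 4},
  "1024": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "num_warps": 8}
}
""")

def pick_config(m: int) -> dict:
    """Return the config whose batch-size key is numerically closest to m."""
    key = min(configs, key=lambda k: abs(int(k) - m))
    return configs[key]

print(pick_config(32))   # nearest key is "16"
print(pick_config(900))  # nearest key is "1024"
```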
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
force-pushed from b2b509e to 6104dc3
Signed-off-by: mgoin <michael@neuralmagic.com>
@simon-mo This PR looks good and works well.
Looks good once the new json files are added to package_data in setup.py
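The `package_data` change could look like the following sketch; the glob path is an assumption and should match the directory where the new JSON configs actually live in this PR.

```python
# Hypothetical setup.py fragment: ship the new Triton config JSONs with the
# wheel so they are available at runtime. The path glob is an assumption.
package_data = {
    "vllm": ["model_executor/layers/quantization/utils/configs/*.json"],
}
```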
On 8xH200:
python3 benchmark_throughput.py --model $MODEL --tensor-parallel-size 8 --max-model-len 8192 --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100 --trust-remote-code --tokenizer deepseek-ai/DeepSeek-R1
>> Throughput: 2.27 requests/s, 1000.55 total tokens/s, 483.31 output tokens/s
python3 benchmark_throughput.py --model $MODEL --tensor-parallel-size 8 --max-model-len 8192 --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100 --trust-remote-code --tokenizer deepseek-ai/DeepSeek-R1
>> Throughput: 1.66 requests/s, 731.39 total tokens/s, 353.29 output tokens/s
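Assuming the first run is with this PR's configs and the second is the baseline (the two commands are identical, so the ordering is an assumption), the quoted numbers work out to roughly a 1.37x request-throughput gain:

```python
# Throughput figures quoted above; which run is PR vs. baseline is assumed.
pr_req_s = 2.27    # requests/s, first run
base_req_s = 1.66  # requests/s, second run

speedup = pr_req_s / base_req_s
print(f"speedup: {speedup:.2f}x")  # ~1.37x
```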
Excellent, LGTM
Mixtral looks good (performance and accuracy)
MODEL=/data/nm/models/DeepSeek-R1
lm_eval --model vllm \
--model_args "pretrained=$MODEL,tokenizer=deepseek-ai/DeepSeek-R1,tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.95,trust_remote_code=True,max_model_len=16384,trust_remote_code=True" \
--task gsm8k --batch_size 100
vllm (pretrained=/data/nm/models/DeepSeek-R1,tokenizer=deepseek-ai/DeepSeek-R1,tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.95,trust_remote_code=True,max_model_len=16384,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9538|± |0.0058|
| | |strict-match | 5|exact_match|↑ |0.9538|± |0.0058|
vllm (pretrained=/data/nm/models/DeepSeek-R1,tokenizer=deepseek-ai/DeepSeek-R1,tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.95,trust_remote_code=True,max_model_len=16384,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9462|± |0.0062|
| | |strict-match | 5|exact_match|↑ |0.9462|± |0.0062|
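The two GSM8K scores above (0.9538 vs. 0.9462) differ by well under two combined standard errors, so the accuracy difference between the runs is not statistically significant; a quick check:

```python
import math

# exact_match scores and stderrs from the two lm_eval runs above.
a, se_a = 0.9538, 0.0058
b, se_b = 0.9462, 0.0062

# z-score of the difference under independent-run assumption.
z = abs(a - b) / math.sqrt(se_a**2 + se_b**2)
print(f"z = {z:.2f}")  # well under 2, i.e. within noise
```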
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
SUMMARY: