[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer #6051
Conversation
we should check the flashinfer version and raise if it's too old
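A minimal sketch of what such a guard could look like, assuming flashinfer exposes a __version__ attribute; the minimum version string below is a placeholder, not the actual requirement discussed here:

```python
# Hypothetical version guard; "0.0.8" is a placeholder minimum, not the real
# requirement, and this assumes flashinfer exposes __version__.
from packaging import version

MIN_FLASHINFER_VERSION = "0.0.8"

def check_flashinfer_version() -> None:
    try:
        import flashinfer
    except ImportError as e:
        raise RuntimeError("FlashInfer is not installed.") from e
    if version.parse(flashinfer.__version__) < version.parse(MIN_FLASHINFER_VERSION):
        raise RuntimeError(
            f"FlashInfer >= {MIN_FLASHINFER_VERSION} is required for "
            f"logits_soft_cap, but {flashinfer.__version__} is installed.")
```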
vllm/worker/model_runner.py (outdated):
logger.warning("Please use Flashinfer backend for models with" | ||
"logits_soft_cap (i.e., Gemma-2)." | ||
" Otherwise, the output might be wrong." | ||
" Set Flashinfer backend by " | ||
"export VLLM_ATTENTION_BACKEND=FLASHINFER.") |
we should just raise an exception IMO.
Co-authored-by: Simon Mo <simon.mo@hey.com>
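A sketch of the raise-instead-of-warn approach suggested above, reusing the names from the diff; the exact error message wording is illustrative:

```python
# Fail fast when the model needs logits_soft_cap but the selected attention
# backend is not FlashInfer (message wording is illustrative).
if logits_soft_cap is not None and self.attn_backend.get_name() != "flashinfer":
    raise ValueError(
        "Please use the FlashInfer backend for models with logits_soft_cap "
        "(i.e., Gemma-2); otherwise, the output might be wrong. Set it with: "
        "export VLLM_ATTENTION_BACKEND=FLASHINFER")
```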
vllm/worker/model_runner.py (outdated):
    logits_soft_cap = getattr(self.model_config.hf_config,
                              'final_logit_softcapping', None)
    if logits_soft_cap is not None and self.attn_backend.get_name(
    ) != "flashinfer":
Could I check if logits_soft_cap is supposed to be the attn_logit_softcapping value instead? The two values are different in the Gemma2 config:

    "attn_logit_softcapping": 50.0,
    "final_logit_softcapping": 30.0,
@yongxb Nice catch! final_logit_softcapping is used to cap the final logits before sampling. @LiuXiaoxuanPKU Could you please fix this? I think this warning can be removed to avoid confusion: vllm/vllm/model_executor/models/gemma2.py, line 140 in 7cd2ebb.
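A sketch of the corrected lookup: the attention kernel's cap comes from attn_logit_softcapping, while final_logit_softcapping only applies to the logits before sampling:

```python
# Read the attention-level soft cap (50.0 for Gemma-2) instead of the
# final-logits cap (30.0); only the former belongs in the attention backend.
logits_soft_cap = getattr(self.model_config.hf_config,
                          "attn_logit_softcapping", None)
```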
I am able to run and reproduce the reported MMLU scores for both the 9b and 27b models 👍 However, if I don't disable CUDA graph, vLLM crashes with an error.
Thanks for reporting! Could you give me a minimal reproducible example, since I can run Gemma-2 with FlashInfer + CUDA graph on my end? Thanks!
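For reference, a minimal script along these lines would exercise Gemma-2 with the FlashInfer backend and CUDA graphs enabled; the model, prompt, and sampling settings are just examples:

```python
# Minimal repro sketch: Gemma-2 with the FlashInfer backend and CUDA graphs
# enabled (enforce_eager=False). Model and prompt are illustrative.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b", enforce_eager=False)
params = SamplingParams(temperature=0.0, max_tokens=16)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```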
I am using the
I tried the script and data on H100, and it seems to work. Could you report your environment? FlashInfer only supports GPUs with compute capability 8.0 or higher (https://developer.nvidia.com/cuda-gpus). Not sure if that might be the problem.
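One quick way to check the compute capability of the local GPU (standard PyTorch API):

```python
# Print the GPU's compute capability; FlashInfer needs 8.0 (Ampere) or newer.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
```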
I am using H100 with CUDA 12.5. Can you try syncing your branch to the latest? #4412 might be related (it refactors …)
Yes, it's a merge conflict. Just fixed, please try again. Thanks!
Thanks for the fix. It works now, w/ or w/o CUDA graph.
Add logits_soft_cap for FlashInfer, which is needed by the Gemma2 model; also add a simple Gemma2 test.
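A rough sketch of what such a smoke test could look like; the model name, prompt, and assertion are assumptions, not the exact test added in this PR:

```python
# Hypothetical Gemma-2 smoke test under the FlashInfer backend; not the
# exact test added by this PR.
import os
import pytest

@pytest.mark.parametrize("model", ["google/gemma-2-9b"])
def test_gemma2_flashinfer(model):
    os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, dtype="bfloat16")
    params = SamplingParams(temperature=0.0, max_tokens=8)
    outputs = llm.generate(["Hello, my name is"], params)
    assert outputs[0].outputs[0].text.strip()
```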