[Spec Decode] feat: support LoRA with speculative decoding #11966
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 0d00454 to 4224f0f
Thanks for the contribution! Overall LGTM.
For the test, can you change the prompt so that it passes without numerical issues?
Yes, I tried different prompts, but they produced slightly different results even with greedy decoding. I wanted to compute the LoRA part in FP32 to rule out numerical issues, but it seems LoRA weights only support FP16 or BF16. I'm not sure whether the differences come from my code, some other problem, or numerical precision. I will investigate further.
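For context, here is a minimal sketch of the kind of greedy-equivalence check being discussed. This is not the PR's actual test: the model, adapter path, prompt, and token budget below are placeholder assumptions.

```python
# Sketch only: compare greedy LoRA output with and without speculative
# decoding. MODEL, LORA_PATH, and the prompt are placeholder assumptions.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

MODEL = "meta-llama/Llama-2-7b-hf"        # placeholder base model
LORA_PATH = "/path/to/sql-lora-adapter"   # placeholder adapter path
prompt = "Write a SQL query that selects all users."
greedy = SamplingParams(temperature=0.0, max_tokens=64)
lora = LoRARequest("sql-lora", 1, LORA_PATH)

# Reference run: LoRA without speculative decoding.
llm = LLM(model=MODEL, enable_lora=True)
ref = llm.generate([prompt], greedy, lora_request=lora)[0].outputs[0].text
del llm

# LoRA with speculative decoding; under greedy sampling the output should
# match the reference, up to the numerical issues mentioned above.
llm = LLM(model=MODEL, enable_lora=True,
          speculative_model="JackFram/llama-68m",
          num_speculative_tokens=3)
spec = llm.generate([prompt], greedy, lora_request=lora)[0].outputs[0].text
del llm

assert ref == spec
```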
```diff
@@ -367,7 +367,7 @@ def _create_single_target_seq_group_metadata(
         block_tables={
             target_seq_id: seq_group_metadata.block_tables[seq_id],
         },
-        lora_request=None,
+        lora_request=seq_group_metadata.lora_request,
```
@LiuXiaoxuanPKU I found the reason why the test failed: the scorers for both batch expansion and MQA scoring set lora_request to None. I fixed this and the test now passes.
In the previous test results, only the prefill stage utilized the LoRA request, while the decode stage did not apply the LoRA operation.
```diff
@@ -57,7 +57,7 @@ def score_proposals(
         block_tables={
             target_seq_id: seq_group_metadata.block_tables[seq_id],
         },
-        lora_request=None,
+        lora_request=seq_group_metadata.lora_request,
```
Same fix as with batch expansion.
```python
del engine

# run speculative decoding with mqa scorer.
engine_args = EngineArgs(model=MODEL_PATH,
```
I added a test specifically for MQA scoring.
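For readers following along, here is a hedged sketch of what such an MQA-scorer test might look like, in the style of the snippet above. It is not the PR's actual test code: MODEL_PATH, LORA_PATH, the prompt, and the LoRA settings are assumptions.

```python
# Hedged sketch, not the PR's test: drive an LLMEngine configured for
# speculative decoding with LoRA. MODEL_PATH / LORA_PATH are placeholders.
from vllm import EngineArgs, LLMEngine, SamplingParams
from vllm.lora.request import LoRARequest

MODEL_PATH = "meta-llama/Llama-2-7b-hf"   # placeholder
LORA_PATH = "/path/to/sql-lora-adapter"   # placeholder

# Run speculative decoding with the MQA scorer (used by default when the
# attention backend supports it; batch expansion is the fallback path
# exercised by the earlier test).
engine_args = EngineArgs(model=MODEL_PATH,
                         enable_lora=True,
                         max_loras=4,
                         speculative_model="JackFram/llama-68m",
                         num_speculative_tokens=3)
engine = LLMEngine.from_engine_args(engine_args)

engine.add_request("req-0",
                   "Write a SQL query that selects all users.",
                   SamplingParams(temperature=0.0, max_tokens=32),
                   lora_request=LoRARequest("sql-lora", 1, LORA_PATH))
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.outputs[0].text)
del engine
```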
Force-pushed from 7cf636f to aaca3a5
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from e4c599a to 6a9c8a0
@LiuXiaoxuanPKU @sroy745
Force-pushed from 1b75337 to 60811d4
Test error log:

```
[2025-02-22T11:46:06Z] metrics/test_metrics.py::test_async_engine_log_metrics_regression[True-4-half-distilbert/distilgpt2] ERROR 02-22 03:46:06 config.py:102] Error retrieving file list: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/models/distilbert/distilgpt2/tree/main?recursive=True&expand=False, retrying 1 of 2
[2025-02-22T11:46:18Z] ERROR 02-22 03:46:18 config.py:100] Error retrieving file list: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/models/distilbert/distilgpt2/tree/main?recursive=True&expand=False
[2025-02-22T11:46:18Z] FAILED
...
[2025-02-22T11:50:47Z] metrics/test_metrics.py:181:
[2025-02-22T11:50:47Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py:639: in from_engine_args
[2025-02-22T11:50:47Z]     engine_config = engine_args.create_engine_config(usage_context)
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py:1144: in create_engine_config
[2025-02-22T11:50:47Z]     model_config = self.create_model_config()
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py:1064: in create_model_config
[2025-02-22T11:50:47Z]     return ModelConfig(
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/config.py:314: in __init__
[2025-02-22T11:50:47Z]     hf_config = get_config(self.model, trust_remote_code, revision,
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py:256: in get_config
[2025-02-22T11:50:47Z]     if is_gguf or file_or_path_exists(
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py:179: in file_or_path_exists
[2025-02-22T11:50:47Z]     return file_exists(str(model),
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py:154: in file_exists
[2025-02-22T11:50:47Z]     file_list = list_repo_files(repo_id,
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py:143: in list_repo_files
[2025-02-22T11:50:47Z]     return with_retry(lookup_files, "Error retrieving file list")
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py:97: in with_retry
[2025-02-22T11:50:47Z]     return func()
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py:133: in lookup_files
[2025-02-22T11:50:47Z]     return hf_list_repo_files(repo_id,
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py:114: in _inner_fn
[2025-02-22T11:50:47Z]     return fn(*args, **kwargs)
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/huggingface_hub/hf_api.py:2945: in list_repo_files
[2025-02-22T11:50:47Z]     for f in self.list_repo_tree(
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/huggingface_hub/hf_api.py:3080: in list_repo_tree
[2025-02-22T11:50:47Z]     for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
[2025-02-22T11:50:47Z] /usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_pagination.py:37: in paginate
[2025-02-22T11:50:47Z]     hf_raise_for_status(r)
[2025-02-22T11:50:47Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
```
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Force-pushed from 60811d4 to c6bc6f1
Summary

This PR adds support for LoRA with speculative decoding.

Implementation

SpecDecodeWorker [...]. If this adjustment is not made, the following error occurs: [error log not captured].

Test

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8080 \
    --disable-custom-all-reduce \
    --swap-space 0 \
    --gpu-memory-utilization 0.9 \
    --enable-lora \
    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/ \
    --speculative_model JackFram/llama-68m \
    --num_speculative_tokens 3
```
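Once the server above is running, a request like the following exercises the LoRA adapter together with speculative decoding through the OpenAI-compatible completions endpoint. This is an illustrative sketch; the prompt and token budget are placeholders.

```python
# Illustrative only: query the sql-lora module (registered above via
# --lora-modules) through the OpenAI-compatible endpoint on port 8080.
import json
import urllib.request

payload = {
    "model": "sql-lora",
    "prompt": "Write a SQL query that selects all users.",  # placeholder
    "max_tokens": 64,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```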