Why use AscendScheduler in vLLM Ascend
Enabling AscendScheduler can accelerate inference when using the V1 engine.
AscendScheduler is a V0-style scheduling scheme that separates prefill and decode processing. With AscendScheduler enabled, V1 requests are divided into prefill requests, decode requests, and mixed requests. Because the attention operators used for pure prefill and pure decode batches perform better than the operator used for mixed batches, this separation brings a performance improvement.
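The prefill/decode split described above can be sketched as follows. This is an illustrative example, not vLLM Ascend's actual scheduler code; the `Request` class and `split_batch` function are hypothetical, assuming a request is a prefill while it still has uncomputed prompt tokens and a decode afterwards.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical per-request bookkeeping, for illustration only.
    num_prompt_tokens: int
    num_computed_tokens: int


def split_batch(requests):
    """Split a batch V0-style: prefills still have prompt tokens left to
    compute; decodes have finished their prompt and generate one token/step."""
    prefills = [r for r in requests if r.num_computed_tokens < r.num_prompt_tokens]
    decodes = [r for r in requests if r.num_computed_tokens >= r.num_prompt_tokens]
    return prefills, decodes


batch = [Request(8, 0), Request(4, 4), Request(16, 16)]
prefills, decodes = split_batch(batch)
# The first request is still prefilling; the other two are decoding.
```

Each sub-batch can then be dispatched to the attention operator that is fastest for its shape, which is where the speedup comes from.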
How to use AscendScheduler in vLLM Ascend
Adding ascend_scheduler_config to additional_config when creating an LLM enables AscendScheduler while using V1.
Please refer to the following example:
```python
import os

from vllm import LLM, SamplingParams

# Enable the V1 engine.
os.environ["VLLM_USE_V1"] = "1"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)

# Create an LLM with AscendScheduler enabled.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    additional_config={
        "ascend_scheduler_config": {},
    },
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Advanced
If you want to enable chunked-prefill in AscendScheduler, set additional_config={"ascend_scheduler_config": {"enable_chunked_prefill": True}}.
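For clarity, the config above expands to the following nested dict, which is passed straight through as the LLM's additional_config (the surrounding LLM call is shown commented out since it requires Ascend hardware to run):

```python
# The additional_config dict that turns on chunked prefill in AscendScheduler,
# using exactly the keys named in the text above.
additional_config = {
    "ascend_scheduler_config": {
        "enable_chunked_prefill": True,
    },
}

# from vllm import LLM
# llm = LLM(
#     model="Qwen/Qwen2.5-0.5B-Instruct",
#     additional_config=additional_config,
# )
```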
Note
Currently, enabling chunked-prefill may degrade performance.