
Conversation

Contributor

@eeethenQ eeethenQ commented Mar 29, 2025

What this PR does / why we need it?

Adapt Disaggregated Prefill feature onto Ascend device

Does this PR introduce any user-facing change?

no

How was this patch tested?

The test usage has been provided along with the PR, in examples/offline_disaggregated_prefill_npu.py
To run it, do this

export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py

"llm.SyncKvCacheWaitTime": "5000",
}
if self.role == LLMRole.PROMPT:
options["ge.exec.deviceId"] = str(self.local_rank)


There is a problem with the deviceId here: using a PROMPT_DEVICE_ID that does not start from 0 causes an error.

Contributor Author

There is a problem with the deviceId here: using a PROMPT_DEVICE_ID that does not start from 0 causes an error.

This is controlled in the test script via os.environ["ASCEND_RT_VISIBLE_DEVICES"]. PROMPT_DEVICE_ID and DECODE_DEVICE_ID should be kept consistent with the os.environ["ASCEND_RT_VISIBLE_DEVICES"] values set in the prompt and decode processes of the test script.
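A minimal sketch of that setup (the process structure and default device lists below are assumptions for illustration, not the exact example code):

```
# Hedged sketch, not the actual example script: each worker process pins its
# visible NPUs from the corresponding environment variable before building the engine.
import os
from multiprocessing import Process


def run_prefill():
    # The prefill (prompt) process should only see the PROMPT_DEVICE_ID devices.
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = os.environ.get("PROMPT_DEVICE_ID", "0,1")
    ...  # build the kv_producer LLM and run generate()


def run_decode():
    # The decode process should only see the DECODE_DEVICE_ID devices.
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = os.environ.get("DECODE_DEVICE_ID", "2,3")
    ...  # build the kv_consumer LLM and run generate()


if __name__ == "__main__":
    procs = [Process(target=run_prefill), Process(target=run_decode)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```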

self.hidden_cache = torchair.llm_datadist.create_npu_tensors(
    hidden_desc.shape, kv_hidden_dtype, hidden_buffer_addrs)

key_cache_key = CacheKeyByIdAndIndex(self.cluster.remote_cluster_id, 1,


When this interface is used, will the remote end fail to release the cache?

Contributor Author

When this interface is used, will the remote end fail to release the cache?

llmdatadist releases it automatically.

@wangxiyuan
Collaborator

Great job! Please make the CI happy first before a detailed review. Thanks.

local_rank: int = -1,
backend: str = "hccl") -> None:
"""Initialize the distributed environment."""
parallel_config = vllm_config.parallel_config
Collaborator

Just use self.parallel_config here; there is no need to get it from vllm_config again. self.parallel_config is updated in __init__, so if you use parallel_config = vllm_config.parallel_config, some values may be missed.

Contributor Author

Just use self.parallel_config here; there is no need to get it from vllm_config again. self.parallel_config is updated in __init__, so if you use parallel_config = vllm_config.parallel_config, some values may be missed.

Fixed
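A hedged sketch of the suggested pattern (the class and attribute wiring are illustrative, not the PR's actual worker code):

```
# Hedged sketch of the reviewer's suggestion, not the exact diff applied in the PR.
class _WorkerSketch:
    def __init__(self, vllm_config):
        # __init__ may replace or adjust the parallel config after copying it,
        # so self.parallel_config is the authoritative copy afterwards.
        self.vllm_config = vllm_config
        self.parallel_config = vllm_config.parallel_config

    def init_distributed_environment(self, backend="hccl"):
        """Initialize the distributed environment."""
        # Suggested: use the attribute updated in __init__ rather than
        # self.vllm_config.parallel_config, which may miss those updates.
        parallel_config = self.parallel_config
        return parallel_config
```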

distributed_init_method: Optional[str] = None,
local_rank: int = -1) -> None:
"""Initialize the distributed environment."""
parallel_config = vllm_config.parallel_config
Collaborator

ditto


@classmethod
def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
    from vllm.config import CompilationLevel  # noqa: E402
Collaborator

Any reason for this change?

Contributor Author

Any reason for this change?

#462
Same as that PR, to fix the e2e test failure in CI.


import torch
import torch_npu
import torchair # type: ignore
Collaborator

What's the difference between torchair.llm_datadist and llm_datadist?


Contributor Author

What's the difference between torchair.llm_datadist and llm_datadist?

This has been confirmed: the llm_datadist inside torchair is only the interface for creating torch tensors.

Collaborator

How about just from torchair.llm_datadist import create_npu_tensors ?

'{"kv_connector":"AscendHcclConnector","kv_buffer_device":"npu","kv_role":"kv_producer", "kv_parallel_size":2}'
)

# Set GPU memory utilization to 0.8 for an A6000 GPU with 40GB
Collaborator

rename to NPU or remove the comment

Contributor Author

rename to NPU or remove the comment

fixed

)

# Set GPU memory utilization to 0.8 for an A6000 GPU with 40GB
# memory. You may need to adjust the value to fit your GPU.
Collaborator

ditto

Contributor Author

ditto

fixed

hidden = self.hidden_cache

# enumerate different requests
for idx, slen in enumerate(seq_lens):
Contributor

Looking at the code, send/recv are both handled per batch. How does the decoder guarantee that the requests of one prefill batch end up in the decoder as exactly the same batch, in the same order?

Contributor Author

Looking at the code, send/recv are both handled per batch. How does the decoder guarantee that the requests of one prefill batch end up in the decoder as exactly the same batch, in the same order?

In terms of input, the decoder's input is expected to match the prefill node's; that is, in examples/offline_disaggregated_prefill_npu.py the input prompts of the prefill process and the decode process are identical.

If for some reason the two do not match, the check at line 418 fails, bypass_model_exec is set to False, and the decoder node recomputes the first token.

Contributor

@lidenghui1110 lidenghui1110 Apr 10, 2025

the decoder's input is expected to match the prefill node's

That does not seem reasonable. Prefill and decode both process requests per batch, but the two sides form batches independently, so the n requests of one prefill batch may not reach the decode side at the same time for all sorts of reasons. Even if they happen to arrive together and land in one batch, there is no mechanism guaranteeing that the request order in that batch matches the order on the prefill side. So shouldn't the KV transfer be done per request rather than for the whole batch at once?
examples/offline_disaggregated_prefill_npu.py only uses a single prompts list. What if multiple requests are issued and form a batch? Then every decode would very likely end up with bypass_model_exec = False and recompute the first token, which basically wastes the prefill.

@heartStrive1998 heartStrive1998 Apr 11, 2025

The KV cache is indeed sent at per-request granularity; llm_datadist just needs a batch id during data transfer. When sending, what we have is the tensor of a single request. To transfer it, we manually add a batch dimension (e.g. input_shape = (1, input_tokens_tensor.shape[0], 1, 1)), but the batch size is 1, so it is still per-request granularity in practice.
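A minimal illustration of that point (the token values below are invented; only the shape handling follows the explanation above):

```
# Hedged illustration of the per-request batch dimension described above.
import torch

input_tokens_tensor = torch.arange(16)                  # tokens of one request
input_shape = (1, input_tokens_tensor.shape[0], 1, 1)   # llm_datadist wants a batch axis
per_request_view = input_tokens_tensor.reshape(input_shape)
print(per_request_view.shape)                           # torch.Size([1, 16, 1, 1]), batch size 1
```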

@heartStrive1998 heartStrive1998 force-pushed the main branch 3 times, most recently from 28498e3 to 5862b87 (April 9, 2025 13:31)
@baifanxxx
Contributor

Hi,

Thank you for your contribution to the community; it's a great job. But the current code does not work properly. Are there any known problems? When I ran the PD separation example, I found a bug.

It seems that AscendHcclConnector is not registered correctly. I can see the registration code in "vllm_ascend/distributed/__init__.py", but it doesn't seem to take effect.

I cloned the code directly from the eeethenQ:main repository and configured the environment, but there is still a problem running PD separation. Can you help me solve this problem? Or does the current PR work properly?

Best regards,
BAI Fan

@heartStrive1998 heartStrive1998 force-pushed the main branch 2 times, most recently from 2567b74 to 810ce61 (April 11, 2025 03:23)
}

# Get all device ips using hccn_tool
HCCN_TOOL_PATH = os.environ.get("HCCN_PATH",
Collaborator

move to env.py

Contributor Author

move to env.py

fixed
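A hedged sketch of what the env.py move could look like (the module layout and the default path are placeholders, not the PR's actual values; the device-ID variables cover the same suggestion made below):

```
# Hypothetical env.py-style module collecting env-derived constants.
import os

HCCN_TOOL_PATH = os.environ.get("HCCN_PATH", "/path/to/hccn_tool")  # placeholder default
PROMPT_DEVICE_ID = os.environ.get("PROMPT_DEVICE_ID", None)
DECODE_DEVICE_ID = os.environ.get("DECODE_DEVICE_ID", None)
```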

return device_ip_list


class KVTransferEng:
Collaborator

KVTransferEngine?

Contributor Author

KVTransferEngine?

fixed

self.cluster_id = local_rank
self.data_dist = llm_datadist.LLMDataDist(self.role, self.cluster_id)

prompt_device_ids = os.environ.get('PROMPT_DEVICE_ID', None)
Collaborator

ditto, move to env.py

Contributor Author

ditto, move to env.py

fixed


def prepare_data_dist(self):
    options = {
        "llm.SyncKvCacheWaitTime": "5000",
Collaborator

make 5000 configurable by env?

Contributor Author

make 5000 configurable by env?

There is now an env attribute called LLMDATADIST_SYNC_CACHE_WAIT_TIME which replaces the 5000 here.
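A hedged sketch of that pattern (the exact env wiring in the merged code may differ):

```
# Hedged sketch: fall back to 5000 ms when the env variable is not set.
import os

LLMDATADIST_SYNC_CACHE_WAIT_TIME = os.environ.get(
    "LLMDATADIST_SYNC_CACHE_WAIT_TIME", "5000")

options = {
    "llm.SyncKvCacheWaitTime": LLMDATADIST_SYNC_CACHE_WAIT_TIME,
}
```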

if self.role == llm_datadist.LLMRole.PROMPT:
    options["ge.exec.deviceId"] = str(self.local_rank)
    options[
        "llm.listenIpInfo"] = f"{self.prompt_ip_list[self.local_rank]}:26000"
Collaborator

Does 26000 need to be hard-coded here?

Contributor Author

Does 26000 need to be hard-coded here?

There is an env value called LLMDATADIST_COMM_PORT to configure the listenIpInfo port.
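A similar hedged sketch for the port (the IP value below is a placeholder for illustration):

```
# Hedged sketch: default to port 26000 unless LLMDATADIST_COMM_PORT overrides it.
import os

LLMDATADIST_COMM_PORT = os.environ.get("LLMDATADIST_COMM_PORT", "26000")
listen_ip = "10.0.0.1"  # placeholder for the prompt node's device IP
options = {"llm.listenIpInfo": f"{listen_ip}:{LLMDATADIST_COMM_PORT}"}
```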

    options[
        "llm.listenIpInfo"] = f"{self.prompt_ip_list[self.local_rank]}:26000"
else:
    # options["ge.exec.deviceId"] = str(self.local_rank)
Collaborator

Remove the useless code comment.

Contributor Author

Remove the useless code comment.

fixed

import torchair # type: ignore
from vllm.config import VllmConfig
from vllm.distributed.kv_transfer.kv_connector.base import KVConnectorBase
from vllm.logger import init_logger
Collaborator

Just use from vllm.logger import logger, see #515.

Contributor Author

Just use from vllm.logger import logger, see #515.

fixed


num_layer = end_layer - start_layer

# 此处需要拿到input_shape的shape与hiddenState的shape
Collaborator

Don't write Chinese comments.

Contributor Author

Don't write Chinese comments.

fixed


def close(self, ):
    self.llm_datadist_engine.data_dist.unlink_clusters([self.cluster],
                                                       5000)
Collaborator

what's this 5000?


transfer port

@wangxiyuan
Collaborator

And what version of torchair and llm_datadist is required? Can they be installed from PyPI? The requirements should be updated as well.

@heartStrive1998

And what version of torchair and llm_datadist is required? Can they be installed from PyPI? The requirements should be updated as well.

Currently TorchAir does not ship as a standalone release package; it is bundled as a third-party library inside torch-npu and released together with it, so after installing the online torch-npu release package you can use TorchAir directly. llm_datadist is released with the CANN package.

@wangxiyuan
Collaborator

And what version of torchair and llm_datadist is required? Can they be installed from PyPI? The requirements should be updated as well.

Currently TorchAir does not ship as a standalone release package; it is bundled as a third-party library inside torch-npu and released together with it, so after installing the online torch-npu release package you can use TorchAir directly. llm_datadist is released with the CANN package.

Good to know! I tested locally: llm_datadist works well, but import torchair failed. I installed torch-npu==2.5.1.dev20250320.

@wangxiyuan
Collaborator

Please rebase and fix the conflicts.

@github-actions github-actions bot added the documentation, ci/build, module:tests, and module:tools labels Apr 15, 2025
ZihuiQian added 8 commits April 15, 2025 10:52
ZihuiQian added 13 commits April 15, 2025 10:52
ZihuiQian added 2 commits April 15, 2025 14:47
@wangxiyuan
Collaborator

CI is not stable. Merge this PR first.

@wangxiyuan wangxiyuan merged commit 44a8301 into vllm-project:main Apr 15, 2025
11 of 15 checks passed
ttanzhiqiang pushed a commit to ttanzhiqiang/vllm-ascend that referenced this pull request Apr 27, 2025
@gao12312

@eeethenQ how can this problem be solved, when self.llm_datadist_engine.kv_transfer.pull_cache raises an error?
```
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
INFO 04-27 09:24:22 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-27 09:24:22 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 04-27 09:24:22 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-27 09:24:22 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-27 09:24:22 [init.py:44] plugin ascend loaded.
INFO 04-27 09:24:22 [init.py:230] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-27 09:24:22 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-27 09:24:22 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 04-27 09:24:22 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-27 09:24:22 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-27 09:24:22 [init.py:44] plugin ascend loaded.
INFO 04-27 09:24:22 [init.py:230] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-27 09:24:25 [init.py:30] Available plugins for group vllm.general_plugins:
INFO 04-27 09:24:25 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-27 09:24:25 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-27 09:24:25 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-27 09:24:25 [init.py:44] plugin ascend_enhanced_model loaded.
INFO 04-27 09:24:25 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-27 09:24:25 [init.py:30] Available plugins for group vllm.general_plugins:
INFO 04-27 09:24:25 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-27 09:24:25 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-27 09:24:25 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-27 09:24:25 [init.py:44] plugin ascend_enhanced_model loaded.
INFO 04-27 09:24:25 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-27 09:24:25 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-27 09:24:25 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
INFO 04-27 09:24:36 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 04-27 09:24:36 [arg_utils.py:1731] --kv-transfer-config is not supported by the V1 Engine. Falling back to V0.
INFO 04-27 09:24:36 [config.py:1747] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 04-27 09:24:36 [platform.py:129] NPU compilation support pending. Will be available in future CANN and torch_npu releases. Using default: enforce_eager=True
INFO 04-27 09:24:36 [platform.py:134] Compilation disabled, using eager mode by default
INFO 04-27 09:24:36 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/root/models/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='/root/models/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/models/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 04-27 09:24:37 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-27 09:24:37 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-150bf/VLLM_TRACE_FUNCTION_for_process_140521_thread_281472905824320_at_2025-04-27_09:24:37.164400.log
ERROR 04-27 09:24:37 [camem.py:69] Failed to import vllm_ascend_C:libvllm_ascend_kernels.so: cannot open shared object file: No such file or directory
INFO 04-27 09:24:37 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 04-27 09:24:37 [arg_utils.py:1731] --kv-transfer-config is not supported by the V1 Engine. Falling back to V0.
INFO 04-27 09:24:37 [config.py:1747] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 04-27 09:24:37 [platform.py:129] NPU compilation support pending. Will be available in future CANN and torch_npu releases. Using default: enforce_eager=True
INFO 04-27 09:24:37 [platform.py:134] Compilation disabled, using eager mode by default
INFO 04-27 09:24:37 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/root/models/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='/root/models/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/models/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 04-27 09:24:38 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-27 09:24:38 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-2f63d/VLLM_TRACE_FUNCTION_for_process_140520_thread_281472905824320_at_2025-04-27_09:24:38.341001.log
ERROR 04-27 09:24:38 [camem.py:69] Failed to import vllm_ascend_C:libvllm_ascend_kernels.so: cannot open shared object file: No such file or directory
WARNING 04-27 09:24:51 [utils.py:2444] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdb78c5e10>
WARNING 04-27 09:24:51 [utils.py:2444] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdb78c5d80>
[rank0]:[W427 09:25:19.092134234 ProcessGroupGloo.cpp:715] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W427 09:25:20.160130258 ProcessGroupGloo.cpp:715] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-27 09:25:49 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-27 09:25:50 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-27 09:26:19 [llmdatadist_connector.py:115] 0/1 rank data dist is ready
INFO 04-27 09:26:21 [llmdatadist_connector.py:115] 0/1 rank data dist is ready
INFO 04-27 09:26:21 [model_runner.py:944] Starting to load model /root/models/DeepSeek-R1-Distill-Qwen-1.5B...
INFO 04-27 09:26:21 [llmdatadist_connector.py:165] local_rank 0 link, ret=[<LLMStatusCode.LLM_SUCCESS: 0>]
INFO 04-27 09:26:21 [model_runner.py:944] Starting to load model /root/models/DeepSeek-R1-Distill-Qwen-1.5B...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.66s/it]

INFO 04-27 09:26:29 [loader.py:458] Loading weights took 4.89 seconds
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:05<00:00, 5.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:05<00:00, 5.33s/it]

INFO 04-27 09:26:30 [loader.py:458] Loading weights took 5.67 seconds
INFO 04-27 09:26:31 [model_runner.py:949] Loading model weights took 3.3461 GB
INFO 04-27 09:26:32 [model_runner.py:949] Loading model weights took 3.3461 GB
INFO 04-27 09:26:48 [executor_base.py:112] # npu blocks: 9653, # CPU blocks: 1170
INFO 04-27 09:26:48 [executor_base.py:117] Maximum concurrency for 2000 tokens per request: 617.79x
INFO 04-27 09:26:49 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 18.01 seconds
INFO 04-27 09:26:51 [executor_base.py:112] # npu blocks: 9653, # CPU blocks: 1170
INFO 04-27 09:26:51 [executor_base.py:117] Maximum concurrency for 2000 tokens per request: 617.79x
INFO 04-27 09:26:51 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 19.39 seconds
Processed prompts: 100%|████████████████████████████████| 4/4 [00:01<00:00, 3.56it/s, est. speed input: 27.71 toks/s, output: 3.57 toks/s]
Prefill node is finished.
Waiting for prefill node to finish...
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]Process Process-2:
Traceback (most recent call last):
File "/usr/local/python3.10/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/python3.10/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/vllm-ascend-main-0425/examples/offline_disaggregated_prefill_npu.py", line 107, in run_decode
outputs = llm.generate(prompts, sampling_params)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/utils.py", line 1134, in inner
return fn(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 470, in generate
outputs = self._run_engine(use_tqdm=use_tqdm)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1409, in _run_engine
step_outputs = self.llm_engine.step()
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1431, in step
outputs = self.model_executor.execute_model(
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 140, in execute_model
output = self.collective_rpc("execute_model",
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/utils.py", line 2378, in run_method
return func(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
output = self.model_runner.execute_model(
File "/usr/local/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm-ascend-main-0425/vllm_ascend/worker/model_runner.py", line 1330, in execute_model
get_kv_transfer_group().recv_kv_caches_and_hidden_states(
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 75, in recv_kv_caches_and_hidden_states
return self.connector.recv_kv_caches_and_hidden_states(
File "/workspace/vllm-ascend-main-0425/vllm_ascend/distributed/llmdatadist_connector.py", line 386, in recv_kv_caches_and_hidden_states
self.llm_datadist_engine.kv_transfer.pull_cache(
File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/llm_datadist/v2/kv_cache_manager.py", line 267, in pull_cache
handle_llm_status(ret, '[pull_cache]', cache_key)
File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/llm_datadist/status.py", line 121, in handle_llm_status
raise LLMException(f"{func_name} failed, error code is {code_2_status(status)}, {other_info}.",
llm_datadist.status.LLMException: [pull_cache] failed, error code is LLMStatusCode.LLM_KV_CACHE_NOT_EXIST, CacheKeyByIdAndIndex(cluster_id=0, cache_id=1, batch_index=0).
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Cleanup prefill resources
All process done!
```

@MengqingCao MengqingCao mentioned this pull request May 14, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025