
Conversation

Contributor

@eeethenQ eeethenQ commented Mar 29, 2025

What this PR does / why we need it?

Adapt Disaggregated Prefill feature onto Ascend device

Does this PR introduce any user-facing change?

no

How was this patch tested?

The test usage has been provided along with the PR, in examples/offline_disaggregated_prefill_npu.py
To run it, do this

export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py

"llm.SyncKvCacheWaitTime": "5000",
}
if self.role == LLMRole.PROMPT:
options["ge.exec.deviceId"] = str(self.local_rank)


There is a problem with the deviceId here: using a PROMPT_DEVICE_ID that does not start from 0 causes an error.

Contributor Author

There is a problem with the deviceId here: using a PROMPT_DEVICE_ID that does not start from 0 causes an error.

This is controlled in the test script via os.environ["ASCEND_RT_VISIBLE_DEVICES"]. PROMPT_DEVICE_ID and DECODE_DEVICE_ID should be kept consistent with the os.environ["ASCEND_RT_VISIBLE_DEVICES"] values set in the prompt and decode processes of the test script.
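A minimal sketch of that setup (the process structure and default device lists below are assumptions for illustration, not the exact example code):

```
# Hedged sketch, not the actual example script: each worker process pins its
# visible NPUs from the corresponding environment variable before building the engine.
import os
from multiprocessing import Process


def run_prefill():
    # The prefill (prompt) process should only see the PROMPT_DEVICE_ID devices.
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = os.environ.get("PROMPT_DEVICE_ID", "0,1")
    ...  # build the kv_producer LLM and run generate()


def run_decode():
    # The decode process should only see the DECODE_DEVICE_ID devices.
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = os.environ.get("DECODE_DEVICE_ID", "2,3")
    ...  # build the kv_consumer LLM and run generate()


if __name__ == "__main__":
    procs = [Process(target=run_prefill), Process(target=run_decode)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```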

self.hidden_cache = torchair.llm_datadist.create_npu_tensors(
    hidden_desc.shape, kv_hidden_dtype, hidden_buffer_addrs)

key_cache_key = CacheKeyByIdAndIndex(self.cluster.remote_cluster_id, 1,


When this interface is used, will the remote end fail to release the cache?

Contributor Author

When this interface is used, will the remote end fail to release the cache?

llmdatadist releases it automatically.

@wangxiyuan
Collaborator

Great job! Please make the CI happy first before a detailed review. Thanks.

local_rank: int = -1,
backend: str = "hccl") -> None:
"""Initialize the distributed environment."""
parallel_config = vllm_config.parallel_config
Collaborator

Just use self.parallel_config here; there is no need to get it from vllm_config again. self.parallel_config is updated in __init__, so if you use parallel_config = vllm_config.parallel_config, some values may be missed.

Contributor Author

Just use self.parallel_config here; there is no need to get it from vllm_config again. self.parallel_config is updated in __init__, so if you use parallel_config = vllm_config.parallel_config, some values may be missed.

Fixed
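A hedged sketch of the suggested pattern (the class and attribute wiring are illustrative, not the PR's actual worker code):

```
# Hedged sketch of the reviewer's suggestion, not the exact diff applied in the PR.
class _WorkerSketch:
    def __init__(self, vllm_config):
        # __init__ may replace or adjust the parallel config after copying it,
        # so self.parallel_config is the authoritative copy afterwards.
        self.vllm_config = vllm_config
        self.parallel_config = vllm_config.parallel_config

    def init_distributed_environment(self, backend="hccl"):
        """Initialize the distributed environment."""
        # Suggested: use the attribute updated in __init__ rather than
        # self.vllm_config.parallel_config, which may miss those updates.
        parallel_config = self.parallel_config
        return parallel_config
```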

distributed_init_method: Optional[str] = None,
local_rank: int = -1) -> None:
"""Initialize the distributed environment."""
parallel_config = vllm_config.parallel_config
Collaborator

ditto


@classmethod
def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
    from vllm.config import CompilationLevel  # noqa: E402
Collaborator

Any reason for this change?

Contributor Author

Any reason for this change?

#462
Same as that PR, to fix the e2e test failure in CI.


import torch
import torch_npu
import torchair # type: ignore
Collaborator

What's the difference between torchair.llm_datadist and llm_datadist?


Contributor Author

What's the difference between torchair.llm_datadist and llm_datadist?

This has been confirmed: the llm_datadist inside torchair is only the interface for creating torch tensors.

Collaborator

How about just from torchair.llm_datadist import create_npu_tensors ?

'{"kv_connector":"AscendHcclConnector","kv_buffer_device":"npu","kv_role":"kv_producer", "kv_parallel_size":2}'
)

# Set GPU memory utilization to 0.8 for an A6000 GPU with 40GB
Collaborator

rename to NPU or remove the comment

Contributor Author

rename to NPU or remove the comment

fixed

)

# Set GPU memory utilization to 0.8 for an A6000 GPU with 40GB
# memory. You may need to adjust the value to fit your GPU.
Collaborator

ditto

Contributor Author

ditto

fixed

hidden = self.hidden_cache

# enumerate different requests
for idx, slen in enumerate(seq_lens):
Contributor

Looking at the code, send/recv are both handled per batch. How does the decoder guarantee that the requests of one prefill batch end up in the decoder as exactly the same batch, in the same order?

Contributor Author

Looking at the code, send/recv are both handled per batch. How does the decoder guarantee that the requests of one prefill batch end up in the decoder as exactly the same batch, in the same order?

In terms of input, the decoder's input is expected to match the prefill node's; that is, in examples/offline_disaggregated_prefill_npu.py the input prompts of the prefill process and the decode process are identical.

If for some reason the two do not match, the check at line 418 fails, bypass_model_exec is set to False, and the decoder node recomputes the first token.

Contributor

@lidenghui1110 lidenghui1110 Apr 10, 2025

the decoder's input is expected to match the prefill node's

That does not seem reasonable. Prefill and decode both process requests per batch, but the two sides form batches independently, so the n requests of one prefill batch may not reach the decode side at the same time for all sorts of reasons. Even if they happen to arrive together and land in one batch, there is no mechanism guaranteeing that the request order in that batch matches the order on the prefill side. So shouldn't the KV transfer be done per request rather than for the whole batch at once?
examples/offline_disaggregated_prefill_npu.py only uses a single prompts list. What if multiple requests are issued and form a batch? Then every decode would very likely end up with bypass_model_exec = False and recompute the first token, which basically wastes the prefill.

@heartStrive1998 heartStrive1998 Apr 11, 2025

The KV cache is indeed sent at per-request granularity; llm_datadist just needs a batch id during data transfer. When sending, what we have is the tensor of a single request. To transfer it, we manually add a batch dimension (e.g. input_shape = (1, input_tokens_tensor.shape[0], 1, 1)), but the batch size is 1, so it is still per-request granularity in practice.
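A minimal illustration of that point (the token values below are invented; only the shape handling follows the explanation above):

```
# Hedged illustration of the per-request batch dimension described above.
import torch

input_tokens_tensor = torch.arange(16)                  # tokens of one request
input_shape = (1, input_tokens_tensor.shape[0], 1, 1)   # llm_datadist wants a batch axis
per_request_view = input_tokens_tensor.reshape(input_shape)
print(per_request_view.shape)                           # torch.Size([1, 16, 1, 1]), batch size 1
```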

@heartStrive1998 heartStrive1998 force-pushed the main branch 3 times, most recently from 28498e3 to 5862b87 (April 9, 2025 13:31)
@baifanxxx
Contributor

Hi,

Thank you for your contribution to the community; it's a great job. But the current code does not work properly. Are there any known problems? When I ran the PD separation example, I found a bug.

It seems that AscendHcclConnector is not registered correctly. I can see the registration code in "vllm_ascend/distributed/__init__.py", but it doesn't seem to take effect.

I cloned the code directly from the eeethenQ:main repository and configured the environment, but there is still a problem running PD separation. Can you help me solve this problem? Or does the current PR work properly?

Best regards,
BAI Fan

@heartStrive1998 heartStrive1998 force-pushed the main branch 2 times, most recently from 2567b74 to 810ce61 (April 11, 2025 03:23)
}

# Get all device ips using hccn_tool
HCCN_TOOL_PATH = os.environ.get("HCCN_PATH",
Collaborator

move to env.py

Contributor Author

move to env.py

fixed
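A hedged sketch of what the env.py move could look like (the module layout and the default path are placeholders, not the PR's actual values; the device-ID variables cover the same suggestion made below):

```
# Hypothetical env.py-style module collecting env-derived constants.
import os

HCCN_TOOL_PATH = os.environ.get("HCCN_PATH", "/path/to/hccn_tool")  # placeholder default
PROMPT_DEVICE_ID = os.environ.get("PROMPT_DEVICE_ID", None)
DECODE_DEVICE_ID = os.environ.get("DECODE_DEVICE_ID", None)
```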

return device_ip_list


class KVTransferEng:
Collaborator

KVTransferEngine?

Contributor Author

KVTransferEngine?

fixed

self.cluster_id = local_rank
self.data_dist = llm_datadist.LLMDataDist(self.role, self.cluster_id)

prompt_device_ids = os.environ.get('PROMPT_DEVICE_ID', None)
Collaborator

ditto, move to env.py

Contributor Author

ditto, move to env.py

fixed


def prepare_data_dist(self):
    options = {
        "llm.SyncKvCacheWaitTime": "5000",
Collaborator

make 5000 configurable by env?

Contributor Author

make 5000 configurable by env?

There is now an env attribute called LLMDATADIST_SYNC_CACHE_WAIT_TIME which replaces the 5000 here.
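A hedged sketch of that pattern (the exact env wiring in the merged code may differ):

```
# Hedged sketch: fall back to 5000 ms when the env variable is not set.
import os

LLMDATADIST_SYNC_CACHE_WAIT_TIME = os.environ.get(
    "LLMDATADIST_SYNC_CACHE_WAIT_TIME", "5000")

options = {
    "llm.SyncKvCacheWaitTime": LLMDATADIST_SYNC_CACHE_WAIT_TIME,
}
```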

if self.role == llm_datadist.LLMRole.PROMPT:
    options["ge.exec.deviceId"] = str(self.local_rank)
    options[
        "llm.listenIpInfo"] = f"{self.prompt_ip_list[self.local_rank]}:26000"
Collaborator

Does 26000 need to be hard-coded here?

Contributor Author

Does 26000 need to be hard-coded here?

There is an env value called LLMDATADIST_COMM_PORT to configure the listenIpInfo port.
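A similar hedged sketch for the port (the IP value below is a placeholder for illustration):

```
# Hedged sketch: default to port 26000 unless LLMDATADIST_COMM_PORT overrides it.
import os

LLMDATADIST_COMM_PORT = os.environ.get("LLMDATADIST_COMM_PORT", "26000")
listen_ip = "10.0.0.1"  # placeholder for the prompt node's device IP
options = {"llm.listenIpInfo": f"{listen_ip}:{LLMDATADIST_COMM_PORT}"}
```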

    options[
        "llm.listenIpInfo"] = f"{self.prompt_ip_list[self.local_rank]}:26000"
else:
    # options["ge.exec.deviceId"] = str(self.local_rank)
Collaborator

Remove the useless code comment.

Contributor Author

Remove the useless code comment.

fixed

import torchair # type: ignore
from vllm.config import VllmConfig
from vllm.distributed.kv_transfer.kv_connector.base import KVConnectorBase
from vllm.logger import init_logger
Collaborator

Just use from vllm.logger import logger, see #515.

Contributor Author

Just use from vllm.logger import logger, see #515.

fixed


num_layer = end_layer - start_layer

# 此处需要拿到input_shape的shape与hiddenState的shape
Collaborator

Don't write Chinese comments.

Contributor Author

Don't write Chinese comments.

fixed


def close(self, ):
    self.llm_datadist_engine.data_dist.unlink_clusters([self.cluster],
                                                       5000)
Collaborator

what's this 5000?


transfer port

@wangxiyuan
Collaborator

And what version of torchair and llm_datadist is required? Can they be installed from PyPI? The requirements should be updated as well.

@heartStrive1998

And what version of torchair and llm_datadist is required? Can they be installed from PyPI? The requirements should be updated as well.

Currently TorchAir does not ship as a standalone release package; it is bundled as a third-party library inside torch-npu and released together with it, so after installing the online torch-npu release package you can use TorchAir directly. llm_datadist is released with the CANN package.

@wangxiyuan
Collaborator

And what version of torchair and llm_datadist is required? Can they be installed from PyPI? The requirements should be updated as well.

Currently TorchAir does not ship as a standalone release package; it is bundled as a third-party library inside torch-npu and released together with it, so after installing the online torch-npu release package you can use TorchAir directly. llm_datadist is released with the CANN package.

Good to know! I tested locally: llm_datadist works well, but import torchair failed. I installed torch-npu==2.5.1.dev20250320.

@wangxiyuan
Collaborator

Please rebase and fix the conflicts.

@github-actions github-actions bot added the documentation, ci/build, module:tests, and module:tools labels Apr 15, 2025
ZihuiQian added 8 commits April 15, 2025 10:52
ZihuiQian added 13 commits April 15, 2025 10:52
ZihuiQian added 2 commits April 15, 2025 14:47
@wangxiyuan
Collaborator

CI is not stable. Merge this PR first.

@wangxiyuan wangxiyuan merged commit 44a8301 into vllm-project:main Apr 15, 2025
11 of 15 checks passed
ttanzhiqiang pushed a commit to ttanzhiqiang/vllm-ascend that referenced this pull request Apr 27, 2025
@gao12312

@eeethenQ how can this problem be solved, when self.llm_datadist_engine.kv_transfer.pull_cache raises an error?
```
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
INFO 04-27 09:24:22 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-27 09:24:22 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 04-27 09:24:22 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-27 09:24:22 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-27 09:24:22 [init.py:44] plugin ascend loaded.
INFO 04-27 09:24:22 [init.py:230] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-27 09:24:22 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-27 09:24:22 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 04-27 09:24:22 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-27 09:24:22 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-27 09:24:22 [init.py:44] plugin ascend loaded.
INFO 04-27 09:24:22 [init.py:230] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-27 09:24:25 [init.py:30] Available plugins for group vllm.general_plugins:
INFO 04-27 09:24:25 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-27 09:24:25 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-27 09:24:25 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-27 09:24:25 [init.py:44] plugin ascend_enhanced_model loaded.
INFO 04-27 09:24:25 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-27 09:24:25 [init.py:30] Available plugins for group vllm.general_plugins:
INFO 04-27 09:24:25 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-27 09:24:25 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-27 09:24:25 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-27 09:24:25 [init.py:44] plugin ascend_enhanced_model loaded.
INFO 04-27 09:24:25 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-27 09:24:25 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-27 09:24:25 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-27 09:24:25 [registry.py:380] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
INFO 04-27 09:24:36 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 04-27 09:24:36 [arg_utils.py:1731] --kv-transfer-config is not supported by the V1 Engine. Falling back to V0.
INFO 04-27 09:24:36 [config.py:1747] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 04-27 09:24:36 [platform.py:129] NPU compilation support pending. Will be available in future CANN and torch_npu releases. Using default: enforce_eager=True
INFO 04-27 09:24:36 [platform.py:134] Compilation disabled, using eager mode by default
INFO 04-27 09:24:36 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/root/models/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='/root/models/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/models/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 04-27 09:24:37 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-27 09:24:37 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-150bf/VLLM_TRACE_FUNCTION_for_process_140521_thread_281472905824320_at_2025-04-27_09:24:37.164400.log
ERROR 04-27 09:24:37 [camem.py:69] Failed to import vllm_ascend_C:libvllm_ascend_kernels.so: cannot open shared object file: No such file or directory
INFO 04-27 09:24:37 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 04-27 09:24:37 [arg_utils.py:1731] --kv-transfer-config is not supported by the V1 Engine. Falling back to V0.
INFO 04-27 09:24:37 [config.py:1747] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 04-27 09:24:37 [platform.py:129] NPU compilation support pending. Will be available in future CANN and torch_npu releases. Using default: enforce_eager=True
INFO 04-27 09:24:37 [platform.py:134] Compilation disabled, using eager mode by default
INFO 04-27 09:24:37 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/root/models/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='/root/models/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/models/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 04-27 09:24:38 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-27 09:24:38 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-2f63d/VLLM_TRACE_FUNCTION_for_process_140520_thread_281472905824320_at_2025-04-27_09:24:38.341001.log
ERROR 04-27 09:24:38 [camem.py:69] Failed to import vllm_ascend_C:libvllm_ascend_kernels.so: cannot open shared object file: No such file or directory
WARNING 04-27 09:24:51 [utils.py:2444] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdb78c5e10>
WARNING 04-27 09:24:51 [utils.py:2444] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdb78c5d80>
[rank0]:[W427 09:25:19.092134234 ProcessGroupGloo.cpp:715] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W427 09:25:20.160130258 ProcessGroupGloo.cpp:715] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-27 09:25:49 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-27 09:25:50 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-27 09:26:19 [llmdatadist_connector.py:115] 0/1 rank data dist is ready
INFO 04-27 09:26:21 [llmdatadist_connector.py:115] 0/1 rank data dist is ready
INFO 04-27 09:26:21 [model_runner.py:944] Starting to load model /root/models/DeepSeek-R1-Distill-Qwen-1.5B...
INFO 04-27 09:26:21 [llmdatadist_connector.py:165] local_rank 0 link, ret=[<LLMStatusCode.LLM_SUCCESS: 0>]
INFO 04-27 09:26:21 [model_runner.py:944] Starting to load model /root/models/DeepSeek-R1-Distill-Qwen-1.5B...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.66s/it]

INFO 04-27 09:26:29 [loader.py:458] Loading weights took 4.89 seconds
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:05<00:00, 5.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:05<00:00, 5.33s/it]

INFO 04-27 09:26:30 [loader.py:458] Loading weights took 5.67 seconds
INFO 04-27 09:26:31 [model_runner.py:949] Loading model weights took 3.3461 GB
INFO 04-27 09:26:32 [model_runner.py:949] Loading model weights took 3.3461 GB
INFO 04-27 09:26:48 [executor_base.py:112] # npu blocks: 9653, # CPU blocks: 1170
INFO 04-27 09:26:48 [executor_base.py:117] Maximum concurrency for 2000 tokens per request: 617.79x
INFO 04-27 09:26:49 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 18.01 seconds
INFO 04-27 09:26:51 [executor_base.py:112] # npu blocks: 9653, # CPU blocks: 1170
INFO 04-27 09:26:51 [executor_base.py:117] Maximum concurrency for 2000 tokens per request: 617.79x
INFO 04-27 09:26:51 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 19.39 seconds
Processed prompts: 100%|████████████████████████████████| 4/4 [00:01<00:00, 3.56it/s, est. speed input: 27.71 toks/s, output: 3.57 toks/s]
Prefill node is finished.
Waiting for prefill node to finish...
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]Process Process-2:
Traceback (most recent call last):
File "/usr/local/python3.10/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/python3.10/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/vllm-ascend-main-0425/examples/offline_disaggregated_prefill_npu.py", line 107, in run_decode
outputs = llm.generate(prompts, sampling_params)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/utils.py", line 1134, in inner
return fn(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 470, in generate
outputs = self._run_engine(use_tqdm=use_tqdm)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1409, in _run_engine
step_outputs = self.llm_engine.step()
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1431, in step
outputs = self.model_executor.execute_model(
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 140, in execute_model
output = self.collective_rpc("execute_model",
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/utils.py", line 2378, in run_method
return func(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
output = self.model_runner.execute_model(
File "/usr/local/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm-ascend-main-0425/vllm_ascend/worker/model_runner.py", line 1330, in execute_model
get_kv_transfer_group().recv_kv_caches_and_hidden_states(
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 75, in recv_kv_caches_and_hidden_states
return self.connector.recv_kv_caches_and_hidden_states(
File "/workspace/vllm-ascend-main-0425/vllm_ascend/distributed/llmdatadist_connector.py", line 386, in recv_kv_caches_and_hidden_states
self.llm_datadist_engine.kv_transfer.pull_cache(
File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/llm_datadist/v2/kv_cache_manager.py", line 267, in pull_cache
handle_llm_status(ret, '[pull_cache]', cache_key)
File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/llm_datadist/status.py", line 121, in handle_llm_status
raise LLMException(f"{func_name} failed, error code is {code_2_status(status)}, {other_info}.",
llm_datadist.status.LLMException: [pull_cache] failed, error code is LLMStatusCode.LLM_KV_CACHE_NOT_EXIST, CacheKeyByIdAndIndex(cluster_id=0, cache_id=1, batch_index=0).
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Cleanup prefill resources
All process done!
```

@MengqingCao MengqingCao mentioned this pull request May 14, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025