[Feature] Add PD separation feature #432
Conversation
| "llm.SyncKvCacheWaitTime": "5000", | ||
| } | ||
| if self.role == LLMRole.PROMPT: | ||
| options["ge.exec.deviceId"] = str(self.local_rank) |
The deviceId here is problematic: using a PROMPT_DEVICE_ID that does not start from 0 causes an error.
This is controlled in the test script via os.environ["ASCEND_RT_VISIBLE_DEVICES"]; PROMPT_DEVICE_ID and DECODE_DEVICE_ID should be kept consistent with os.environ["ASCEND_RT_VISIBLE_DEVICES"] set in the prompt and decode processes of the test script.
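For illustration, a minimal sketch of how the two processes in the example script could keep these variables aligned (the function names and defaults here are assumptions for this sketch, not the actual example code):

```python
import os

# Values the user exports before launching, e.g. PROMPT_DEVICE_ID=0,1 DECODE_DEVICE_ID=2,3
prompt_device_ids = os.environ.get("PROMPT_DEVICE_ID", "0,1")
decode_device_ids = os.environ.get("DECODE_DEVICE_ID", "2,3")


def run_prompt_process() -> None:
    # The prefill process should only see the PROMPT_DEVICE_ID devices, so the
    # deviceId derived from local_rank indexes into the same list.
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = prompt_device_ids


def run_decode_process() -> None:
    # The decode process likewise sees only the DECODE_DEVICE_ID devices.
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = decode_device_ids
```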
```python
self.hidden_cache = torchair.llm_datadist.create_npu_tensors(
    hidden_desc.shape, kv_hidden_dtype, hidden_buffer_addrs)

key_cache_key = CacheKeyByIdAndIndex(self.cluster.remote_cluster_id, 1,
```
With this interface, will the remote end not release the cache?
llm_datadist releases it automatically.
Great job! Please make the CI happy first before a detailed review. Thanks.
vllm_ascend/worker/worker.py
Outdated
```python
        local_rank: int = -1,
        backend: str = "hccl") -> None:
    """Initialize the distributed environment."""
    parallel_config = vllm_config.parallel_config
```
Just use self.parallel_config here. No need to get it from vllm_config again: self.parallel_config is updated in __init__, so if you use parallel_config = vllm_config.parallel_config, some values may be missed.
Fixed
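For reference, a minimal sketch of the point being made; the class and method below are simplified stand-ins, not the actual worker code:

```python
class WorkerSketch:
    """Toy stand-in for the Ascend worker, showing only the relevant attribute."""

    def __init__(self, vllm_config):
        self.vllm_config = vllm_config
        # __init__ may further adjust this copy after reading it from vllm_config.
        self.parallel_config = vllm_config.parallel_config

    def _init_worker_distributed_environment(self):
        # Preferred: reuse the worker's own, possibly updated, copy.
        parallel_config = self.parallel_config
        # Avoid re-reading self.vllm_config.parallel_config here, which could
        # miss values patched onto self.parallel_config in __init__.
        return parallel_config
```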
vllm_ascend/worker/worker_v1.py
Outdated
```python
        distributed_init_method: Optional[str] = None,
        local_rank: int = -1) -> None:
    """Initialize the distributed environment."""
    parallel_config = vllm_config.parallel_config
```
ditto
```python
@classmethod
def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
    from vllm.config import CompilationLevel  # noqa: E402
```
Any reason for this change?
#462
Same as that PR, to fix the e2e test failure in CI.
```python
import torch
import torch_npu
import torchair  # type: ignore
```
What's the difference between torchair.llm_datadist and llm_datadist?
Confirmed: the llm_datadist inside torchair only provides the interface for creating torch tensors.
How about just from torchair.llm_datadist import create_npu_tensors ?
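For illustration, the two import styles being discussed; the wrapper functions and argument names are placeholders for this sketch, not the connector's real code:

```python
import torchair  # type: ignore

# Style used in the diff above: import the package and qualify the call.
def make_npu_tensors_qualified(shape, dtype, buffer_addrs):
    # Per the thread, torchair.llm_datadist only wraps raw NPU buffer
    # addresses as torch tensors.
    return torchair.llm_datadist.create_npu_tensors(shape, dtype, buffer_addrs)


# Suggested alternative: import just the helper that is needed.
from torchair.llm_datadist import create_npu_tensors  # type: ignore

def make_npu_tensors_direct(shape, dtype, buffer_addrs):
    return create_npu_tensors(shape, dtype, buffer_addrs)
```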
| '{"kv_connector":"AscendHcclConnector","kv_buffer_device":"npu","kv_role":"kv_producer", "kv_parallel_size":2}' | ||
| ) | ||
|
|
||
| # Set GPU memory utilization to 0.8 for an A6000 GPU with 40GB |
rename to NPU or remove the comment
fixed
```python
)

# Set GPU memory utilization to 0.8 for an A6000 GPU with 40GB
# memory. You may need to adjust the value to fit your GPU.
```
ditto
fixed
```python
hidden = self.hidden_cache

# enumerate different requests
for idx, slen in enumerate(seq_lens):
```
Judging from the code, send/recv are handled together per batch. How does the decoder guarantee that the requests of one batch arrive at the decoder as a batch in exactly the same order?
In terms of input, the decoder's input should be consistent with that of the prefill node; that is, in examples/offline_disaggregated_prefill_npu.py the prompts fed to the prefill process and the decode process are identical.
If for some reason the two become inconsistent, the check at line 418 fails, bypass_model_exec is set to False, and the decoder node recomputes the first token.
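For illustration, a rough sketch of the kind of check being described; the names and the exact comparison are paraphrased from this thread (the real check is around line 418 of the connector), not copied from it:

```python
def should_bypass_prefill(decoder_tokens, pulled_tokens) -> bool:
    """Return True if the decoder can reuse the transferred KV cache.

    If the tokens pulled from the prefill side do not match the decoder's own
    prompt tokens, fall back to recomputing the first token locally.
    """
    if pulled_tokens is None or decoder_tokens != pulled_tokens:
        return False  # bypass_model_exec = False: recompute the first token
    return True
```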
Re "the decoder's input should be consistent with that of the prefill node": that doesn't seem reasonable. Prefill and decode both process requests per batch, but the two sides form batches independently, so the n requests of one prefill batch may not reach the decode side at the same time for various reasons. Even if they happen to arrive together and land in one batch, there is no mechanism guaranteeing that the request order in that batch matches the order of the prefill batch. So shouldn't the KV transfer here be done per request instead of for a whole batch at once?
examples/offline_disaggregated_prefill_npu.py only contains a single prompt; what if multiple requests are issued and grouped into one batch? Then every decode would very likely end up with bypass_model_exec = False and recompute the first token, which basically makes the prefill useless.
The KV cache is indeed sent at per-request granularity; llm_datadist simply requires a batch id for the data transfer. At send time what we have is the tensor of a single request, and to transfer it we manually add a batch dimension (e.g. input_shape = (1, input_tokens_tensor.shape[0], 1, 1)), but the batch size is 1, so it is still per-request granularity.
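For illustration, a small sketch of the manual batch dimension described above (input_tokens_tensor is a made-up example tensor; the surrounding llm_datadist calls are omitted):

```python
import torch

# One request's prompt tokens, as seen by the connector at send time.
input_tokens_tensor = torch.arange(16, dtype=torch.int64)

# llm_datadist keys the transfer by a batch id, so wrap the single request in
# a batch of size 1; the payload is still exactly one request.
input_shape = (1, input_tokens_tensor.shape[0], 1, 1)
batched = input_tokens_tensor.view(input_shape)

assert batched.shape[0] == 1  # batch dimension added, per-request granularity kept
```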
Force-pushed from 28498e3 to 5862b87.
Hi, thank you for your contribution to the community, it's a great job. But the current code does not work properly, are there any problems? When I ran the PD separation example, I found a bug: it seems that AscendHcclConnector is not registered correctly. I can see the registration code in vllm_ascend/distributed/__init__.py, but it doesn't seem to work. I cloned the code directly from the eeethenQ:main repository and configured the environment, and there is a problem running PD separation. Can you help me solve this problem? Or can the current PR work properly? Best regards,
Force-pushed from 2567b74 to 810ce61.
```python
}

# Get all device ips using hccn_tool
HCCN_TOOL_PATH = os.environ.get("HCCN_PATH",
```
move to env.py
fixed
```python
    return device_ip_list


class KVTransferEng:
```
KVTransferEngine?
fixed
```python
self.cluster_id = local_rank
self.data_dist = llm_datadist.LLMDataDist(self.role, self.cluster_id)

prompt_device_ids = os.environ.get('PROMPT_DEVICE_ID', None)
```
ditto, move to env.py
fixed
```python
def prepare_data_dist(self):
    options = {
        "llm.SyncKvCacheWaitTime": "5000",
```
make 5000 configurable by env?
Now there is an env variable called LLMDATADIST_SYNC_CACHE_WAIT_TIME which replaces the 5000 here.
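For illustration, a sketch of how the env-driven value could feed the options dict; the fallback default and surrounding code are assumptions, only the variable name comes from the reply above:

```python
import os

# Falls back to the previously hard-coded 5000 when the variable is unset.
sync_kv_wait_time = os.environ.get("LLMDATADIST_SYNC_CACHE_WAIT_TIME", "5000")

options = {
    "llm.SyncKvCacheWaitTime": sync_kv_wait_time,
}
```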
```python
if self.role == llm_datadist.LLMRole.PROMPT:
    options["ge.exec.deviceId"] = str(self.local_rank)
    options[
        "llm.listenIpInfo"] = f"{self.prompt_ip_list[self.local_rank]}:26000"
```
Does 26000 need to be hard-coded here?
There is an env variable called LLMDATADIST_COMM_PORT to configure the listen IP port.
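Similarly, a sketch of an env-configurable listen port replacing the hard-coded 26000; prompt_ip_list and local_rank are placeholder values here, only the variable name comes from the reply above:

```python
import os

prompt_ip_list = ["10.0.0.1", "10.0.0.2"]  # placeholder device IPs for this sketch
local_rank = 0

# Falls back to the previously hard-coded 26000 when the variable is unset.
comm_port = os.environ.get("LLMDATADIST_COMM_PORT", "26000")

options = {
    "llm.listenIpInfo": f"{prompt_ip_list[local_rank]}:{comm_port}",
}
```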
```python
    options[
        "llm.listenIpInfo"] = f"{self.prompt_ip_list[self.local_rank]}:26000"
else:
    # options["ge.exec.deviceId"] = str(self.local_rank)
```
Remove useless code comment.
fixed
```python
import torchair  # type: ignore
from vllm.config import VllmConfig
from vllm.distributed.kv_transfer.kv_connector.base import KVConnectorBase
from vllm.logger import init_logger
```
Just use from vllm.logger import logger, see #515.
fixed
```python
num_layer = end_layer - start_layer

# 此处需要拿到input_shape的shape与hiddenState的shape (need the shapes of input_shape and hiddenState here)
```
don't write chinese.
fixed
```python
def close(self, ):
    self.llm_datadist_engine.data_dist.unlink_clusters([self.cluster],
                                                       5000)
```
what's this 5000?
transfer port
And what versions of torchair and llm_datadist are required? Can they be installed from PyPI? The requirements should be updated as well.
Currently TorchAir does not provide a standalone release package; it ships as a third-party library of torch-npu and is released together with torch-npu, so after installing the torch-npu online release package you can use TorchAir directly. llm_datadist is released with the CANN package.
Good to know! I tested locally, llm_datadist works well. But
Please rebase and fix the conflicts.
Co-authored-by: ShuaibingWang <wangshuaibing3@huawei.com>
Signed-off-by: ZihuiQian <qianzihui@huawei.com>
CI is not stable. Merge this PR first.
@eeethenQ how to solve this problem when self.llm_datadist_engine.kv_transfer.pull_cache errors?
INFO 04-27 09:26:29 [loader.py:458] Loading weights took 4.89 seconds
INFO 04-27 09:26:30 [loader.py:458] Loading weights took 5.67 seconds
What this PR does / why we need it?
Adapt Disaggregated Prefill feature onto Ascend device
Does this PR introduce any user-facing change?
no
How was this patch tested?
The test usage has been provided along with the PR, in examples/offline_disaggregated_prefill_npu.py
To run it, do this:
```
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
```
Signed-off-by: ZihuiQian <qianzihui@huawei.com>
Co-authored-by: ZihuiQian <qianzihui@huawei.com>