[core][distributed] fix ray worker rank assignment #6235

youkaichao · 2024-07-09T00:30:20Z

It looks like Ray does not guarantee that worker ranks align with machine boundaries. When they do not, a tensor-parallel group ends up spanning machines, which severely hurts performance.
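To make the failure mode concrete, here is a minimal sketch. It assumes that each tensor-parallel group is a block of consecutive ranks and that rank i is given to the i-th worker Ray returns. The function name and the short node labels "A" and "B" are hypothetical stand-ins, not vLLM code.

```python
from typing import List


def tp_groups_spanning_nodes(worker_nodes: List[str], tp_size: int) -> List[List[int]]:
    """Return the tensor-parallel groups whose workers sit on more than one node.

    Assumption: worker_nodes[i] is the node hosting rank i, and each group
    is a block of tp_size consecutive ranks.
    """
    spanning = []
    for start in range(0, len(worker_nodes), tp_size):
        group = list(range(start, min(start + tp_size, len(worker_nodes))))
        nodes_in_group = {worker_nodes[rank] for rank in group}
        if len(nodes_in_group) > 1:
            spanning.append(group)
    return spanning


# Same node pattern as the example in the diff comment: one worker on A,
# four on B, then three more on A.
ray_order = ["A", "B", "B", "B", "B", "A", "A", "A"]
aligned_order = ["A"] * 4 + ["B"] * 4

print(tp_groups_spanning_nodes(ray_order, tp_size=4))      # both groups cross nodes
print(tp_groups_spanning_nodes(aligned_order, tp_size=4))  # no group crosses nodes
```

With the unaligned order, both groups of four mix workers from A and B, so every tensor-parallel collective would cross the network.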

andoorve · 2024-07-09T02:35:05Z

vllm/executor/ray_gpu_executor.py

+        # the machine boundaries. We need to make sure that workers in the
+        # same node are assigned consecutive ranks.
+        # examples:
+        # [('852a09a13c7503ef126d7c828454c741494b1be33a8627a5206604d9', [0]), ('dfaad7adfdae57a694cc74490db45bd112c9f31243523e43ddc2e7f0', [0]), ('dfaad7adfdae57a694cc74490db45bd112c9f31243523e43ddc2e7f0', [1]), ('dfaad7adfdae57a694cc74490db45bd112c9f31243523e43ddc2e7f0', [2]), ('dfaad7adfdae57a694cc74490db45bd112c9f31243523e43ddc2e7f0', [3]), ('852a09a13c7503ef126d7c828454c741494b1be33a8627a5206604d9', [1]), ('852a09a13c7503ef126d7c828454c741494b1be33a8627a5206604d9', [2]), ('852a09a13c7503ef126d7c828454c741494b1be33a8627a5206604d9', [3])] # noqa


Are we guaranteed that the GPU IDs for a given node will be in strictly increasing order?

I don't think so, and it is not needed either. CUDA_VISIBLE_DEVICES is ",".join(map(str, node_gpus[node_id])), and the local rank is node_workers[node_id].index(rank). As long as these two are built consistently, it should be fine.

node_gpus[node_id]: list of gpu ids for each worker in this node
node_workers[node_id]: list of worker ranks for each worker in this node
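The consistency argument above can be sketched as follows. This is an illustration under assumptions, not the executor's code: the helper names build_node_maps and describe_worker are hypothetical, and the input is assumed to be a list of (node_id, [gpu_id]) pairs indexed by rank. Because both per-node lists are appended to in the same loop, position k in node_workers[node_id] and position k in node_gpus[node_id] always refer to the same worker, whatever order the GPU IDs arrive in.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Worker = Tuple[str, List[int]]  # (node_id, gpu_ids), indexed by rank


def build_node_maps(workers: List[Worker]) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Group ranks and GPU IDs per node, appending both in the same order."""
    node_workers: Dict[str, List[int]] = defaultdict(list)
    node_gpus: Dict[str, List[int]] = defaultdict(list)
    for rank, (node_id, gpu_ids) in enumerate(workers):
        node_workers[node_id].append(rank)
        node_gpus[node_id].extend(gpu_ids)
    return node_workers, node_gpus


def describe_worker(rank: int, workers: List[Worker],
                    node_workers: Dict[str, List[int]],
                    node_gpus: Dict[str, List[int]]) -> Tuple[str, int, int]:
    """Return (CUDA_VISIBLE_DEVICES value, local rank, physical GPU selected)."""
    node_id = workers[rank][0]
    visible = ",".join(map(str, node_gpus[node_id]))
    local_rank = node_workers[node_id].index(rank)
    # Device index local_rank maps to the local_rank-th entry of the visible list.
    physical_gpu = node_gpus[node_id][local_rank]
    return visible, local_rank, physical_gpu


# GPU IDs on node "n0" are deliberately not in increasing order.
workers = [("n0", [2]), ("n0", [0]), ("n1", [1]), ("n1", [0])]
node_workers, node_gpus = build_node_maps(workers)
for rank in range(len(workers)):
    print(rank, describe_worker(rank, workers, node_workers, node_gpus))
```

Each rank ends up on the GPU Ray assigned to it, even though the visible-device list for "n0" reads "2,0".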

andoorve · 2024-07-09T04:28:42Z

vllm/executor/ray_gpu_executor.py

+        # [('852a09a13c7503ef126d7c828454c741494b1be33a8627a5206604d9', [0]), ('dfaad7adfdae57a694cc74490db45bd112c9f31243523e43ddc2e7f0', [0]), ('dfaad7adfdae57a694cc74490db45bd112c9f31243523e43ddc2e7f0', [1]), ('dfaad7adfdae57a694cc74490db45bd112c9f31243523e43ddc2e7f0', [2]), ('dfaad7adfdae57a694cc74490db45bd112c9f31243523e43ddc2e7f0', [3]), ('852a09a13c7503ef126d7c828454c741494b1be33a8627a5206604d9', [1]), ('852a09a13c7503ef126d7c828454c741494b1be33a8627a5206604d9', [2]), ('852a09a13c7503ef126d7c828454c741494b1be33a8627a5206604d9', [3])] # noqa
+
+        # initialize worker ranks with -1 (unassigned)
+        worker_ranks = [-1 for x in worker_node_and_gpu_ids]


You can do this simply by using Python's sort/sorted, but that can be very unreadable.

I wish Ray could sort by node and GPU ID for me :( but it cannot.
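For readers following the thread, one explicit way to hand out node-contiguous ranks is sketched below. It is an illustration only and may differ from the merged code: the function name assign_ranks_by_node is hypothetical, the short node labels stand in for the long Ray node IDs, and nodes are taken in order of first appearance.

```python
from typing import List, Tuple

Worker = Tuple[str, List[int]]  # (node_id, gpu_ids), in the order Ray returned them


def assign_ranks_by_node(workers: List[Worker]) -> List[int]:
    """Return worker_ranks, where worker_ranks[i] is the rank of workers[i].

    Workers on the same node receive consecutive ranks.
    """
    worker_ranks = [-1] * len(workers)  # -1 means unassigned
    next_rank = 0
    # dict.fromkeys keeps insertion order, so nodes are visited as first seen.
    for current_node in dict.fromkeys(node_id for node_id, _ in workers):
        for i, (node_id, _) in enumerate(workers):
            if node_id == current_node:
                worker_ranks[i] = next_rank
                next_rank += 1
    return worker_ranks


# Same shape as the example in the diff comment, with shortened node labels.
workers = [("A", [0]), ("B", [0]), ("B", [1]), ("B", [2]),
           ("B", [3]), ("A", [1]), ("A", [2]), ("A", [3])]
worker_ranks = assign_ranks_by_node(workers)
print(worker_ranks)
```

A stable sorted() call keyed on the node ID would give the same grouping in fewer lines; the explicit loop trades brevity for readability, which is the concern raised in the comment above.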

youkaichao added 2 commits July 8, 2024 17:24

fix ray worker rank assignment

0d1ed5d

remove index, use rank only

38f4ce5

andoorve reviewed Jul 9, 2024

andoorve approved these changes Jul 9, 2024

andoorve reviewed Jul 9, 2024

youkaichao merged commit 70c232f into vllm-project:main Jul 9, 2024
70 checks passed

youkaichao deleted the fix_rank branch July 9, 2024 04:31

This was referenced Jul 13, 2024

[Bugfix] Multinode hang fix with PP #6399

Closed

[Bugfix] Fix for multinode crash on 4 PP #6495

Merged

dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 17, 2024

[core][distributed] fix ray worker rank assignment (vllm-project#6235)

b31efb0

xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024

[core][distributed] fix ray worker rank assignment (vllm-project#6235)

c4c7821

ruisearch42 mentioned this pull request Jul 31, 2024

[Core] Pipeline parallel with Ray ADAG #6837

Merged

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[core][distributed] fix ray worker rank assignment (vllm-project#6235)

bc57d7e

Signed-off-by: Alvant <alvasian@yandex.ru>
