[Misc] support multi-node data parallel #15863
Conversation
| 👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀 |
```diff
  vllm_config = get_current_vllm_config()
  use_ep = (vllm_config.parallel_config.enable_expert_parallel
-           and self.tp_size > 1)
+           and self.tp_size * self.dp_size > 1)
```
This fix is good. Without it, everything will work OK but the MoE layers will run with TP instead of EP (pretty sure that's true at least). Probably worth putting into its own PR
Thanks for catching this!
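For readers skimming the diff, here is a minimal standalone sketch of the corrected condition; the helper name and free-function form are illustrative, not vLLM's actual code:

```python
def should_use_expert_parallel(enable_expert_parallel: bool,
                               tp_size: int, dp_size: int) -> bool:
    # With data parallelism, experts are sharded across the combined TP x DP
    # group, so EP is worth enabling whenever tp_size * dp_size > 1, even when
    # tp_size alone is 1 -- the case the old `tp_size > 1` check missed.
    return enable_expert_parallel and tp_size * dp_size > 1
```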
| Hi @zxfan-cpu, thanks for this. I'm working on non-ray multi-node support first, which should be ready within the next day or so. It would be great to add ray support after that. Could you open a separate PR with the fix that @tlrmchlsmth commented on above? I will also look closer at the start_dp_msg changes, though I'm not sure whether they would make a measurable difference performance-wise. |
| This pull request has merge conflicts that must be resolved before it can be merged. |
Thanks for the PR! It's great to have support for DP with Ray.
I left some comments; please take a look. Also, has any testing been done so far?
```python
head_ip = get_ip()

def sort_by_driver_then_worker_ip(ip_and_id):
```
nit: ip_and_id is not accurate, as the tuple is (ip, node_id, int(num_gpus))
Very good advice, thank you.
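As a sketch of the renaming suggestion (the name `sort_key` and the explicit `head_ip` parameter are illustrative, not the PR's final code):

```python
def sort_key(node_info: tuple[str, str, int], head_ip: str) -> tuple[int, str]:
    """Order the driver/head node first, then the remaining nodes by IP.

    node_info is (ip, node_id, num_gpus); only the IP is used for ordering.
    """
    ip, _node_id, _num_gpus = node_info
    return (0 if ip == head_ip else 1, ip)

# Usage: bind head_ip before handing the key to sorted(), e.g.
# sorted(node_gpus, key=lambda n: sort_key(n, head_ip))
```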
```python
num_gpus = node_info["Resources"].get("GPU", 0)
node_gpu_mapping.append((ip, node_id, int(num_gpus)))
node_gpu_mapping = sorted(node_gpu_mapping, key=sort_by_driver_then_worker_ip)
return node_gpu_mapping
```
nit: the return value is not a mapping but a list. Also consider updating the method name.
Very good advice, thank you.
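A sketch of a renamed, list-returning helper along the lines of this suggestion; the function name is illustrative, and the ray.nodes() dictionary keys ("Alive", "NodeManagerAddress", "NodeID", "Resources") are assumed from Ray's node-info format:

```python
import ray

def _get_node_gpu_list(head_ip: str) -> list[tuple[str, str, int]]:
    """Return (ip, node_id, num_gpus) for every alive Ray node,
    with the head node first and the rest ordered by IP."""
    node_gpus = []
    for node in ray.nodes():
        if not node.get("Alive", False):
            continue
        ip = node["NodeManagerAddress"]
        node_id = node["NodeID"]
        num_gpus = int(node["Resources"].get("GPU", 0))
        node_gpus.append((ip, node_id, num_gpus))
    return sorted(node_gpus, key=lambda n: (0 if n[0] == head_ip else 1, n[0]))
```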
```python
if dp_rank * world_size < accumulated_gpus + num_gpus:
    gpu_index = dp_rank * world_size - accumulated_gpus
    return (ip, node_id, gpu_index)
```
Question: what happens if a DP rank spans across multiple nodes?
The world size here is identical to tensor_parallel_size (since V1 does not support pipeline parallelism), so it's OK if a DP rank spans multiple nodes.
Hmm, V1 does support pipeline parallelism. Also, can you elaborate a bit on "it's OK if a DP rank spans across multiple nodes"? How are nodes assigned to each DP rank?
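For reference, here is the placement logic under discussion, reconstructed from the quoted diff; the function signature and the error handling are assumptions:

```python
def _find_target_gpu(node_gpu_list: list[tuple[str, str, int]],
                     dp_rank: int, world_size: int) -> tuple[str, str, int]:
    """Locate the node and local GPU index where this DP rank's workers start.

    world_size is the per-DP-group size (the TP size in this discussion), so
    dp_rank * world_size is the group's first global GPU index."""
    accumulated_gpus = 0
    for ip, node_id, num_gpus in node_gpu_list:
        if dp_rank * world_size < accumulated_gpus + num_gpus:
            gpu_index = dp_rank * world_size - accumulated_gpus
            return ip, node_id, gpu_index
        accumulated_gpus += num_gpus
    raise ValueError(f"Not enough GPUs in the cluster for DP rank {dp_rank}")
```

As written it only identifies the starting node; if world_size exceeds the GPUs remaining on that node, the group spills over onto subsequent nodes, which is exactly the case the question above is probing.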
```python
return (0 if ip == head_ip else 1, ip)

def _get_gpu_mapping():
    nodes = ray.nodes()
```
Currently we use a placement group to define the resources for vLLM. This PR directly uses node resources from the cluster, without respecting the placement group. I think we will need to make things consistent and have a well-defined resource allocation protocol.
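For context, a minimal sketch of reserving GPUs through a Ray placement group instead of reading raw node resources; the worker class and the bundle count of 4 are illustrative only:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve one GPU bundle per worker; Ray decides how bundles map onto nodes.
pg = placement_group([{"GPU": 1}] * 4, strategy="PACK")
ray.get(pg.ready())

@ray.remote(num_gpus=1)
class Worker:
    def node_ip(self) -> str:
        return ray.util.get_node_ip_address()

# Actors scheduled through the placement group only consume reserved resources,
# keeping accounting consistent with the rest of the system's Ray usage.
workers = [
    Worker.options(scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg, placement_group_bundle_index=i)).remote()
    for i in range(4)
]
print(ray.get([w.node_ip.remote() for w in workers]))
```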
| Hi, is the non-ray multi-node support version ready? |
```python
node_gpu_mapping = _get_gpu_mapping()
current_dp_rank = parallel_config.data_parallel_rank
selected_node_ip, selected_node_id, gpu_index = _find_target_gpu(node_gpu_mapping, current_dp_rank)
```
gpu_index is not actually used?
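One hypothetical way the selected node could be consumed (an illustration, not the PR's actual scheduling code) is to pin the engine actor to that node and let Ray assign the GPU, in which case gpu_index would indeed go unused unless device selection were done manually:

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

@ray.remote(num_gpus=1)
class DPEngineCore:
    def ready(self) -> bool:
        return True

def launch_engine_on_node(selected_node_id: str):
    # Pin the actor to the chosen node; Ray then picks which GPU it gets, so
    # the precomputed gpu_index only matters if CUDA_VISIBLE_DEVICES (or a
    # similar mechanism) is set by hand.
    return DPEngineCore.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(
            node_id=selected_node_id, soft=False)).remote()
```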
| @Oneal65 yes, in PR #15977, which should hopefully be merged soon. I am working on a follow-up to this to support multiple API server processes, which is needed to avoid the front-end process becoming a bottleneck when there are multiple DP engines. |
| This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! | 
| This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you! | 
Ray-based support for multi-node data parallelism, plus reduced start_dp_msg communication in the DPAsyncMPClient class.