[Doc] Add documents for multi-node distributed serving with MP backend #30509
base: main
Conversation
Documentation preview: https://vllm--30509.org.readthedocs.build/en/30509/
💡 Codex Review
Here are some automated review suggestions for this pull request.
    # use the loopback address get_loopback_ip() for communication.
    distributed_init_method = get_distributed_init_method(
        get_loopback_ip(), get_open_port()
    )
Fix multi-node init address for multiprocessing backend
Multi-node serving with the multiprocessing executor still cannot work because the distributed process group is initialized with get_distributed_init_method(get_loopback_ip(), get_open_port()), which forces every node to bind to 127.0.0.1 on a local port instead of the configured master_addr. When following the new multi-node docs (nnodes > 1 with differing node_ranks), each node forms its own local group, torch.distributed.init_process_group never connects across nodes, and startup hangs or fails. The init method needs to use the shared master address and port for multi-node runs.
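As a rough illustration of the fix being asked for, here is a hypothetical standalone sketch of the address-selection logic (this is not the executor code itself; the names nnodes, master_addr, and master_port mirror the new CLI flags, the port 29500 is only an example value, and the tcp:// URL format follows what get_distributed_init_method builds):

    # Hypothetical sketch (not the actual vLLM patch): pick the rendezvous
    # address for torch.distributed based on how many nodes take part.
    def choose_init_method(nnodes: int, master_addr: str, master_port: int,
                           local_open_port: int) -> str:
        if nnodes > 1:
            # Every node must connect to the same master address/port,
            # so the shared values from the parallel config are required.
            return f"tcp://{master_addr}:{master_port}"
        # Single-node run: the loopback address and a locally chosen open
        # port are enough, which is what the current code always does.
        return f"tcp://127.0.0.1:{local_open_port}"

    # Example: every node in a 2-node run resolves to the same address,
    # instead of each binding to its own 127.0.0.1 socket.
    print(choose_init_method(2, "172.16.98.223", 29500, 0))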
Code Review
This pull request adds documentation for using the multiprocessing backend in a multi-node distributed serving setup. My review focuses on the clarity of the new documentation and the correctness of the related code changes. I've found some areas for improvement in the documentation examples to enhance clarity and consistency. More importantly, I've identified a critical issue in the MultiprocExecutor that appears to prevent the new multi-node functionality from working as intended. Please see my detailed comments for suggestions.
    # use the loopback address get_loopback_ip() for communication.
    distributed_init_method = get_distributed_init_method(
        get_loopback_ip(), get_open_port()
    )
The distributed_init_method is hardcoded to use get_loopback_ip(), which is only suitable for single-node deployments. For multi-node deployments to work as described in the new documentation, this needs to use the master_addr and master_port from the parallel configuration.
Without this change, the multi-node feature with the multiprocessing backend will fail to initialize the distributed process group correctly across nodes.
Suggested change: replace

    # use the loopback address get_loopback_ip() for communication.
    distributed_init_method = get_distributed_init_method(
        get_loopback_ip(), get_open_port()
    )

with

    if self.parallel_config.nnodes > 1:
        distributed_init_method = get_distributed_init_method(
            self.parallel_config.master_addr, self.parallel_config.master_port)
    else:
        # use the loopback address get_loopback_ip() for communication.
        distributed_init_method = get_distributed_init_method(
            get_loopback_ip(), get_open_port())
    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 0 \
        --master-addr 172.16.98.223
For consistency with other examples in this document and for better clarity, I suggest using the long-form arguments and a placeholder for the IP address instead of a hardcoded one.
Suggested change: replace

    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 0 \
        --master-addr 172.16.98.223

with

    vllm serve /path/to/the/model/in/the/container \
        --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2 --node-rank 0 \
        --master-addr <HEAD_NODE_IP>
    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 1 \
        --master-addr 172.16.98.223 --headless
For consistency with other examples in this document and for better clarity, I suggest using the long-form arguments and a placeholder for the IP address instead of a hardcoded one.
Suggested change: replace

    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 1 \
        --master-addr 172.16.98.223 --headless

with

    vllm serve /path/to/the/model/in/the/container \
        --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2 --node-rank 1 \
        --master-addr <HEAD_NODE_IP> --headless
Thank you @Isotr0py! cc @luccafong
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.