[Doc] Add documents for multi-node distributed serving with MP backend #30509
base: main
Conversation
Documentation preview: https://vllm--30509.org.readthedocs.build/en/30509/
💡 Codex Review
Here are some automated review suggestions for this pull request.
    # use the loopback address get_loopback_ip() for communication.
    distributed_init_method = get_distributed_init_method(
        get_loopback_ip(), get_open_port()
    )
Fix multi-node init address for multiprocessing backend
Multi-node serving with the multiprocessing executor still cannot work because the distributed process group is initialized with get_distributed_init_method(get_loopback_ip(), get_open_port()), which forces every node to bind to 127.0.0.1 on a local port instead of the configured master_addr. When following the new multi-node docs (nnodes > 1 with differing node_ranks), each node forms its own local group, torch.distributed.init_process_group never connects across nodes, and startup hangs or fails. The init method needs to use the shared master address and port for multi-node runs.
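As a rough illustration of the fix being asked for, here is a hypothetical standalone sketch of the address-selection logic (this is not the executor code itself; the names nnodes, master_addr, and master_port mirror the new CLI flags, the port 29500 is only an example value, and the tcp:// URL format follows what get_distributed_init_method builds):

    # Hypothetical sketch (not the actual vLLM patch): pick the rendezvous
    # address for torch.distributed based on how many nodes take part.
    def choose_init_method(nnodes: int, master_addr: str, master_port: int,
                           local_open_port: int) -> str:
        if nnodes > 1:
            # Every node must connect to the same master address/port,
            # so the shared values from the parallel config are required.
            return f"tcp://{master_addr}:{master_port}"
        # Single-node run: the loopback address and a locally chosen open
        # port are enough, which is what the current code always does.
        return f"tcp://127.0.0.1:{local_open_port}"

    # Example: every node in a 2-node run resolves to the same address,
    # instead of each binding to its own 127.0.0.1 socket.
    print(choose_init_method(2, "172.16.98.223", 29500, 0))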
Code Review
This pull request adds documentation for using the multiprocessing backend in a multi-node distributed serving setup. My review focuses on the clarity of the new documentation and the correctness of the related code changes. I've found some areas for improvement in the documentation examples to enhance clarity and consistency. More importantly, I've identified a critical issue in the MultiprocExecutor that appears to prevent the new multi-node functionality from working as intended. Please see my detailed comments for suggestions.
    # use the loopback address get_loopback_ip() for communication.
    distributed_init_method = get_distributed_init_method(
        get_loopback_ip(), get_open_port()
    )
The distributed_init_method is hardcoded to use get_loopback_ip(), which is only suitable for single-node deployments. For multi-node deployments to work as described in the new documentation, this needs to use the master_addr and master_port from the parallel configuration.
Without this change, the multi-node feature with the multiprocessing backend will fail to initialize the distributed process group correctly across nodes.
Suggested change: replace

    # use the loopback address get_loopback_ip() for communication.
    distributed_init_method = get_distributed_init_method(
        get_loopback_ip(), get_open_port()
    )

with

    if self.parallel_config.nnodes > 1:
        distributed_init_method = get_distributed_init_method(
            self.parallel_config.master_addr, self.parallel_config.master_port)
    else:
        # use the loopback address get_loopback_ip() for communication.
        distributed_init_method = get_distributed_init_method(
            get_loopback_ip(), get_open_port())
    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 0 \
        --master-addr 172.16.98.223
For consistency with other examples in this document and for better clarity, I suggest using the long-form arguments and a placeholder for the IP address instead of a hardcoded one.
Suggested change: replace

    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 0 \
        --master-addr 172.16.98.223

with

    vllm serve /path/to/the/model/in/the/container \
        --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2 --node-rank 0 \
        --master-addr <HEAD_NODE_IP>
    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 1 \
        --master-addr 172.16.98.223 --headless
For consistency with other examples in this document and for better clarity, I suggest using the long-form arguments and a placeholder for the IP address instead of a hardcoded one.
Suggested change: replace

    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 1 \
        --master-addr 172.16.98.223 --headless

with

    vllm serve /path/to/the/model/in/the/container \
        --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2 --node-rank 1 \
        --master-addr <HEAD_NODE_IP> --headless
Thank you @Isotr0py! cc @luccafong
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.