Conversation

@Isotr0py (Member) commented Dec 11, 2025

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
mergify bot commented Dec 11, 2025

Documentation preview: https://vllm--30509.org.readthedocs.build/en/30509/

@mergify mergify bot added the documentation and v1 labels Dec 11, 2025
@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +127 to 129:

    # use the loopback address get_loopback_ip() for communication.
    distributed_init_method = get_distributed_init_method(
        get_loopback_ip(), get_open_port()


P1: Fix multi-node init address for multiprocessing backend

Multi-node serving with the multiprocessing executor still cannot work because the distributed process group is initialized with get_distributed_init_method(get_loopback_ip(), get_open_port()), forcing every node to bind to 127.0.0.1 and a local port instead of the configured master_addr. When following the new multi-node docs (nnodes>1, differing node_ranks), each node forms its own local group and torch.distributed.init_process_group never connects across nodes, so startup will hang/fail. The init method needs to use the shared master address/port for multi-node runs.
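To make the failure mode concrete, here is a minimal sketch (not vLLM's actual implementation; it assumes `get_distributed_init_method(ip, port)` builds a `tcp://ip:port` init URL, `build_init_method` is a hypothetical helper, and the config field names follow the suggestion later in this thread):

```python
# Hypothetical helper (sketch only): single-node runs can keep the loopback
# address, but multi-node runs must all dial the shared master address/port.
def build_init_method(
    nnodes: int,
    master_addr: str,
    master_port: int,
    loopback_ip: str = "127.0.0.1",
    local_port: int = 29500,
) -> str:
    if nnodes > 1:
        # Every node joins the same process group via the head node's address.
        return f"tcp://{master_addr}:{master_port}"
    # Single node: loopback plus any free local port is sufficient.
    return f"tcp://{loopback_ip}:{local_port}"


# With the loopback-only code each node gets a private 127.0.0.1 URL and the
# ranks never rendezvous; with a shared address every node resolves the same URL:
print(build_init_method(nnodes=2, master_addr="172.16.98.223", master_port=29500))
# -> tcp://172.16.98.223:29500 on every node
```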


@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request adds documentation for using the multiprocessing backend in a multi-node distributed serving setup. My review focuses on the clarity of the new documentation and the correctness of the related code changes. I've found some areas for improvement in the documentation examples to enhance clarity and consistency. More importantly, I've identified a critical issue in the MultiprocExecutor that appears to prevent the new multi-node functionality from working as intended. Please see my detailed comments for suggestions.

Comment on lines +127 to 130:

    # use the loopback address get_loopback_ip() for communication.
    distributed_init_method = get_distributed_init_method(
        get_loopback_ip(), get_open_port()
    )

critical

The distributed_init_method is hardcoded to use get_loopback_ip(), which is only suitable for single-node deployments. For multi-node deployments to work as described in the new documentation, this needs to use the master_addr and master_port from the parallel configuration.

Without this change, the multi-node feature with the multiprocessing backend will fail to initialize the distributed process group correctly across nodes.

Suggested change:

    Original:
        # use the loopback address get_loopback_ip() for communication.
        distributed_init_method = get_distributed_init_method(
            get_loopback_ip(), get_open_port()
        )

    Suggested:
        if self.parallel_config.nnodes > 1:
            distributed_init_method = get_distributed_init_method(
                self.parallel_config.master_addr, self.parallel_config.master_port)
        else:
            # use the loopback address get_loopback_ip() for communication.
            distributed_init_method = get_distributed_init_method(
                get_loopback_ip(), get_open_port())

Comment on lines +140 to +142:

    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 0 \
        --master-addr 172.16.98.223

high

For consistency with other examples in this document and for better clarity, I suggest using the long-form arguments and a placeholder for the IP address instead of a hardcoded one.

Suggested change:

    Original:
        vllm serve /path/to/the/model/in/the/container \
            -tp=8 -pp=2 --nnodes 2 --node-rank 0 \
            --master-addr 172.16.98.223

    Suggested:
        vllm serve /path/to/the/model/in/the/container \
            --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2 --node-rank 0 \
            --master-addr <HEAD_NODE_IP>
Comment on lines +148 to +150:

    vllm serve /path/to/the/model/in/the/container \
        -tp=8 -pp=2 --nnodes 2 --node-rank 1 \
        --master-addr 172.16.98.223 --headless

high

For consistency with other examples in this document and for better clarity, I suggest using the long-form arguments and a placeholder for the IP address instead of a hardcoded one.

Suggested change:

    Original:
        vllm serve /path/to/the/model/in/the/container \
            -tp=8 -pp=2 --nnodes 2 --node-rank 1 \
            --master-addr 172.16.98.223 --headless

    Suggested:
        vllm serve /path/to/the/model/in/the/container \
            --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2 --node-rank 1 \
            --master-addr <HEAD_NODE_IP> --headless
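Taken together, the two suggested snippets would give the docs a head/worker launch sequence along these lines (a sketch that simply combines the suggestions above; <HEAD_NODE_IP> remains a placeholder and the model path is kept from the original example):

```bash
# Node 0 (head): runs the API server and hosts the master address.
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 8 --pipeline-parallel-size 2 \
    --nnodes 2 --node-rank 0 --master-addr <HEAD_NODE_IP>

# Node 1 (worker): joins the same group headlessly (no API server of its own).
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 8 --pipeline-parallel-size 2 \
    --nnodes 2 --node-rank 1 --master-addr <HEAD_NODE_IP> --headless
```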

@njhill (Member) commented Dec 11, 2025

Thank you @Isotr0py! cc @luccafong
