24 changes: 23 additions & 1 deletion docs/serving/parallelism_scaling.md
@@ -62,7 +62,7 @@ If a single node lacks sufficient GPUs to hold the model, deploy vLLM across multiple nodes.

### What is Ray?

- Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments require Ray as the runtime engine.
+ Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments can use Ray as the runtime engine.

vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens.
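
For readers unfamiliar with Ray, here is a minimal, generic sketch of remote task execution (not part of the original docs; it assumes only that the `ray` package is installed):

```python
# Not vLLM-specific: a minimal Ray sketch showing how remote tasks are scheduled
# across a cluster. ray.init() connects to an existing cluster if one is
# configured, otherwise it starts a local one.
import ray

ray.init()

@ray.remote
def square(x: int) -> int:
    # Ray runs this function on whichever worker it schedules the task to.
    return x * x

# Dispatch four tasks across the available workers and gather the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```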

@@ -130,6 +130,28 @@ vllm serve /path/to/the/model/in/the/container \
--distributed-executor-backend ray
```

### Running vLLM with multiprocessing

Besides Ray, multi-node vLLM deployments can also use `multiprocessing` as the runtime engine. Here is an example of deploying a model across 2 nodes (8 GPUs per node) with `tp_size=8` and `pp_size=2`.

Choose one node as the head node and run:

```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 --pipeline-parallel-size 2 \
--nnodes 2 --node-rank 0 \
--master-addr <HEAD_NODE_IP>
```

On the other worker node, run:

```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 --pipeline-parallel-size 2 \
--nnodes 2 --node-rank 1 \
--master-addr <HEAD_NODE_IP> --headless
```
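
Once both commands are running, the head node should expose vLLM's OpenAI-compatible API. A quick smoke test (not part of the original docs; it assumes the default port 8000 and the `openai` Python client):

```python
# Not from the original docs: verify that the two-node deployment came up.
# Assumes the server listens on the default port 8000 on the head node.
from openai import OpenAI

client = OpenAI(base_url="http://<HEAD_NODE_IP>:8000/v1", api_key="EMPTY")

# List the served models; the model path passed to `vllm serve` should appear here.
print(client.models.list())
```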

## Optimizing network communication for tensor parallelism

Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
4 changes: 1 addition & 3 deletions vllm/v1/executor/multiproc_executor.py
@@ -124,9 +124,7 @@ def _init_executor(self) -> None:
        # Set multiprocessing envs
        set_multiprocessing_worker_envs()

-        # Multiprocessing-based executor does not support multi-node setting.
-        # Since it only works for single node, we can use the loopback address
-        # get_loopback_ip() for communication.
+        # use the loopback address get_loopback_ip() for communication.
        distributed_init_method = get_distributed_init_method(
            get_loopback_ip(), get_open_port()
Comment on lines +127 to 129

P1: Fix multi-node init address for multiprocessing backend

Multi-node serving with the multiprocessing executor still cannot work because the distributed process group is initialized with get_distributed_init_method(get_loopback_ip(), get_open_port()), forcing every node to bind to 127.0.0.1 and a local port instead of the configured master_addr. When following the new multi-node docs (nnodes>1, differing node_ranks), each node forms its own local group and torch.distributed.init_process_group never connects across nodes, so startup will hang/fail. The init method needs to use the shared master address/port for multi-node runs.

        )
Comment on lines +127 to 130

critical

The distributed_init_method is hardcoded to use get_loopback_ip(), which is only suitable for single-node deployments. For multi-node deployments to work as described in the new documentation, this needs to use the master_addr and master_port from the parallel configuration.

Without this change, the multi-node feature with the multiprocessing backend will fail to initialize the distributed process group correctly across nodes.

Suggested change
-        # use the loopback address get_loopback_ip() for communication.
-        distributed_init_method = get_distributed_init_method(
-            get_loopback_ip(), get_open_port()
-        )
+        if self.parallel_config.nnodes > 1:
+            distributed_init_method = get_distributed_init_method(
+                self.parallel_config.master_addr, self.parallel_config.master_port)
+        else:
+            # use the loopback address get_loopback_ip() for communication.
+            distributed_init_method = get_distributed_init_method(
+                get_loopback_ip(), get_open_port())
