From 3ace5d1bfdc33682b564766157569e2e8e29d774 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Fri, 14 Jun 2024 16:56:23 -0700
Subject: [PATCH 1/5] add crash tips

---
 docs/source/getting_started/debugging.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index ff37f4e628692..76f3681e6a81c 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -24,6 +24,8 @@ If you have already taken care of the above issues, but the vLLM instance still
 
 With more logging, hopefully you can find the root cause of the issue.
 
+If it crashes and the error trace shows somewhere around ``self.graph.replay()`` in ``vllm/worker/model_runner.py``, it is a CUDA error inside the CUDA graph. To locate the exact CUDA operation that causes the error, you can add ``--enforce-eager`` to the command line, or pass ``enforce_eager=True`` to the ``LLM`` class, to disable the CUDA graph optimization; with CUDA graphs disabled, the error trace points directly at the failing operation.
+
 Here are some common issues that can cause hangs:
 
 - **Incorrect network setup**: The vLLM instance cannot get the correct IP address. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``.

From 18cd460f6136cee8f764f2c0e2d8a70786fee596 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Fri, 14 Jun 2024 17:13:18 -0700
Subject: [PATCH 2/5] add multi-node

---
 docs/source/getting_started/debugging.rst | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index 76f3681e6a81c..f9d49bc3830a9 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -35,6 +35,9 @@ Here are some common issues that can cause hangs:
 
     # save it as `test.py` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
     # adjust `--nproc-per-node` to the number of GPUs you want to use.
+    # for multi-node test, run it with `NCCL_DEBUG=TRACE MASTER_IP=xxx.xxx.xxx.xxx torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_IP test.py` in every node.
+    # adjust `--nnodes` and `--nproc-per-node` to the number of nodes and GPUs you want to use.
+    # also make sure `MASTER_IP` is the correct IP address of the master node, and it is reachable from all nodes.
     import torch
     import torch.distributed as dist
     dist.init_process_group(backend="nccl")
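As an aside to the ``enforce_eager`` tip added in patch 1, here is a minimal sketch of how the flag might be passed to the ``LLM`` class to disable CUDA graph capture while debugging. This is an illustration only, not part of the patches; the model name and prompt below are placeholders.

.. code-block:: python

    # Minimal sketch (assumption, not part of the patch series): run a model with
    # CUDA graphs disabled so a failing CUDA kernel shows up directly in the traceback.
    from vllm import LLM, SamplingParams

    # "facebook/opt-125m" is only a placeholder model for illustration.
    llm = LLM(model="facebook/opt-125m", enforce_eager=True)  # skip CUDA graph capture

    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)

For the server command line, the equivalent is the ``--enforce-eager`` flag mentioned in the patch.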
From 998b43f6a69c72c9d1c271382eea73dc353e3772 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Sun, 16 Jun 2024 12:48:27 -0700
Subject: [PATCH 3/5] add multi-node debugging tips

---
 docs/source/getting_started/debugging.rst | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index f9d49bc3830a9..7d42f24161fca 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -33,18 +33,23 @@ Here are some common issues that can cause hangs:
 
 .. code-block:: python
 
-    # save it as `test.py` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
-    # adjust `--nproc-per-node` to the number of GPUs you want to use.
-    # for multi-node test, run it with `NCCL_DEBUG=TRACE MASTER_IP=xxx.xxx.xxx.xxx torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_IP test.py` in every node.
-    # adjust `--nnodes` and `--nproc-per-node` to the number of nodes and GPUs you want to use.
-    # also make sure `MASTER_IP` is the correct IP address of the master node, and it is reachable from all nodes.
     import torch
     import torch.distributed as dist
     dist.init_process_group(backend="nccl")
-    data = torch.FloatTensor([1,] * 128).to(f"cuda:{dist.get_rank()}")
+    local_rank = dist.get_rank() % torch.cuda.device_count()
+    data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
     dist.all_reduce(data, op=dist.ReduceOp.SUM)
     torch.cuda.synchronize()
     value = data.mean().item()
     assert value == dist.get_world_size()
 
+.. tip::
+
+    Save the script as ``test.py``. If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust `--nproc-per-node` to the number of GPUs you want to use.
+
+    If you are testing with multiple nodes, run it with the following command: ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
+    - is the correct IP address of the master node
+    - is reachable from all nodes
+    - is set before running the script.
+
 If the problem persists, feel free to `open an issue on GitHub `_, with a detailed description of the issue, your environment, and the logs.

From 6ecf05d4ff02d6e06ca335a0ee7ad44a2d9adcc8 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Sun, 16 Jun 2024 12:56:04 -0700
Subject: [PATCH 4/5] fix format

---
 docs/source/getting_started/debugging.rst | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index 7d42f24161fca..2d6e3c5658305 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -45,9 +45,12 @@ Here are some common issues that can cause hangs:
 
 .. tip::
 
-    Save the script as ``test.py``. If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust `--nproc-per-node` to the number of GPUs you want to use.
+    Save the script as ``test.py``.
+
+    If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust `--nproc-per-node` to the number of GPUs you want to use.
 
     If you are testing with multiple nodes, run it with the following command: ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
+
     - is the correct IP address of the master node
     - is reachable from all nodes
     - is set before running the script.
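As a supplement to the sanity-check script in patch 3, here is a hedged sketch (an assumption, not part of the patch series) that prints one diagnostic line per process, which can help confirm that every node and every GPU actually joined the rendezvous before the all-reduce runs. Launch it with the same ``torchrun`` commands described in the tip.

.. code-block:: python

    # Sketch only (assumption): per-rank diagnostics for the NCCL sanity check.
    import os
    import socket

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # One line per process: hostname, global rank, local rank, and the rendezvous address
    # (torchrun exports MASTER_ADDR / MASTER_PORT to every worker).
    print(
        f"host={socket.gethostname()} "
        f"rank={dist.get_rank()}/{dist.get_world_size()} "
        f"local_rank={local_rank} "
        f"master={os.environ.get('MASTER_ADDR')}:{os.environ.get('MASTER_PORT')}"
    )

    data = torch.FloatTensor([1.0] * 128).to(f"cuda:{local_rank}")
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    assert data.mean().item() == dist.get_world_size()
    dist.destroy_process_group()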
From 85a3bf2e6ebf2f36eb84890e1a9f5c807e4ff118 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Sun, 16 Jun 2024 13:01:08 -0700
Subject: [PATCH 5/5] adjust format

---
 docs/source/getting_started/debugging.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index 2d6e3c5658305..a22bba1478abb 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -47,9 +47,9 @@ Here are some common issues that can cause hangs:
 
     Save the script as ``test.py``.
 
-    If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust `--nproc-per-node` to the number of GPUs you want to use.
+    If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust ``--nproc-per-node`` to the number of GPUs you want to use.
 
-    If you are testing with multiple nodes, run it with the following command: ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
+    If you are testing with multiple nodes, run it with ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
 
     - is the correct IP address of the master node
     - is reachable from all nodes
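Finally, to illustrate the ``MASTER_ADDR`` requirements listed in the tip (correct, reachable, and set before launching), here is a small pre-flight sketch you could run on each node. It is an assumption for illustration, not something the patches prescribe; it only checks that the variable is set and resolves, while actual reachability is confirmed once the ``torchrun`` rendezvous succeeds.

.. code-block:: python

    # Sketch only (assumption): check MASTER_ADDR before launching torchrun.
    import os
    import socket
    import sys

    addr = os.environ.get("MASTER_ADDR")
    if not addr:
        sys.exit("MASTER_ADDR is not set; export it before running torchrun")

    try:
        # Resolve the name/IP; this catches typos, but not firewall or routing issues.
        resolved = socket.getaddrinfo(addr, None)[0][4][0]
    except socket.gaierror as exc:
        sys.exit(f"MASTER_ADDR={addr} does not resolve: {exc}")

    print(f"MASTER_ADDR={addr} resolves to {resolved}")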