From 3ace5d1bfdc33682b564766157569e2e8e29d774 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Fri, 14 Jun 2024 16:56:23 -0700
Subject: [PATCH 1/5] add crash tips

---
 docs/source/getting_started/debugging.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index ff37f4e628692..76f3681e6a81c 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -24,6 +24,8 @@ If you have already taken care of the above issues, but the vLLM instance still
 
 With more logging, hopefully you can find the root cause of the issue.
 
+If it crashes and the error trace shows somewhere around ``self.graph.replay()`` in ``vllm/worker/model_runner.py``, it is a CUDA error inside the CUDA graph. To locate the exact CUDA operation that causes the error, you can add ``--enforce-eager`` to the command line, or pass ``enforce_eager=True`` to the ``LLM`` class, to disable the CUDA graph optimization; with CUDA graphs disabled, the error trace points directly at the failing operation.
+
 Here are some common issues that can cause hangs:
 
 - **Incorrect network setup**: The vLLM instance cannot get the correct IP address. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``.

From 18cd460f6136cee8f764f2c0e2d8a70786fee596 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Fri, 14 Jun 2024 17:13:18 -0700
Subject: [PATCH 2/5] add multi-node

---
 docs/source/getting_started/debugging.rst | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index 76f3681e6a81c..f9d49bc3830a9 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -35,6 +35,9 @@ Here are some common issues that can cause hangs:
 
     # save it as `test.py` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
     # adjust `--nproc-per-node` to the number of GPUs you want to use.
+    # for multi-node test, run it with `NCCL_DEBUG=TRACE MASTER_IP=xxx.xxx.xxx.xxx torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_IP test.py` in every node.
+    # adjust `--nnodes` and `--nproc-per-node` to the number of nodes and GPUs you want to use.
+    # also make sure `MASTER_IP` is the correct IP address of the master node, and it is reachable from all nodes.
     import torch
     import torch.distributed as dist
     dist.init_process_group(backend="nccl")
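As an aside to the ``enforce_eager`` tip added in patch 1, here is a minimal sketch of how the flag might be passed to the ``LLM`` class to disable CUDA graph capture while debugging. This is an illustration only, not part of the patches; the model name and prompt below are placeholders.

.. code-block:: python

    # Minimal sketch (assumption, not part of the patch series): run a model with
    # CUDA graphs disabled so a failing CUDA kernel shows up directly in the traceback.
    from vllm import LLM, SamplingParams

    # "facebook/opt-125m" is only a placeholder model for illustration.
    llm = LLM(model="facebook/opt-125m", enforce_eager=True)  # skip CUDA graph capture

    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)

For the server command line, the equivalent is the ``--enforce-eager`` flag mentioned in the patch.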
From 998b43f6a69c72c9d1c271382eea73dc353e3772 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Sun, 16 Jun 2024 12:48:27 -0700
Subject: [PATCH 3/5] add multi-node debugging tips

---
 docs/source/getting_started/debugging.rst | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index f9d49bc3830a9..7d42f24161fca 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -33,18 +33,23 @@ Here are some common issues that can cause hangs:
 
 .. code-block:: python
 
-    # save it as `test.py` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
-    # adjust `--nproc-per-node` to the number of GPUs you want to use.
-    # for multi-node test, run it with `NCCL_DEBUG=TRACE MASTER_IP=xxx.xxx.xxx.xxx torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_IP test.py` in every node.
-    # adjust `--nnodes` and `--nproc-per-node` to the number of nodes and GPUs you want to use.
-    # also make sure `MASTER_IP` is the correct IP address of the master node, and it is reachable from all nodes.
     import torch
     import torch.distributed as dist
     dist.init_process_group(backend="nccl")
-    data = torch.FloatTensor([1,] * 128).to(f"cuda:{dist.get_rank()}")
+    local_rank = dist.get_rank() % torch.cuda.device_count()
+    data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
     dist.all_reduce(data, op=dist.ReduceOp.SUM)
     torch.cuda.synchronize()
     value = data.mean().item()
     assert value == dist.get_world_size()
 
+.. tip::
+
+    Save the script as ``test.py``. If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust `--nproc-per-node` to the number of GPUs you want to use.
+
+    If you are testing with multiple nodes, run it with the following command: ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
+    - is the correct IP address of the master node
+    - is reachable from all nodes
+    - is set before running the script.
+
 If the problem persists, feel free to `open an issue on GitHub `_, with a detailed description of the issue, your environment, and the logs.

From 6ecf05d4ff02d6e06ca335a0ee7ad44a2d9adcc8 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Sun, 16 Jun 2024 12:56:04 -0700
Subject: [PATCH 4/5] fix format

---
 docs/source/getting_started/debugging.rst | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index 7d42f24161fca..2d6e3c5658305 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -45,9 +45,12 @@ Here are some common issues that can cause hangs:
 
 .. tip::
 
-    Save the script as ``test.py``. If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust `--nproc-per-node` to the number of GPUs you want to use.
+    Save the script as ``test.py``.
+
+    If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust `--nproc-per-node` to the number of GPUs you want to use.
 
     If you are testing with multiple nodes, run it with the following command: ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
+
     - is the correct IP address of the master node
     - is reachable from all nodes
     - is set before running the script.
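As a supplement to the sanity-check script in patch 3, here is a hedged sketch (an assumption, not part of the patch series) that prints one diagnostic line per process, which can help confirm that every node and every GPU actually joined the rendezvous before the all-reduce runs. Launch it with the same ``torchrun`` commands described in the tip.

.. code-block:: python

    # Sketch only (assumption): per-rank diagnostics for the NCCL sanity check.
    import os
    import socket

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # One line per process: hostname, global rank, local rank, and the rendezvous address
    # (torchrun exports MASTER_ADDR / MASTER_PORT to every worker).
    print(
        f"host={socket.gethostname()} "
        f"rank={dist.get_rank()}/{dist.get_world_size()} "
        f"local_rank={local_rank} "
        f"master={os.environ.get('MASTER_ADDR')}:{os.environ.get('MASTER_PORT')}"
    )

    data = torch.FloatTensor([1.0] * 128).to(f"cuda:{local_rank}")
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    assert data.mean().item() == dist.get_world_size()
    dist.destroy_process_group()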
From 85a3bf2e6ebf2f36eb84890e1a9f5c807e4ff118 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Sun, 16 Jun 2024 13:01:08 -0700
Subject: [PATCH 5/5] adjust format

---
 docs/source/getting_started/debugging.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index 2d6e3c5658305..a22bba1478abb 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -47,9 +47,9 @@ Here are some common issues that can cause hangs:
 
     Save the script as ``test.py``.
 
-    If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust `--nproc-per-node` to the number of GPUs you want to use.
+    If you are testing on a single node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``. Adjust ``--nproc-per-node`` to the number of GPUs you want to use.
 
-    If you are testing with multiple nodes, run it with the following command: ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
+    If you are testing with multiple nodes, run it with ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
 
     - is the correct IP address of the master node
     - is reachable from all nodes
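Finally, to illustrate the ``MASTER_ADDR`` requirements listed in the tip (correct, reachable, and set before launching), here is a small pre-flight sketch you could run on each node. It is an assumption for illustration, not something the patches prescribe; it only checks that the variable is set and resolves, while actual reachability is confirmed once the ``torchrun`` rendezvous succeeds.

.. code-block:: python

    # Sketch only (assumption): check MASTER_ADDR before launching torchrun.
    import os
    import socket
    import sys

    addr = os.environ.get("MASTER_ADDR")
    if not addr:
        sys.exit("MASTER_ADDR is not set; export it before running torchrun")

    try:
        # Resolve the name/IP; this catches typos, but not firewall or routing issues.
        resolved = socket.getaddrinfo(addr, None)[0][4][0]
    except socket.gaierror as exc:
        sys.exit(f"MASTER_ADDR={addr} does not resolve: {exc}")

    print(f"MASTER_ADDR={addr} resolves to {resolved}")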