(2023-10-04//conda_envs/vLLM_dev) sandeep@x3005c0s37b1n0:/vLLM-Examples> python -m vllm.entrypoints.api_server --model gpt2 --trust-remote-code --tensor-parallel-size 6 --host localhost
/conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2024-03-30 08:27:13,261 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 10.201.1.236:6379...
2024-03-30 08:27:13,265 INFO worker.py:1752 -- Connected to Ray cluster.
INFO 03-30 08:27:14 llm_engine.py:75] Initializing an LLM engine (v0.3.3) with config: model='gpt2', tokenizer='gpt2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=6, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
(pid=26582) /conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=26582)   warnings.warn(
(pid=26945) /conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(pid=26945)   warnings.warn( [repeated 2x across cluster]
(pid=23971, ip=10.201.1.205) /conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead. [repeated 2x across cluster]
(pid=23971, ip=10.201.1.205)   warnings.warn( [repeated 2x across cluster]
(pid=24044, ip=10.201.1.205) /conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=24044, ip=10.201.1.205)   warnings.warn(
(RayWorkerVllm pid=23971, ip=10.201.1.205) INFO 03-30 08:27:49 selector.py:16] Using FlashAttention backend.
INFO 03-30 08:27:49 selector.py:16] Using FlashAttention backend.
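The FutureWarning above comes from Transformers' cache handling and is unrelated to the failure later in this run; the Ray banner likewise only explains log deduplication. A minimal sketch of the two environment tweaks those messages suggest, assuming a hypothetical cache location /path/to/hf_cache (substitute your own), set before launching the server:

    export HF_HOME=/path/to/hf_cache   # replaces the deprecated TRANSFORMERS_CACHE
    export RAY_DEDUP_LOGS=0            # optional: show every worker's lines instead of "[repeated Nx across cluster]" summaries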
(RayWorkerVllm pid=26741) INFO 03-30 08:27:50 pynccl_utils.py:44] vLLM is using nccl==2.18.3
INFO 03-30 08:27:51 pynccl_utils.py:44] vLLM is using nccl==2.18.3
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Bootstrap : Using bond0:10.140.57.87<0>
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Bootstrap : Using bond0:10.140.57.87<0>
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
x3005c0s37b1n0:24760:24760 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.18.6+cuda11.8
x3005c0s37b1n0:24760:27279 [0] NCCL INFO NET/IB : No device found.
x3005c0s37b1n0:24760:27279 [0] NCCL INFO NET/Socket : Using [0]bond0:10.140.57.87<0> [1]hsn1:10.201.1.242<0> [2]hsn0:10.201.1.236<0>
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Using network Socket
x3005c0s37b1n0:24760:27279 [0] NCCL INFO comm 0x9ba6ee0 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0x8bf96ca31aef0128 - Init START
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,ff000000
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 01/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 03/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->4 [3] -1/-1/-1->0->1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO P2P Chunksize set to 131072
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 03/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Connected all rings
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Connected all trees
x3005c0s37b1n0:24760:27279 [0] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
x3005c0s37b1n0:24760:27279 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x3005c0s37b1n0:24760:27279 [0] NCCL INFO comm 0x9ba6ee0 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0x8bf96ca31aef0128 - Init COMPLETE
x3005c0s37b1n0:24760:24760 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.18.3+cuda11.0
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/IB : No device found.
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Socket : Using [0]bond0:10.140.57.87<0> [1]hsn1:10.201.1.242<0> [2]hsn0:10.201.1.236<0>
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Using network Socket
x3005c0s37b1n0:24760:24760 [0] NCCL INFO comm 0xebf3f60 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe82316517410d01 - Init START
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,ff000000
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 01/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 03/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->4 [3] -1/-1/-1->0->1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO P2P Chunksize set to 131072
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 03/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Connected all rings
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Connected all trees
x3005c0s37b1n0:24760:24760 [0] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
x3005c0s37b1n0:24760:24760 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x3005c0s37b1n0:24760:24760 [0] NCCL INFO comm 0xebf3f60 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe82316517410d01 - Init COMPLETE
WARNING 03-30 08:27:53 custom_all_reduce.py:149] Cannot test GPU P2P because not all GPUs are visible to the current process. This might be the case if 'CUDA_VISIBLE_DEVICES' is set.
WARNING 03-30 08:27:53 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=26741) WARNING 03-30 08:27:53 custom_all_reduce.py:149] Cannot test GPU P2P because not all GPUs are visible to the current process. This might be the case if 'CUDA_VISIBLE_DEVICES' is set.
(RayWorkerVllm pid=26741) WARNING 03-30 08:27:53 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
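The custom_all_reduce warnings above do not abort the run; vLLM simply falls back to its NCCL all-reduce path. To silence them as the message suggests, disable_custom_all_reduce can be set explicitly at launch. A sketch, assuming the installed vLLM 0.3.3 CLI exposes the engine argument as --disable-custom-all-reduce (otherwise set disable_custom_all_reduce=True on EngineArgs):

    python -m vllm.entrypoints.api_server --model gpt2 --trust-remote-code --tensor-parallel-size 6 --host localhost --disable-custom-all-reduce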
INFO 03-30 08:27:54 weight_utils.py:177] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=26741) INFO 03-30 08:27:54 weight_utils.py:177] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=23971, ip=10.201.1.205) INFO 03-30 08:27:57 model_runner.py:104] Loading model weights took 0.0402 GB
(RayWorkerVllm pid=27081) INFO 03-30 08:27:49 selector.py:16] Using FlashAttention backend. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) INFO 03-30 08:27:51 pynccl_utils.py:44] vLLM is using nccl==2.18.3 [repeated 4x across cluster]
INFO 03-30 08:27:57 model_runner.py:104] Loading model weights took 0.0402 GB
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Using network Socket
x3005c0s37b1n0:24760:27442 [0] NCCL INFO comm 0xfb9e1b0 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe8e0e6498a8de01a - Init START
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,ff000000
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 01/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 03/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->4 [3] -1/-1/-1->0->1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO P2P Chunksize set to 131072
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 03/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Connected all rings
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Connected all trees
x3005c0s37b1n0:24760:27442 [0] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
x3005c0s37b1n0:24760:27442 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x3005c0s37b1n0:24760:27442 [0] NCCL INFO comm 0xfb9e1b0 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe8e0e6498a8de01a - Init COMPLETE
x3005c0s37b1n0:24760:27454 [0] NCCL INFO Channel 02/1 : 5[1] -> 0[0] [receive] via NET/Socket/2/Shared
x3005c0s37b1n0:24760:27454 [0] NCCL INFO Channel 03/1 : 5[1] -> 0[0] [receive] via NET/Socket/1/Shared
x3005c0s37b1n0:24760:27454 [0] NCCL INFO Channel 02/1 : 4[0] -> 0[0] [receive] via NET/Socket/2/Shared
x3005c0s37b1n0:24760:27454 [0] NCCL INFO Channel 03/1 : 4[0] -> 0[0] [receive] via NET/Socket/1/Shared
INFO 03-30 08:28:01 ray_gpu_executor.py:240] # GPU blocks: 375662, # CPU blocks: 43690
INFO 03-30 08:28:03 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-30 08:28:03 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=26741) INFO 03-30 08:28:03 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=26741) INFO 03-30 08:28:03 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=24044, ip=10.201.1.205) WARNING 03-30 08:27:53 custom_all_reduce.py:149] Cannot test GPU P2P because not all GPUs are visible to the current process. This might be the case if 'CUDA_VISIBLE_DEVICES' is set. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) WARNING 03-30 08:27:53 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) INFO 03-30 08:27:54 weight_utils.py:177] Using model weights format ['*.safetensors'] [repeated 4x across cluster]
(RayWorkerVllm pid=27081) INFO 03-30 08:27:57 model_runner.py:104] Loading model weights took 0.0402 GB [repeated 4x across cluster]
x3005c0s37b1n0:24760:24760 [0] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.
x3005c0s37b1n0:24760:24760 [0] NCCL INFO enqueue.cc:1548 -> 5
x3005c0s37b1n0:24760:24760 [0] NCCL INFO enqueue.cc:1589 -> 5
x3005c0s37b1n0:24760:24760 [0] NCCL INFO enqueue.cc:1594 -> 5
Traceback (most recent call last):
  File "/conda_envs/vLLM_dev/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/conda_envs/vLLM_dev/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/vllm/vllm/entrypoints/api_server.py", line 105, in <module>
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44] Error executing method warm_up_model. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/engine/ray_utils.py", line 37, in execute_method
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return executor(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/worker.py", line 167, in warm_up_model
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     self.model_runner.capture_model(self.gpu_cache)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return func(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/model_runner.py", line 854, in capture_model
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     graph_runner.capture(
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/model_runner.py", line 921, in capture
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     hidden_states = self.model(
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return forward_call(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/models/gpt2.py", line 225, in forward
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     hidden_states = self.transformer(input_ids, positions, kv_caches,
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return forward_call(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/models/gpt2.py", line 191, in forward
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     inputs_embeds = self.wte(input_ids)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return forward_call(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     output = tensor_model_parallel_all_reduce(output_parallel)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     pynccl_utils.all_reduce(input_)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/pynccl_utils.py", line 54, in all_reduce
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     comm.all_reduce(input_, op)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/pynccl.py", line 257, in all_reduce
    engine = AsyncLLMEngine.from_engine_args(
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     assert result == 0
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44] AssertionError
  File "/vllm/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
    engine = cls(
  File "/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/vllm/vllm/engine/async_llm_engine.py", line 422, in _init_engine
    return engine_class(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/vllm/vllm/executor/ray_gpu_executor.py", line 65, in __init__
    self._init_cache()
  File "/vllm/vllm/executor/ray_gpu_executor.py", line 253, in _init_cache
    self._run_workers("warm_up_model")
  File "/vllm/vllm/executor/ray_gpu_executor.py", line 324, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/vllm/vllm/worker/worker.py", line 167, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/vllm/vllm/worker/model_runner.py", line 854, in capture_model
    graph_runner.capture(
  File "/vllm/vllm/worker/model_runner.py", line 921, in capture
    hidden_states = self.model(
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm/vllm/model_executor/models/gpt2.py", line 225, in forward
    hidden_states = self.transformer(input_ids, positions, kv_caches,
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm/vllm/model_executor/models/gpt2.py", line 191, in forward
    inputs_embeds = self.wte(input_ids)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward
    output = tensor_model_parallel_all_reduce(output_parallel)
  File "/vllm/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce
    pynccl_utils.all_reduce(input_)
  File "/vllm/vllm/model_executor/parallel_utils/pynccl_utils.py", line 54, in all_reduce
    comm.all_reduce(input_, op)
  File "/vllm/vllm/model_executor/parallel_utils/pynccl.py", line 257, in all_reduce
    assert result == 0
AssertionError
(RayWorkerVllm pid=24044, ip=10.201.1.205) INFO 03-30 08:28:03 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) INFO 03-30 08:28:03 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44] Error executing method warm_up_model. This might cause deadlock in distributed execution. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44] Traceback (most recent call last): [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/engine/ray_utils.py", line 37, in execute_method [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     return executor(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/worker.py", line 167, in warm_up_model [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     self.model_runner.capture_model(self.gpu_cache) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     return func(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/model_runner.py", line 854, in capture_model [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     graph_runner.capture( [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/model_runner.py", line 921, in capture [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     hidden_states = self.model( [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl [repeated 12x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     return self._call_impl(*args, **kwargs) [repeated 12x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl [repeated 12x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     return forward_call(*args, **kwargs) [repeated 12x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/models/gpt2.py", line 191, in forward [repeated 8x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     hidden_states = self.transformer(input_ids, positions, kv_caches, [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     inputs_embeds = self.wte(input_ids) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     output = tensor_model_parallel_all_reduce(output_parallel) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     pynccl_utils.all_reduce(input_) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/pynccl_utils.py", line 54, in all_reduce [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     comm.all_reduce(input_, op) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/pynccl.py", line 257, in all_reduce [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     assert result == 0 [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44] AssertionError [repeated 4x across cluster]
x3005c0s37b1n0:24760:27286 [0] NCCL INFO [Service thread] Connection closed by localRank 0
x3005c0s37b1n0:24760:24760 [0] NCCL INFO comm 0x9ba6ee0 rank 0 nranks 6 cudaDev 0 busId 7000 - Abort COMPLETE
x3005c0s37b1n0:24760:24760 [0] NCCL INFO comm 0xebf3f60 rank 0 nranks 6 cudaDev 0 busId 7000 - Destroy COMPLETE