(2023-10-04//conda_envs/vLLM_dev) sandeep@x3005c0s37b1n0:/vLLM-Examples> python -m vllm.entrypoints.api_server --model gpt2 --trust-remote-code --tensor-parallel-size 6 --host localhost
/conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2024-03-30 08:27:13,261 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 10.201.1.236:6379...
2024-03-30 08:27:13,265 INFO worker.py:1752 -- Connected to Ray cluster.
INFO 03-30 08:27:14 llm_engine.py:75] Initializing an LLM engine (v0.3.3) with config: model='gpt2', tokenizer='gpt2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=6, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
(pid=26582) /conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=26582)   warnings.warn(
(pid=26945) /conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(pid=26945)   warnings.warn( [repeated 2x across cluster]
(pid=23971, ip=10.201.1.205) /conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead. [repeated 2x across cluster]
(pid=23971, ip=10.201.1.205)   warnings.warn( [repeated 2x across cluster]
(pid=24044, ip=10.201.1.205) /conda_envs/vLLM_dev/lib/python3.9/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=24044, ip=10.201.1.205)   warnings.warn(
(RayWorkerVllm pid=23971, ip=10.201.1.205) INFO 03-30 08:27:49 selector.py:16] Using FlashAttention backend.
INFO 03-30 08:27:49 selector.py:16] Using FlashAttention backend.
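The FutureWarning above comes from Transformers' cache handling and is unrelated to the failure later in this run; the Ray banner likewise only explains log deduplication. A minimal sketch of the two environment tweaks those messages suggest, assuming a hypothetical cache location /path/to/hf_cache (substitute your own), set before launching the server:

    export HF_HOME=/path/to/hf_cache   # replaces the deprecated TRANSFORMERS_CACHE
    export RAY_DEDUP_LOGS=0            # optional: show every worker's lines instead of "[repeated Nx across cluster]" summaries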
(RayWorkerVllm pid=26741) INFO 03-30 08:27:50 pynccl_utils.py:44] vLLM is using nccl==2.18.3
INFO 03-30 08:27:51 pynccl_utils.py:44] vLLM is using nccl==2.18.3
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Bootstrap : Using bond0:10.140.57.87<0>
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Bootstrap : Using bond0:10.140.57.87<0>
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
x3005c0s37b1n0:24760:24760 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.18.6+cuda11.8
x3005c0s37b1n0:24760:27279 [0] NCCL INFO NET/IB : No device found.
x3005c0s37b1n0:24760:27279 [0] NCCL INFO NET/Socket : Using [0]bond0:10.140.57.87<0> [1]hsn1:10.201.1.242<0> [2]hsn0:10.201.1.236<0>
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Using network Socket
x3005c0s37b1n0:24760:27279 [0] NCCL INFO comm 0x9ba6ee0 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0x8bf96ca31aef0128 - Init START
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,ff000000
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 01/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 03/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->4 [3] -1/-1/-1->0->1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO P2P Chunksize set to 131072
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 03/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Connected all rings
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 00/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Channel 02/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:27279 [0] NCCL INFO Connected all trees
x3005c0s37b1n0:24760:27279 [0] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
x3005c0s37b1n0:24760:27279 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x3005c0s37b1n0:24760:27279 [0] NCCL INFO comm 0x9ba6ee0 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0x8bf96ca31aef0128 - Init COMPLETE
x3005c0s37b1n0:24760:24760 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.18.3+cuda11.0
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/IB : No device found.
x3005c0s37b1n0:24760:24760 [0] NCCL INFO NET/Socket : Using [0]bond0:10.140.57.87<0> [1]hsn1:10.201.1.242<0> [2]hsn0:10.201.1.236<0>
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Using network Socket
x3005c0s37b1n0:24760:24760 [0] NCCL INFO comm 0xebf3f60 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe82316517410d01 - Init START
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,ff000000
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 01/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 03/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->4 [3] -1/-1/-1->0->1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO P2P Chunksize set to 131072
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 03/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Connected all rings
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 00/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Channel 02/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:24760 [0] NCCL INFO Connected all trees
x3005c0s37b1n0:24760:24760 [0] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
x3005c0s37b1n0:24760:24760 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x3005c0s37b1n0:24760:24760 [0] NCCL INFO comm 0xebf3f60 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe82316517410d01 - Init COMPLETE
WARNING 03-30 08:27:53 custom_all_reduce.py:149] Cannot test GPU P2P because not all GPUs are visible to the current process. This might be the case if 'CUDA_VISIBLE_DEVICES' is set.
WARNING 03-30 08:27:53 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=26741) WARNING 03-30 08:27:53 custom_all_reduce.py:149] Cannot test GPU P2P because not all GPUs are visible to the current process. This might be the case if 'CUDA_VISIBLE_DEVICES' is set.
(RayWorkerVllm pid=26741) WARNING 03-30 08:27:53 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
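The custom_all_reduce warnings above do not abort the run; vLLM simply falls back to its NCCL all-reduce path. To silence them as the message suggests, disable_custom_all_reduce can be set explicitly at launch. A sketch, assuming the installed vLLM 0.3.3 CLI exposes the engine argument as --disable-custom-all-reduce (otherwise set disable_custom_all_reduce=True on EngineArgs):

    python -m vllm.entrypoints.api_server --model gpt2 --trust-remote-code --tensor-parallel-size 6 --host localhost --disable-custom-all-reduce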
INFO 03-30 08:27:54 weight_utils.py:177] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=26741) INFO 03-30 08:27:54 weight_utils.py:177] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=23971, ip=10.201.1.205) INFO 03-30 08:27:57 model_runner.py:104] Loading model weights took 0.0402 GB
(RayWorkerVllm pid=27081) INFO 03-30 08:27:49 selector.py:16] Using FlashAttention backend. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) INFO 03-30 08:27:51 pynccl_utils.py:44] vLLM is using nccl==2.18.3 [repeated 4x across cluster]
INFO 03-30 08:27:57 model_runner.py:104] Loading model weights took 0.0402 GB
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Using network Socket
x3005c0s37b1n0:24760:27442 [0] NCCL INFO comm 0xfb9e1b0 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe8e0e6498a8de01a - Init START
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,ff000000
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 01/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/04 : 0 3 2 4 5 1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 03/04 : 0 5 4 3 2 1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->4 [3] -1/-1/-1->0->1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO P2P Chunksize set to 131072
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/0 : 0[0] -> 3[3] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 03/0 : 0[0] -> 5[1] [send] via NET/Socket/1
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Connected all rings
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/0 : 4[0] -> 0[0] [receive] via NET/Socket/2
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 00/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Channel 02/0 : 0[0] -> 4[0] [send] via NET/Socket/2
x3005c0s37b1n0:24760:27442 [0] NCCL INFO Connected all trees
x3005c0s37b1n0:24760:27442 [0] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
x3005c0s37b1n0:24760:27442 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x3005c0s37b1n0:24760:27442 [0] NCCL INFO comm 0xfb9e1b0 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe8e0e6498a8de01a - Init COMPLETE
x3005c0s37b1n0:24760:27454 [0] NCCL INFO Channel 02/1 : 5[1] -> 0[0] [receive] via NET/Socket/2/Shared
x3005c0s37b1n0:24760:27454 [0] NCCL INFO Channel 03/1 : 5[1] -> 0[0] [receive] via NET/Socket/1/Shared
x3005c0s37b1n0:24760:27454 [0] NCCL INFO Channel 02/1 : 4[0] -> 0[0] [receive] via NET/Socket/2/Shared
x3005c0s37b1n0:24760:27454 [0] NCCL INFO Channel 03/1 : 4[0] -> 0[0] [receive] via NET/Socket/1/Shared
INFO 03-30 08:28:01 ray_gpu_executor.py:240] # GPU blocks: 375662, # CPU blocks: 43690
INFO 03-30 08:28:03 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-30 08:28:03 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=26741) INFO 03-30 08:28:03 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=26741) INFO 03-30 08:28:03 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=24044, ip=10.201.1.205) WARNING 03-30 08:27:53 custom_all_reduce.py:149] Cannot test GPU P2P because not all GPUs are visible to the current process. This might be the case if 'CUDA_VISIBLE_DEVICES' is set. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) WARNING 03-30 08:27:53 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) INFO 03-30 08:27:54 weight_utils.py:177] Using model weights format ['*.safetensors'] [repeated 4x across cluster]
(RayWorkerVllm pid=27081) INFO 03-30 08:27:57 model_runner.py:104] Loading model weights took 0.0402 GB [repeated 4x across cluster]
x3005c0s37b1n0:24760:24760 [0] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.
x3005c0s37b1n0:24760:24760 [0] NCCL INFO enqueue.cc:1548 -> 5
x3005c0s37b1n0:24760:24760 [0] NCCL INFO enqueue.cc:1589 -> 5
x3005c0s37b1n0:24760:24760 [0] NCCL INFO enqueue.cc:1594 -> 5
Traceback (most recent call last):
  File "/conda_envs/vLLM_dev/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/conda_envs/vLLM_dev/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/vllm/vllm/entrypoints/api_server.py", line 105, in <module>
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44] Error executing method warm_up_model. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/engine/ray_utils.py", line 37, in execute_method
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return executor(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/worker.py", line 167, in warm_up_model
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     self.model_runner.capture_model(self.gpu_cache)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return func(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/model_runner.py", line 854, in capture_model
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     graph_runner.capture(
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/model_runner.py", line 921, in capture
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     hidden_states = self.model(
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return forward_call(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/models/gpt2.py", line 225, in forward
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     hidden_states = self.transformer(input_ids, positions, kv_caches,
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return forward_call(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/models/gpt2.py", line 191, in forward
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     inputs_embeds = self.wte(input_ids)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     return forward_call(*args, **kwargs)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     output = tensor_model_parallel_all_reduce(output_parallel)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     pynccl_utils.all_reduce(input_)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/pynccl_utils.py", line 54, in all_reduce
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     comm.all_reduce(input_, op)
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/pynccl.py", line 257, in all_reduce
    engine = AsyncLLMEngine.from_engine_args(
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44]     assert result == 0
(RayWorkerVllm pid=26741) ERROR 03-30 08:28:03 ray_utils.py:44] AssertionError
  File "/vllm/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
    engine = cls(
  File "/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/vllm/vllm/engine/async_llm_engine.py", line 422, in _init_engine
    return engine_class(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/vllm/vllm/executor/ray_gpu_executor.py", line 65, in __init__
    self._init_cache()
  File "/vllm/vllm/executor/ray_gpu_executor.py", line 253, in _init_cache
    self._run_workers("warm_up_model")
  File "/vllm/vllm/executor/ray_gpu_executor.py", line 324, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/vllm/vllm/worker/worker.py", line 167, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/vllm/vllm/worker/model_runner.py", line 854, in capture_model
    graph_runner.capture(
  File "/vllm/vllm/worker/model_runner.py", line 921, in capture
    hidden_states = self.model(
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm/vllm/model_executor/models/gpt2.py", line 225, in forward
    hidden_states = self.transformer(input_ids, positions, kv_caches,
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm/vllm/model_executor/models/gpt2.py", line 191, in forward
    inputs_embeds = self.wte(input_ids)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward
    output = tensor_model_parallel_all_reduce(output_parallel)
  File "/vllm/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce
    pynccl_utils.all_reduce(input_)
  File "/vllm/vllm/model_executor/parallel_utils/pynccl_utils.py", line 54, in all_reduce
    comm.all_reduce(input_, op)
  File "/vllm/vllm/model_executor/parallel_utils/pynccl.py", line 257, in all_reduce
    assert result == 0
AssertionError
(RayWorkerVllm pid=24044, ip=10.201.1.205) INFO 03-30 08:28:03 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) INFO 03-30 08:28:03 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44] Error executing method warm_up_model. This might cause deadlock in distributed execution. [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44] Traceback (most recent call last): [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/engine/ray_utils.py", line 37, in execute_method [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     return executor(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/worker.py", line 167, in warm_up_model [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     self.model_runner.capture_model(self.gpu_cache) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     return func(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/model_runner.py", line 854, in capture_model [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     graph_runner.capture( [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/worker/model_runner.py", line 921, in capture [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     hidden_states = self.model( [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl [repeated 12x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     return self._call_impl(*args, **kwargs) [repeated 12x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/conda_envs/vLLM_dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl [repeated 12x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     return forward_call(*args, **kwargs) [repeated 12x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/models/gpt2.py", line 191, in forward [repeated 8x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     hidden_states = self.transformer(input_ids, positions, kv_caches, [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     inputs_embeds = self.wte(input_ids) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     output = tensor_model_parallel_all_reduce(output_parallel) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     pynccl_utils.all_reduce(input_) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/pynccl_utils.py", line 54, in all_reduce [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     comm.all_reduce(input_, op) [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]   File "/vllm/vllm/model_executor/parallel_utils/pynccl.py", line 257, in all_reduce [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44]     assert result == 0 [repeated 4x across cluster]
(RayWorkerVllm pid=24044, ip=10.201.1.205) ERROR 03-30 08:28:03 ray_utils.py:44] AssertionError [repeated 4x across cluster]
x3005c0s37b1n0:24760:27286 [0] NCCL INFO [Service thread] Connection closed by localRank 0
x3005c0s37b1n0:24760:24760 [0] NCCL INFO comm 0x9ba6ee0 rank 0 nranks 6 cudaDev 0 busId 7000 - Abort COMPLETE
x3005c0s37b1n0:24760:24760 [0] NCCL INFO comm 0xebf3f60 rank 0 nranks 6 cudaDev 0 busId 7000 - Destroy COMPLETE