[Bug]: offline test, Process hangs without exiting when using cuda graph #4263
Comments
Can you try to run with |
I will try it with the latest vLLM.
@youkaichao after
2024-04-23 11:07:10.120712 Return from actor_id in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:502
2024-04-23 11:07:10.120731 Return from record_task_log_end in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:628
2024-04-23 11:07:10.120781 Call to get_serialization_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:660
2024-04-23 11:07:10.120801 Call to current_job_id in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:491
2024-04-23 11:07:10.120820 Return from current_job_id in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:494
2024-04-23 11:07:10.120844 Return from get_serialization_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:672
2024-04-23 11:07:10.120863 Call to serialize in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:482
2024-04-23 11:07:10.120886 Call to _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:433
2024-04-23 11:07:10.120924 Call to packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:30
2024-04-23 11:07:10.120950 Return from packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:36
2024-04-23 11:07:10.120979 Call to packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:30
2024-04-23 11:07:10.120999 Return from packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:36
2024-04-23 11:07:10.121018 Return from _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:478
2024-04-23 11:07:10.121037 Return from serialize in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:494
2024-04-23 11:07:10.121084 Call to _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2528
2024-04-23 11:07:10.121104 Call to _mode in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:3051
2024-04-23 11:07:10.121122 Return from _mode in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:3059
2024-04-23 11:07:10.121163 Return from _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2531
2024-04-23 11:07:10.121233 Call to disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:67
2024-04-23 11:07:10.121253 Call to _set_client_hook_status in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:38
2024-04-23 11:07:10.121273 Return from _set_client_hook_status in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:40
2024-04-23 11:07:10.121291 Return from disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:69
and the GPU memory is not released:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L20                     On  |   00000000:0E:00.0 Off |                    0 |
| N/A   38C    P0             76W /  350W |   37354MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L20                     On  |   00000000:0F:00.0 Off |                    0 |
| N/A   34C    P8             34W /  350W |      81MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L20                     On  |   00000000:10:00.0 Off |                    0 |
| N/A   35C    P8             34W /  350W |       9MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L20                     On  |   00000000:12:00.0 Off |                    0 |
| N/A   31C    P8             36W /  350W |      21MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
@youkaichao without cuda graph, the traced log is:
2024-04-23 10:57:19.506014 Call to serialize in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:482
2024-04-23 10:57:19.506037 Call to _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:433
2024-04-23 10:57:19.506076 Call to packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:30
2024-04-23 10:57:19.506103 Return from packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:36
2024-04-23 10:57:19.506132 Call to packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:30
2024-04-23 10:57:19.506162 Return from packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:36
2024-04-23 10:57:19.506182 Return from _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:478
2024-04-23 10:57:19.506203 Return from serialize in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:494
2024-04-23 10:57:19.506251 Call to _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2528
2024-04-23 10:57:19.506274 Call to _mode in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:3051
2024-04-23 10:57:19.506296 Return from _mode in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:3059
2024-04-23 10:57:19.506334 Return from _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2531
2024-04-23 10:57:19.506406 Call to disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:67
2024-04-23 10:57:19.506428 Call to _set_client_hook_status in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:38
2024-04-23 10:57:19.506448 Return from _set_client_hook_status in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:40
2024-04-23 10:57:19.506465 Return from disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:69
2024-04-23 10:57:20.126963 Call to sigterm_handler in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:872
2024-04-23 10:57:20.127050 Return from sigterm_handler in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:873
it will call
After trying #4278, the log is:
on3.10/site-packages/ray/_private/worker.py:2530
2024-04-23 11:32:40.659115 Return from _mode in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:3059 to _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2530
2024-04-23 11:32:40.659164 Return from _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2531 to __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/contextlib.py:142
2024-04-23 11:32:40.659244 Call to disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:67 from __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/contextlib.py:142
2024-04-23 11:32:40.659267 Call to _set_client_hook_status in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:38 from disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:69
2024-04-23 11:32:40.659291 Return from _set_client_hook_status in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:40 to disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:69
2024-04-23 11:32:40.659309 Return from disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:69 to __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/contextlib.py:142
It seems to hang here: __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/contextlib.py:142
Please provide more logs, at least from where the code is related to vLLM. All of the trace here is related to Ray.
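One way to pull the vLLM-related lines out of a long trace like the ones in this thread is sketched below. It is only an illustration, not part of vLLM or Ray: the function name `lines_touching_package` is hypothetical, and it assumes the line format pasted above, with packages installed under `site-packages/`.

```python
def lines_touching_package(lines, package="vllm"):
    """Keep trace lines whose callee or caller lives in site-packages/<package>/."""
    # Matching on the bare word "vllm" would keep every line in this thread,
    # because the conda environment itself is named "vllm" (.../envs/vllm/...).
    needle = f"/site-packages/{package}/"
    return [line.rstrip("\n") for line in lines if needle in line]


# Two lines copied from the traces in this thread: one Ray-only, one from vLLM.
sample = [
    "2024-04-23 11:38:48.131386 Return from actor_id in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:502 to record_task_log_end in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:623",
    "2024-04-23 11:38:48.130208 Call to get_tensor_model_parallel_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:216 from tensor_model_parallel_gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py:99",
]
kept = lines_touching_package(sample)
for line in kept:
    print(line)
```

The same function can be applied to an open file object for a saved trace, since it only iterates over lines.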
2024-04-23 11:38:48.121370 Return from is_initialized in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:950 to _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:976
2024-04-23 11:38:48.121388 Call to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:583 from _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:981
2024-04-23 11:38:48.121405 Call to default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.121423 Return from default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.121440 Return from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585 to _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:981
2024-04-23 11:38:48.121456 Return from _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:981 to get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1532
2024-04-23 11:38:48.121476 Return from get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1534 to gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2880
2024-04-23 11:38:48.121496 Call to _validate_output_list_for_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2834 from gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2881
2024-04-23 11:38:48.121515 Return from _validate_output_list_for_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2840 to gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2881
2024-04-23 11:38:48.121550 Call to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:583 from gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2888
2024-04-23 11:38:48.121568 Call to default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.121585 Return from default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.121601 Return from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585 to gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2888
2024-04-23 11:38:48.121628 Call to get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:762 from gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2892
2024-04-23 11:38:48.121645 Call to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:583 from get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:777
2024-04-23 11:38:48.121663 Call to default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.121681 Return from default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.121698 Return from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585 to get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:777
2024-04-23 11:38:48.121715 Call to pg_group_ranks in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:490 from get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:779
2024-04-23 11:38:48.121733 Return from pg_group_ranks in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:498 to get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:779
2024-04-23 11:38:48.121751 Call to pg_group_ranks in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:490 from get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:781
2024-04-23 11:38:48.121768 Return from pg_group_ranks in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:498 to get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:781
2024-04-23 11:38:48.121787 Return from get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:785 to gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2892
2024-04-23 11:38:48.130120 Return from gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2899 to wrapper in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:72
2024-04-23 11:38:48.130181 Return from wrapper in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:72 to tensor_model_parallel_gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py:95
2024-04-23 11:38:48.130208 Call to get_tensor_model_parallel_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:216 from tensor_model_parallel_gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py:99
2024-04-23 11:38:48.130229 Call to get_tensor_model_parallel_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:190 from get_tensor_model_parallel_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:218
2024-04-23 11:38:48.130251 Return from get_tensor_model_parallel_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:194 to get_tensor_model_parallel_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:218
2024-04-23 11:38:48.130282 Call to get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1512 from get_tensor_model_parallel_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:218
2024-04-23 11:38:48.130302 Call to _rank_not_in_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:747 from get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1529
2024-04-23 11:38:48.130324 Return from _rank_not_in_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:751 to get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1529
2024-04-23 11:38:48.130346 Call to _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:974 from get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1532
2024-04-23 11:38:48.130366 Call to is_initialized in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:948 from _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:976
2024-04-23 11:38:48.130386 Call to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:583 from is_initialized in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:950
2024-04-23 11:38:48.130405 Call to default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.130424 Return from default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.130445 Return from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585 to is_initialized in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:950
2024-04-23 11:38:48.130463 Return from is_initialized in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:950 to _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:976
2024-04-23 11:38:48.130482 Call to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:583 from _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:981
2024-04-23 11:38:48.130500 Call to default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.130519 Return from default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.130536 Return from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585 to _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:981
2024-04-23 11:38:48.130554 Return from _get_default_group in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:981 to get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1532
2024-04-23 11:38:48.130573 Call to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:583 from get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1533
2024-04-23 11:38:48.130598 Call to default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.130617 Return from default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.130635 Return from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585 to get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1533
2024-04-23 11:38:48.130659 Call to get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:762 from get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1536
2024-04-23 11:38:48.130679 Call to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:583 from get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:777
2024-04-23 11:38:48.130699 Call to default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.130718 Return from default_pg in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585
2024-04-23 11:38:48.130737 Return from WORLD in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:585 to get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:777
2024-04-23 11:38:48.130757 Call to pg_group_ranks in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:490 from get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:779
2024-04-23 11:38:48.130780 Return from pg_group_ranks in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:498 to get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:779
2024-04-23 11:38:48.130800 Call to pg_group_ranks in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:490 from get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:781
2024-04-23 11:38:48.130817 Return from pg_group_ranks in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:498 to get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:781
2024-04-23 11:38:48.130837 Return from get_group_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:785 to get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1536
2024-04-23 11:38:48.130856 Return from get_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1536 to get_tensor_model_parallel_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:218
2024-04-23 11:38:48.130873 Return from get_tensor_model_parallel_rank in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:218 to tensor_model_parallel_gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py:99
2024-04-23 11:38:48.130898 Return from tensor_model_parallel_gather in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py:103 to _get_logits in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/logits_processor.py:67
2024-04-23 11:38:48.130925 Return from _get_logits in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/logits_processor.py:71 to forward in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/logits_processor.py:51
2024-04-23 11:38:48.130947 Return from forward in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/logits_processor.py:59 to _call_impl in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520
2024-04-23 11:38:48.130970 Return from _call_impl in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520 to _wrapped_call_impl in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511
2024-04-23 11:38:48.130990 Return from _wrapped_call_impl in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511 to compute_logits in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:366
2024-04-23 11:38:48.131010 Return from compute_logits in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:368 to execute_model in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py:851
2024-04-23 11:38:48.131035 Return from execute_model in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py:855 to decorate_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py:115
2024-04-23 11:38:48.131078 Call to __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/autograd/grad_mode.py:271 from decorate_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py:114
2024-04-23 11:38:48.131106 Return from __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/autograd/grad_mode.py:272 to decorate_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py:114
2024-04-23 11:38:48.131128 Return from decorate_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py:114 to execute_model in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py:249
2024-04-23 11:38:48.131157 Return from execute_model in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py:254 to decorate_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py:115
2024-04-23 11:38:48.131180 Call to __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/autograd/grad_mode.py:271 from decorate_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py:114
2024-04-23 11:38:48.131203 Return from __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/autograd/grad_mode.py:272 to decorate_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py:114
2024-04-23 11:38:48.131227 Return from decorate_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py:114 to execute_method in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py:145
2024-04-23 11:38:48.131249 Return from execute_method in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py:145 to _resume_span in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py:467
2024-04-23 11:38:48.131272 Return from _resume_span in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py:467 to actor_method_executor in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/function_manager.py:724
2024-04-23 11:38:48.131301 Return from actor_method_executor in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/function_manager.py:724 to main_loop in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:879
2024-04-23 11:38:48.131332 Call to record_task_log_end in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:620 from main_loop in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:879
2024-04-23 11:38:48.131354 Call to actor_id in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:499 from record_task_log_end in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:623
2024-04-23 11:38:48.131386 Return from actor_id in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:502 to record_task_log_end in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:623
2024-04-23 11:38:48.131406 Return from record_task_log_end in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:628 to main_loop in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:879
2024-04-23 11:38:48.131463 Call to get_serialization_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:660 from main_loop in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:879
2024-04-23 11:38:48.131484 Call to current_job_id in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:491 from get_serialization_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:670
2024-04-23 11:38:48.131505 Return from current_job_id in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:494 to get_serialization_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:670
2024-04-23 11:38:48.131530 Return from get_serialization_context in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:672 to main_loop in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:879
2024-04-23 11:38:48.131550 Call to serialize in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:482 from main_loop in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:879
2024-04-23 11:38:48.131573 Call to _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:433 from serialize in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:494
2024-04-23 11:38:48.131614 Call to packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:30 from _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:468
2024-04-23 11:38:48.131645 Return from packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:36 to _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:468
2024-04-23 11:38:48.131678 Call to packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:30 from _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:478
2024-04-23 11:38:48.131701 Return from packb in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/msgpack/__init__.py:36 to _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:478
2024-04-23 11:38:48.131721 Return from _serialize_to_msgpack in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:478 to serialize in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:494
2024-04-23 11:38:48.131747 Return from serialize in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/serialization.py:494 to main_loop in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:879
2024-04-23 11:38:48.131796 Call to _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2528 from __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/contextlib.py:142
2024-04-23 11:38:48.131819 Call to _mode in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:3051 from _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2530
2024-04-23 11:38:48.131838 Return from _mode in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:3059 to _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2530
2024-04-23 11:38:48.131882 Return from _changeproctitle in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:2531 to __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/contextlib.py:142
2024-04-23 11:38:48.131959 Call to disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:67 from __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/contextlib.py:142
2024-04-23 11:38:48.131985 Call to _set_client_hook_status in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:38 from disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:69
2024-04-23 11:38:48.132008 Return from _set_client_hook_status in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:40 to disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:69
2024-04-23 11:38:48.132026 Return from disable_client_hook in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:69 to __exit__ in /root/anaconda3/envs/vllm/lib/python3.10/contextlib.py:142 |
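For readers unfamiliar with this log format: each line records a Python function call or return together with its caller. The sketch below is not the tool that produced the trace above. It is a minimal illustration, assuming only the standard-library `sys.settrace` hook, of how lines of the same shape can be generated. The names `make_tracer`, `inner`, and `outer` are hypothetical.

```python
import datetime
import sys


def make_tracer(sink):
    """Build a trace function that appends one line per call/return to `sink`."""

    def tracer(frame, event, arg):
        # Ignore "line" and "exception" events, but keep tracing this frame.
        if event not in ("call", "return"):
            return tracer
        code = frame.f_code
        caller = frame.f_back
        ts = datetime.datetime.now()
        here = f"{code.co_name} in {code.co_filename}:{frame.f_lineno}"
        if caller is not None:
            there = (f"{caller.f_code.co_name} in "
                     f"{caller.f_code.co_filename}:{caller.f_lineno}")
        else:
            there = "<no caller>"
        if event == "call":
            sink.append(f"{ts} Call to {here} from {there}")
        else:
            sink.append(f"{ts} Return from {here} to {there}")
        return tracer

    return tracer


def inner():
    return 1


def outer():
    return inner() + 1


lines = []
sys.settrace(make_tracer(lines))
try:
    result = outer()
finally:
    sys.settrace(None)  # always switch tracing off again

for line in lines:
    print(line)
```

When a process hangs, the last lines of such a trace show which Python function was entered most recently, which is why a trace like the one above is useful for narrowing down where the exit path stalls.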
I think there is something wrong with your ray environment, but I'm not sure. |
@youkaichao this hang does not occur on 0.4.0.post1 in my tests. |
@youkaichao can you reproduce this? The process still cannot exit.
del llm
(RayWorkerWrapper pid=2301191) INFO 04-28 17:56:00 model_runner.py:953] Graph capturing finished in 9 secs. [repeated 6x across cluster]
(RayWorkerWrapper pid=2301191) [W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:10.189.108.254]:8949 (errno: 97 - Address family not supported by protocol). [repeated 6x across cluster]
# hang .... |
@youkaichao After I delete driver_worker.worker.model_runner manually, the process exits normally! It seems the CUDA graph captured by driver_worker.worker.model_runner is not deleted automatically when the process exits.
del llm.llm_engine.model_executor.driver_worker.worker.model_runner
Script to reproduce:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

def main(args):
    model_name = args.model
    llm = LLM(model="Qwen/Qwen1.5-72B-Chat",  # any model is OK.
              tokenizer_mode='slow',
              trust_remote_code=True,
              tensor_parallel_size=8,
              max_model_len=8192,
              swap_space=4,
              gpu_memory_utilization=0.9,
              disable_custom_all_reduce=True,
              enable_prefix_caching=True,
              enforce_eager=False)  # False keeps CUDA graph capture enabled
    tokenizer = AutoTokenizer.from_pretrained(
        "Qwen/Qwen1.5-72B-Chat", use_fast=False, trust_remote_code=True)
    sampling_params = SamplingParams(best_of=1,
                                     frequency_penalty=0.0,
                                     temperature=0,
                                     max_tokens=512,
                                     presence_penalty=1.0,
                                     top_p=1.0,
                                     skip_special_tokens=True,
                                     include_stop_str_in_output=False)
    request_output = llm.generate("你是谁?",  # "Who are you?"
                                  sampling_params=sampling_params,
                                  use_tqdm=False)
    # Delete model_runner manually. Without this, the process cannot exit normally.
    del llm.llm_engine.model_executor.driver_worker.worker.model_runner
I would be very grateful if you could take a look at this issue. The |
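The workaround above relies on ordinary Python object lifetime: removing the last reference to an object lets its finalizer run right away instead of at interpreter shutdown. The sketch below illustrates only that mechanism with hypothetical stand-in classes (`FakeGraphRunner`, `FakeWorker`). It does not use vLLM and makes no claim about why the real process hung.

```python
import gc


class FakeGraphRunner:
    """Hypothetical stand-in for an object that owns captured graph state."""

    def __init__(self, released):
        self._released = released
        self.graphs = {"batch_1": object()}

    def __del__(self):
        # Release the owned state and record that the finalizer ran.
        self.graphs.clear()
        self._released.append("graphs released")


class FakeWorker:
    """Hypothetical stand-in for the worker that holds the runner."""

    def __init__(self, released):
        self.model_runner = FakeGraphRunner(released)


released = []
worker = FakeWorker(released)

# Dropping the only reference triggers __del__ immediately in CPython.
del worker.model_runner
gc.collect()  # also covers interpreters without reference counting

print(released)
```

Explicit deletion makes the release order deterministic, whereas objects still alive at interpreter shutdown are finalized in an order that is harder to reason about.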
Will take a look in the next week. In addition, I think this might be related to how ray manages processes. cc @rkooo567 FYI. |
@youkaichao many thanks~ |
@youkaichao let me know if you need any assistance! The fact that |
@youkaichao @rkooo567 This error no longer occurs with the latest vLLM. Can you tell me how it was fixed? I'm curious.
2024-05-09 14:23:39.711582 Call to connected in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:473 from __del__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/actor.py:1348
2024-05-09 14:23:39.711599 Return from connected in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py:476 to __del__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/actor.py:1348
2024-05-09 14:23:39.711616 Return from __del__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/ray/actor.py:1348 to in :0
2024-05-09 14:23:39.711634 Call to __del__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py:1019 from in :0
2024-05-09 14:23:39.769073 Return from __del__ in /root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py:1027 to in :0 |
Hmm I am not sure what's changed, but @youkaichao made several PRs to clean up tp > 1 cases. Maybe it was fixed by that... |
Might be related with #4508 (comment) . |
Your current environment
🐛 Describe the bug
In an offline test, the process hangs without exiting when CUDA graph is used.
Without CUDA graph, it exits normally. Is something wrong with how vLLM uses CUDA graph?