Open
Labels
ci-failure (Issue about an unexpected test failure in CI), rocm (Related to AMD ROCm)
Description
Name of failing test
tests/v1/entrypoints/llm/test_struct_output_generate.py
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
🧪 Describe the failing test
The structured output tests were added in #12388 with @pytest.mark.skip_global_cleanup to reduce test time. However, skipping the cleanup is causing these tests to run out of GPU memory, specifically on AMD CI. (example)
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] self.driver_worker.init_device()
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 308, in init_device
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] self.worker.init_device() # type: ignore
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 207, in init_device
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] raise ValueError(
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] ValueError: Free memory on device (24.92/255.98 GiB) on startup is less than desired GPU memory utilization (0.9, 230.39 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
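To make the numbers in the error concrete, here is a minimal sketch of the kind of startup check the message describes. The function name and signature are hypothetical and this is not vLLM's actual code in gpu_worker.py; it only reproduces the arithmetic stated in the log (requested budget = utilization × total device memory, compared against free memory).

```python
def check_startup_memory(free_gib: float, total_gib: float,
                         gpu_memory_utilization: float) -> float:
    """Return the requested memory budget in GiB, or raise if it cannot be met.

    Hypothetical helper for illustration only; it mirrors the comparison
    described in the error message, not vLLM's real implementation.
    """
    # Budget the engine wants to reserve: a fraction of total device memory.
    requested_gib = gpu_memory_utilization * total_gib
    if free_gib < requested_gib:
        raise ValueError(
            f"Free memory on device ({free_gib:.2f}/{total_gib:.2f} GiB) is less "
            f"than desired GPU memory utilization "
            f"({gpu_memory_utilization}, {requested_gib:.2f} GiB)."
        )
    return requested_gib
```

With the values from the log (24.92 GiB free of 255.98 GiB, utilization 0.9), the budget is roughly 230.4 GiB, so the check fails: about 231 GiB of device memory is still held when the next test starts its engine.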
With @pytest.mark.skip_global_cleanup, these tests do not call cleanup_dist_env_and_memory, which runs garbage collection and clears the PyTorch caches. Some hypotheses for why this only fails on AMD:
- The NCCL ProcessGroup is leaking on ROCm, as suggested by this log:
[rank0]:[W1030 05:48:55.456939484 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2025-10-30 05:48:56 UTC | FAILED
- GC behavior or torch.compile behaves differently on ROCm than on CUDA.
As a short-term mitigation, we will call cleanup_dist_env_and_memory when these tests run on AMD.
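The mitigation amounts to a small decision rule: honor the skip marker everywhere except on ROCm. The sketch below is an assumption about how that could look, not the actual patch; `should_run_global_cleanup` and `cleanup_after_test` are hypothetical names, and the cleanup callable is passed in so the logic can be read and exercised without a GPU.

```python
from typing import Callable


def should_run_global_cleanup(has_skip_marker: bool, is_rocm: bool) -> bool:
    """Decide whether the post-test cleanup should run (hypothetical rule).

    On ROCm the cleanup always runs, even if the test carries the
    skip_global_cleanup marker; elsewhere the marker is honored.
    """
    return is_rocm or not has_skip_marker


def cleanup_after_test(has_skip_marker: bool, is_rocm: bool,
                       cleanup_fn: Callable[[], None]) -> bool:
    """Call cleanup_fn when the rule above says so; return whether it ran."""
    if should_run_global_cleanup(has_skip_marker, is_rocm):
        cleanup_fn()
        return True
    return False
```

In a pytest fixture, `has_skip_marker` could be derived from `request.node.get_closest_marker("skip_global_cleanup") is not None`, `is_rocm` from a check such as `torch.version.hip is not None`, and `cleanup_fn` would be cleanup_dist_env_and_memory. That wiring is assumed here and would need to match the existing conftest.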
📝 History of failing test
https://buildkite.com/vllm/amd-ci/builds/736#019a3390-9acd-43ae-9496-9deeed1db4d8
CC List.
@aarnphm @mgoin @WoosukKwon @Lucaskabela @zou3519 @mxz297 @gshtras