[CI Failure]: AMD structured outputs tests OOM with @pytest.mark.skip_global_cleanup #27844

@zhewenl

Description

Name of failing test

tests/v1/entrypoints/llm/test_struct_output_generate.py

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

Structured outputs tests were added in #12388 with @pytest.mark.skip_global_cleanup to reduce test time; however, this is causing the tests to OOM on AMD CI specifically. (example)


2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]     self.driver_worker.init_device()
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 308, in init_device
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]     self.worker.init_device()  # type: ignore
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]     ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 207, in init_device
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]     raise ValueError(
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] ValueError: Free memory on device (24.92/255.98 GiB) on startup is less than desired GPU memory utilization (0.9, 230.39 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
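
The ValueError in this log comes from a startup check that compares free device memory against the requested budget. A minimal sketch of that arithmetic follows, using the rounded values from the log. The helper `check_startup_memory` is hypothetical and named here for illustration only; it is not vLLM's actual function.

```python
def check_startup_memory(free_gib: float, total_gib: float,
                         gpu_memory_utilization: float) -> float:
    """Return the requested budget in GiB, or raise if it exceeds free memory.

    Hypothetical helper for illustration; not vLLM's actual implementation.
    """
    requested_gib = total_gib * gpu_memory_utilization
    if free_gib < requested_gib:
        raise ValueError(
            f"Free memory on device ({free_gib:.2f}/{total_gib:.2f} GiB) is "
            f"less than desired GPU memory utilization "
            f"({gpu_memory_utilization}, {requested_gib:.2f} GiB)."
        )
    return requested_gib


# Values from the log: only 24.92 GiB is free, so a 0.9 budget cannot fit.
try:
    check_startup_memory(24.92, 255.98, 0.9)
except ValueError as exc:
    print(exc)
```

With these rounded inputs the budget comes out slightly below the 230.39 GiB in the log, which presumably reflects the unrounded device total. Either way, roughly 231 GiB of the device was still occupied at startup, which is consistent with memory from earlier tests not having been released.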

With @pytest.mark.skip_global_cleanup, these tests do not call cleanup_dist_env_and_memory, which runs gc and clears the PyTorch caches. Some hypotheses:

  1. The NCCL ProcessGroup is leaking on ROCm, as suggested by this log:
[rank0]:[W1030 05:48:55.456939484 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2025-10-30 05:48:56 UTC | FAILED
  2. GC behavior or Torch compile differences between CUDA and ROCm.
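
Both hypotheses point at resources that are normally released by the skipped cleanup. As a rough sketch of what such a teardown could look like: the function name `release_test_resources` is hypothetical, and this is not the body of cleanup_dist_env_and_memory. It uses only standard PyTorch calls and degrades to a plain gc pass when PyTorch is not installed.

```python
import gc


def release_test_resources() -> list[str]:
    """Best-effort teardown; returns the names of the steps that ran.

    Hypothetical helper for illustration only.
    """
    steps: list[str] = []
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        torch = None
        dist = None

    # Tear down the process group first so its communicator is released
    # before collection (the warning in the log is about skipping this step).
    if dist is not None and dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        steps.append("destroy_process_group")

    # Drop unreachable Python objects that may still hold device tensors.
    gc.collect()
    steps.append("gc.collect")

    # Return cached blocks to the device; torch.cuda also covers ROCm builds.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        steps.append("cuda.empty_cache")
    return steps


print(release_test_resources())
```

Comparing free device memory before and after such a teardown on both CUDA and ROCm runners would be one way to tell the two hypotheses apart.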

As a short-term mitigation, we will call cleanup_dist_env_and_memory when these tests run on AMD.
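
One way this mitigation could be expressed is as a small decision function that a cleanup fixture consults. The sketch below is an assumption about the shape of the fix, not the actual patch: `is_rocm_build` and `should_do_global_cleanup` are hypothetical names, and ROCm is detected through `torch.version.hip`, which is set on PyTorch ROCm builds.

```python
def is_rocm_build() -> bool:
    """Best-effort ROCm detection; False if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    return getattr(torch.version, "hip", None) is not None


def should_do_global_cleanup(has_skip_marker: bool, on_rocm: bool) -> bool:
    """Decide whether to run the global cleanup after a test.

    Unmarked tests always clean up. Tests marked skip_global_cleanup skip it
    only when not running on ROCm (the short-term mitigation).
    """
    return (not has_skip_marker) or on_rocm


# Possible wiring inside a pytest fixture (assumed, shown as comments so the
# sketch runs without pytest):
#   marked = request.node.get_closest_marker("skip_global_cleanup") is not None
#   if should_do_global_cleanup(marked, is_rocm_build()):
#       cleanup_dist_env_and_memory()

print(should_do_global_cleanup(has_skip_marker=True, on_rocm=is_rocm_build()))
```

Keeping the decision in a plain function means the marker continues to save time on CUDA runners while ROCm runners always pay for the cleanup.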

📝 History of failing test

https://buildkite.com/vllm/amd-ci/builds/736#019a3390-9acd-43ae-9496-9deeed1db4d8

CC List.

@aarnphm @mgoin @WoosukKwon @Lucaskabela @zou3519 @mxz297 @gshtras

Labels: ci-failure (Issue about an unexpected test failure in CI), rocm (Related to AMD ROCm)
