[CI Failure]: AMD structured outputs tests OOM with @pytest.mark.skip_global_cleanup

### Name of failing test

tests/v1/entrypoints/llm/test_struct_output_generate.py

### Basic information

- [ ] Flaky test
- [x] Can reproduce locally
- [ ] Caused by external libraries (e.g. bug in `transformers`)

### 🧪 Describe the failing test

Structured outputs tests were added in https://github.com/vllm-project/vllm/pull/12388 with `@pytest.mark.skip_global_cleanup` to reduce test time. However, this causes the tests to OOM on AMD CI specifically ([example](https://buildkite.com/vllm/amd-ci/builds/736#019a3390-9acd-43ae-9496-9deeed1db4d8)).
```
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]     self.driver_worker.init_device()
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 308, in init_device
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]     self.worker.init_device()  # type: ignore
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]     ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 207, in init_device
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779]     raise ValueError(
2025-10-30 05:48:54 UTC | (EngineCore_DP0 pid=3997) ERROR 10-30 05:48:54 [core.py:779] ValueError: Free memory on device (24.92/255.98 GiB) on startup is less than desired GPU memory utilization (0.9, 230.39 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
```
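To make the failure condition concrete, here is a minimal sketch of the kind of startup check the error message describes. It is not vLLM's actual implementation; the function names are hypothetical, and the byte values are the rounded GiB figures from the log above.

```python
GIB = 1024 ** 3


def requested_bytes(total_bytes: int, gpu_memory_utilization: float) -> int:
    """Memory the engine wants to reserve: a fraction of total device memory."""
    return int(total_bytes * gpu_memory_utilization)


def has_enough_free_memory(
    free_bytes: int, total_bytes: int, gpu_memory_utilization: float
) -> bool:
    """Return True if the device has at least the requested amount free."""
    return free_bytes >= requested_bytes(total_bytes, gpu_memory_utilization)


# Rounded values from the log: 24.92 GiB free out of 255.98 GiB total.
free = int(24.92 * GIB)
total = int(255.98 * GIB)

# With utilization 0.9 the engine asks for roughly 230 GiB, far more than
# the ~25 GiB left over by the previous test, so the check fails.
print(has_enough_free_memory(free, total, 0.9))  # False
```

On a real device, `torch.cuda.mem_get_info()` returns the `(free, total)` pair in bytes and could be used to confirm how much memory the previous test left allocated.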

With `@pytest.mark.skip_global_cleanup`, these tests do not call `cleanup_dist_env_and_memory`, which runs garbage collection and clears the PyTorch caches. Some hypotheses:

1. The NCCL ProcessGroup is leaking on ROCm, as suggested by this log:
```
[rank0]:[W1030 05:48:55.456939484 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
2025-10-30 05:48:56 UTC | FAILED
```
2. A difference in GC or `torch.compile` behavior between CUDA and ROCm.

As a short-term mitigation, we will call `cleanup_dist_env_and_memory` when these tests run on AMD.
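The mitigation could be sketched roughly as follows. This is only an illustration of the decision logic, not the actual fix: `should_cleanup` and `run_after_test` are hypothetical helpers, `is_rocm` is assumed to come from a platform check, and `cleanup` stands in for `cleanup_dist_env_and_memory`.

```python
from typing import Callable


def should_cleanup(has_skip_marker: bool, is_rocm: bool) -> bool:
    """Decide whether global cleanup should run after a test.

    Cleanup runs unless the test opted out via skip_global_cleanup,
    except on ROCm, where the opt-out is ignored.
    """
    return is_rocm or not has_skip_marker


def run_after_test(
    has_skip_marker: bool, is_rocm: bool, cleanup: Callable[[], None]
) -> bool:
    """Call `cleanup` if the policy above requires it; report whether it ran."""
    if should_cleanup(has_skip_marker, is_rocm):
        cleanup()
        return True
    return False
```

In a pytest setup, `has_skip_marker` would typically be derived from `request.node.get_closest_marker("skip_global_cleanup")` inside an autouse fixture that calls `run_after_test` after its `yield`.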

### 📝 History of failing test

https://buildkite.com/vllm/amd-ci/builds/736#019a3390-9acd-43ae-9496-9deeed1db4d8

### CC List.

@aarnphm @mgoin @WoosukKwon @Lucaskabela @zou3519 @mxz297 @gshtras 

[CI Failure]: AMD structured outputs tests OOM with @pytest.mark.skip_global_cleanup #27844
