Status: Closed
Labels: ci-failure (Issue about an unexpected test failure in CI)
Description
Name of failing test
spec_decode/e2e/test_eagle_correctness.py::test_llama3_eagle_e2e_greedy_correctness[1-1-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0]
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
🧪 Describe the failing test
It doesn't fail locally, but that may be because the OOM is specific to the L4 GPU we use in CI.
https://buildkite.com/vllm/ci/builds/22853/steps/canvas?jid=0197b520-e1dc-4ace-bfdc-f483b4dee76f
[2025-06-28T09:19:58Z] FAILED spec_decode/e2e/test_eagle_correctness.py::test_llama3_eagle_e2e_greedy_correctness[1-1-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 116.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 112.12 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.56 GiB is allocated by PyTorch, and 113.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-06-28T09:19:58Z] FAILED spec_decode/e2e/test_eagle_correctness.py::test_llama3_eagle_e2e_greedy_correctness[1-5-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 116.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 112.12 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.56 GiB is allocated by PyTorch, and 113.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-06-28T09:19:58Z] FAILED spec_decode/e2e/test_eagle_correctness.py::test_qwen2_eagle_e2e_greedy_correctness[1-1-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 862.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 394.12 MiB is free. Including non-PyTorch memory, this process has 21.64 GiB memory in use. Of the allocated memory 21.27 GiB is allocated by PyTorch, and 119.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-06-28T09:19:58Z] FAILED spec_decode/e2e/test_eagle_correctness.py::test_qwen2_eagle_e2e_greedy_correctness[1-5-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 862.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 394.12 MiB is free. Including non-PyTorch memory, this process has 21.64 GiB memory in use. Of the allocated memory 21.27 GiB is allocated by PyTorch, and 119.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
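To see how close each run came to fitting on the 22.05 GiB L4, the numbers in the OOM messages above can be pulled out and compared. This is a minimal sketch that assumes the exact PyTorch message wording shown in these logs, with the requested size and free memory both reported in MiB. `parse_oom` and `OomReport` are hypothetical helpers written for this issue; they are not part of vLLM or PyTorch.

```python
import re
from typing import NamedTuple


class OomReport(NamedTuple):
    """Numbers extracted from one CUDA OOM message (hypothetical helper)."""
    requested_mib: float
    free_mib: float
    total_gib: float
    in_use_gib: float

    @property
    def shortfall_mib(self) -> float:
        # Extra free memory the failing allocation would have needed.
        return self.requested_mib - self.free_mib


# Assumes the message format seen in the CI logs: request and free memory in MiB,
# total capacity and in-use memory in GiB.
_PATTERN = re.compile(
    r"Tried to allocate (?P<req>[\d.]+) MiB\. "
    r"GPU \d+ has a total capacity of (?P<total>[\d.]+) GiB "
    r"of which (?P<free>[\d.]+) MiB is free\. "
    r"Including non-PyTorch memory, this process has (?P<used>[\d.]+) GiB memory in use"
)


def parse_oom(message: str) -> OomReport:
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("not a recognised CUDA OOM message")
    return OomReport(
        requested_mib=float(match["req"]),
        free_mib=float(match["free"]),
        total_gib=float(match["total"]),
        in_use_gib=float(match["used"]),
    )


if __name__ == "__main__":
    # Excerpts copied from the llama3 and qwen2 failures above.
    messages = {
        "llama3": "Tried to allocate 116.00 MiB. GPU 0 has a total capacity of 22.05 GiB "
                  "of which 112.12 MiB is free. Including non-PyTorch memory, this "
                  "process has 21.92 GiB memory in use.",
        "qwen2": "Tried to allocate 862.00 MiB. GPU 0 has a total capacity of 22.05 GiB "
                 "of which 394.12 MiB is free. Including non-PyTorch memory, this "
                 "process has 21.64 GiB memory in use.",
    }
    for name, msg in messages.items():
        report = parse_oom(msg)
        print(f"{name}: short by {report.shortfall_mib:.2f} MiB")
    # llama3: short by 3.88 MiB
    # qwen2: short by 467.88 MiB
```

By this arithmetic the Llama 3 runs missed by under 4 MiB, while the Qwen2 runs missed by roughly 468 MiB, so the two models may need different amounts of headroom on the L4.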
📝 History of failing test
These tests appear to have been failing since they were added.

https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests/f5787f7b-48c2-83fa-85e4-b02c88a7fa74?period=28days&tags=scm.branch%3Amain