[CI Failure]: Speculative decoding tests - spec_decode/e2e/test_eagle_correctness.py #20214

@mgoin

Name of failing test

spec_decode/e2e/test_eagle_correctness.py::test_llama3_eagle_e2e_greedy_correctness[1-1-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0]

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

The test doesn't fail locally, but that might be because the OOM is specific to the L4 GPU we use in CI.

https://buildkite.com/vllm/ci/builds/22853/steps/canvas?jid=0197b520-e1dc-4ace-bfdc-f483b4dee76f

[2025-06-28T09:19:58Z] FAILED spec_decode/e2e/test_eagle_correctness.py::test_llama3_eagle_e2e_greedy_correctness[1-1-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 116.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 112.12 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.56 GiB is allocated by PyTorch, and 113.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-06-28T09:19:58Z] FAILED spec_decode/e2e/test_eagle_correctness.py::test_llama3_eagle_e2e_greedy_correctness[1-5-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 116.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 112.12 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.56 GiB is allocated by PyTorch, and 113.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-06-28T09:19:58Z] FAILED spec_decode/e2e/test_eagle_correctness.py::test_qwen2_eagle_e2e_greedy_correctness[1-1-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 862.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 394.12 MiB is free. Including non-PyTorch memory, this process has 21.64 GiB memory in use. Of the allocated memory 21.27 GiB is allocated by PyTorch, and 119.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-06-28T09:19:58Z] FAILED spec_decode/e2e/test_eagle_correctness.py::test_qwen2_eagle_e2e_greedy_correctness[1-5-32-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 862.00 MiB. GPU 0 has a total capacity of 22.05 GiB of which 394.12 MiB is free. Including non-PyTorch memory, this process has 21.64 GiB memory in use. Of the allocated memory 21.27 GiB is allocated by PyTorch, and 119.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
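
To compare how far each failure was from fitting in memory, the figures in these messages can be pulled out with a small script. This is only a sketch for triage: `OOM_PATTERN` and `parse_oom` are hypothetical names, not part of vLLM or PyTorch, and the regular expression assumes the message wording shown in the log lines above.

```python
import re

# Matches the allocation figures in a PyTorch CUDA OOM message.
# Assumes the wording seen in the CI log above; other PyTorch versions may differ.
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) (?P<requested_unit>[KMG]iB)\. "
    r"GPU (?P<gpu>\d+) has a total capacity of "
    r"(?P<total>[\d.]+) (?P<total_unit>[KMG]iB) "
    r"of which (?P<free>[\d.]+) (?P<free_unit>[KMG]iB) is free"
)

# Conversion factors from each unit to MiB.
_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(line):
    """Return the memory figures (in MiB) from one OOM log line, or None."""
    match = OOM_PATTERN.search(line)
    if match is None:
        return None
    requested = float(match["requested"]) * _TO_MIB[match["requested_unit"]]
    total = float(match["total"]) * _TO_MIB[match["total_unit"]]
    free = float(match["free"]) * _TO_MIB[match["free_unit"]]
    return {
        "gpu": int(match["gpu"]),
        "requested_mib": requested,
        "total_mib": total,
        "free_mib": free,
        # How much more memory the allocation needed than was free.
        "shortfall_mib": requested - free,
    }


# Excerpt of the llama3 failure message from the log above.
sample = (
    "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 116.00 MiB. "
    "GPU 0 has a total capacity of 22.05 GiB of which 112.12 MiB is free."
)
info = parse_oom(sample)
print(f"shortfall: {info['shortfall_mib']:.2f} MiB")
```

On these logs, the llama3 cases miss by about 3.88 MiB (116.00 MiB requested, 112.12 MiB free), while the qwen2 cases miss by about 467.88 MiB (862.00 MiB requested, 394.12 MiB free).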

📝 History of failing test

These tests appear to have been failing since they were added:
https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests/f5787f7b-48c2-83fa-85e4-b02c88a7fa74?period=28days&tags=scm.branch%3Amain

CC List.

No response

Metadata

Assignees: No one assigned
Labels: ci-failure (Issue about an unexpected test failure in CI)
Status: Done