
[CI Failure]: CUTLASS MLA decode is flaky #24590

@MatthewBonanni

Description

Name of failing test

tests/kernels/test_cutlass_mla_decode.py::test_cutlass_mla_decode[torch_dtype1-False-True-64-512-576-1-16-4096-1-128]

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

The FP8 cases fail the CUTLASS MLA decode test:

    def cal_diff(x: torch.Tensor,
                 y: torch.Tensor,
                 name: str,
                 use_fp8: bool = False) -> None:
        x, y = x.double(), y.double()
        cos_diff = 1 - 2 * (x * y).sum().item() / max(
            (x * x + y * y).sum().item(), 1e-12)
        if (use_fp8):
>           assert cos_diff < 1e-4
E           assert 1.0 < 0.0001
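
The failing assertion compares a cosine-style difference metric against a 1e-4 threshold. A value of `cos_diff == 1.0`, as seen in the failure, means the kernel output has essentially zero correlation with the reference (e.g. an all-zeros or garbage output), not a small numerical drift. A minimal standalone sketch of the metric's behavior (the helper below mirrors the quoted `cal_diff`; the example tensors are illustrative, not from the test):

```python
import torch

def cal_diff(x: torch.Tensor, y: torch.Tensor) -> float:
    # Same formula as the quoted test helper: 0.0 for identical tensors,
    # 1.0 when the tensors are uncorrelated or one of them is all zeros.
    x, y = x.double(), y.double()
    return 1 - 2 * (x * y).sum().item() / max(
        (x * x + y * y).sum().item(), 1e-12)

torch.manual_seed(0)
ref = torch.randn(4, 8)
close = ref + 1e-4 * torch.randn(4, 8)  # small perturbation -> tiny diff

print(cal_diff(ref, ref))                # 0.0
print(cal_diff(ref, close) < 1e-4)       # True: passes the test's threshold
print(cal_diff(ref, torch.zeros(4, 8)))  # 1.0: the value seen in the CI failure
```

So the `assert 1.0 < 0.0001` failure indicates the FP8 path is producing output unrelated to the reference, rather than losing a little precision.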

📝 History of failing test

The test started failing when I added the FP8 cases in #23289; only the FP8 cases fail.

https://buildkite.com/vllm/ci/builds/29987#01992f55-5a98-42e8-9589-751e26e35165

CC List

No response

Metadata

Labels

ci-failure: Issue about an unexpected test failure in CI
