perf: use f16 as split-k partial output data type #900

yzh119 · 2025-02-27T02:13:04Z

We currently use f32 as the split-k partial output data type, mainly out of concern for accuracy.
However, the official FlashMLA implementation uses a 16-bit partial data type, so I guess 16-bit split-k is enough for production.
This PR changes the intermediate data type of FlashInfer's MLA implementation to 16-bit.
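For readers unfamiliar with the trade-off, here is a minimal sketch of a split-k attention merge in which the storage type of the partial outputs is a parameter. It is not FlashInfer's kernel: the helper names (`to_f16`, `to_f32`, `partial_attention`, `split_k_attention`), the scalar "head dimension of 1", and the test data are all assumptions made for illustration. It also assumes the log-sum-exp of each chunk stays in full precision and only the partial output is rounded.

```python
import math
import struct


def to_f16(x: float) -> float:
    """Round a Python float to IEEE binary16 and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_f32(x: float) -> float:
    """Round a Python float to IEEE binary32 and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def partial_attention(scores, values):
    """Softmax-weighted average over one KV chunk, plus its log-sum-exp."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / denom
    return out, m + math.log(denom)


def split_k_attention(scores, values, num_splits, store):
    """Split the KV axis, store each partial output via `store`, then merge."""
    n = len(scores)
    chunk = (n + num_splits - 1) // num_splits
    partials = []
    for start in range(0, n, chunk):
        out, lse = partial_attention(scores[start:start + chunk],
                                     values[start:start + chunk])
        # Only the partial output is rounded; the LSE stays in full precision.
        partials.append((store(out), lse))
    m = max(lse for _, lse in partials)
    weights = [math.exp(lse - m) for _, lse in partials]
    return sum(w * o for w, (o, _) in zip(weights, partials)) / sum(weights)


# Hypothetical, deterministic test data (not taken from the PR).
scores = [0.1 * ((7 * i) % 13) for i in range(64)]
values = [math.sin(i) for i in range(64)]

reference, _ = partial_attention(scores, values)
err_f32 = abs(split_k_attention(scores, values, 4, to_f32) - reference)
err_f16 = abs(split_k_attention(scores, values, 4, to_f16) - reference)
print(f"f32 partials: abs error {err_f32:.3e}")
print(f"f16 partials: abs error {err_f16:.3e}")
```

Because the merge is a convex combination of the partial outputs, the rounding error of the final result in this sketch is bounded by the largest rounding error of any single stored partial. That is the intuition behind expecting 16-bit partials to be tolerable, though it says nothing about real kernels or real data.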

This is the third of four split-k related optimizations.

This reverts commit ffa9439.

muoshuosha · 2025-02-27T09:46:24Z

If the input and output are bf16, is the accuracy still sufficient without using fp32 for partial output?

yzh119 · 2025-02-27T21:25:26Z

Hi @muoshuosha, the official FlashMLA implementation uses bf16 as the partial output type when inputs and outputs are bf16. Here we just align with the official design.
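To put rough numbers on the accuracy question, the sketch below compares the worst-case relative rounding error of bf16 and f16 over a small range of values. The helpers `to_bf16`, `to_f16`, and `rel_err` and the sample grid are assumptions for illustration only. bf16 is emulated by rounding the float32 bit pattern to its upper 16 bits, and NaN and infinity handling is omitted.

```python
import struct


def to_bf16(x: float) -> float:
    """Round to bfloat16 via the float32 bit pattern (nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties to even
    bits &= 0xFFFF0000                   # keep sign, 8 exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_f16(x: float) -> float:
    """Round to IEEE binary16 and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rel_err(rounded: float, exact: float) -> float:
    """Relative error of a rounded value against the exact one."""
    return abs(rounded - exact) / abs(exact)


# Hypothetical sample grid inside the normal range of both formats.
samples = [0.001 * k for k in range(1, 2000)]
max_bf16 = max(rel_err(to_bf16(x), x) for x in samples)
max_f16 = max(rel_err(to_f16(x), x) for x in samples)
print(f"max relative error, bf16: {max_bf16:.3e} (bound 2^-8)")
print(f"max relative error, f16:  {max_f16:.3e} (bound 2^-11)")
```

bf16 keeps 8 significant bits against 11 for f16, so each stored bf16 partial can carry roughly eight times the relative rounding error of an f16 partial. Whether that is sufficient depends on the workload and is not settled by this sketch.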

yzh119 · 2025-03-04T03:01:13Z

As pointed out by @beginlner, FlashMLA uses f32 as the intermediate output data type. We should make the partial O dtype a template parameter.

yzh119 added 4 commits February 26, 2025 04:39

upd

7273179

upd

15e8eb2

upd

0e9baa6

revert changes to tests and benchmarks

2c6f336

yzh119 mentioned this pull request Feb 27, 2025

[Tracking Issue] MLA performance tracking #897

Open

10 tasks

yzh119 added 4 commits February 27, 2025 03:19

upd

ffa9439

Revert "upd"

96d11b9

This reverts commit ffa9439.

upd

6ac2230

upd

33c3e58

yzh119 merged commit e4a68e4 into main Feb 27, 2025

zhyncs deleted the mla-16bit-splitk branch February 27, 2025 16:38

MasterJH5574 added a commit to MasterJH5574/flashinfer that referenced this pull request Mar 13, 2025

fix: Fix MLA TVM binding for the latest changes

c808c53

This PR applies changes in flashinfer-ai#898 and flashinfer-ai#900 to the MLA TVM binding.

MasterJH5574 mentioned this pull request Mar 13, 2025

fix: Fix MLA TVM binding for the latest changes #940

Merged

yzh119 pushed a commit that referenced this pull request Mar 13, 2025

fix: Fix MLA TVM binding for the latest changes (#940)

534c3c9

This PR applies changes in #898 and #900 to the MLA TVM binding.
