[Performance] Accelerate cooperative kernel in single/batch decode. #62

yzh119 · 2024-01-09T15:55:05Z

The single/batch decode kernels (the implementation without Tensor Cores) can still be improved with better scheduling after `grid.sync()`.

yzh119 · 2024-01-18T11:16:36Z

This PR doesn't show stable performance; use #72 instead.

This PR adds the necessary headers for mma ops. It also adds a unit test for a `16x16` `MFMA` op using wavefront intrinsics.

To test the PR:

```
cd flashinfer/libflashinfer/tests/hip/
```

```
mkdir build && cd build
```

```
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include ..
```

```
make test_mfma_fp32_16x16x16fp16
```

```
./test_mfma_fp32_16x16x16fp16
```

The result should look like:

```
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from MfmaTest
[ RUN ] MfmaTest.CorrectResults
[ OK ] MfmaTest.CorrectResults (2348 ms)
[----------] 1 test from MfmaTest (2348 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2348 ms total)
[ PASSED ] 1 test.
```

yzh119 added 6 commits January 9, 2024 10:52

wip

959dfa5

fix

e7fade4

upd

8b6ab1a

fix

9e0d84e

Revert "fix"

08ec81b

This reverts commit 9e0d84e.

upd

04ab724

yzh119 marked this pull request as draft January 10, 2024 13:46

yzh119 closed this Jan 18, 2024

yzh119 deleted the accelerate-cooperative branch January 31, 2024 08:33
