@yzh119 yzh119 commented Jan 9, 2024

The single and batch decode kernels (the implementations without Tensor Cores) can still be improved with better scheduling after `grid.sync()`.

@yzh119 yzh119 marked this pull request as draft January 10, 2024 13:46

yzh119 commented Jan 18, 2024

This PR doesn't show stable performance; use #72 instead.

@yzh119 yzh119 closed this Jan 18, 2024
@yzh119 yzh119 deleted the accelerate-cooperative branch January 31, 2024 08:33
diptorupd pushed a commit to ROCm/flashinfer that referenced this pull request Sep 29, 2025
This PR adds the necessary headers for mma ops. It also adds a unit test
for a `16x16` `MFMA` op using wavefront intrinsics.
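As a rough illustration of what such a unit test can compare against, here is a minimal host-side reference for a 16x16x16 multiply-accumulate, `D = A * B + C`, with fp32 accumulation. This is a sketch, not the test code from the commit: the function name `reference_mma_16x16x16`, the row-major layout, and the use of `float` as a stand-in for the fp16 inputs are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>

// Tile sizes for a 16x16x16 multiply-accumulate:
// D (M x N) = A (M x K) * B (K x N) + C (M x N).
constexpr std::size_t kM = 16;
constexpr std::size_t kN = 16;
constexpr std::size_t kK = 16;

// Row-major tiles. float stands in for the fp16 inputs (assumption);
// the accumulator and output are fp32, as in an fp32 <- fp16 MFMA.
using TileA = std::array<float, kM * kK>;
using TileB = std::array<float, kK * kN>;
using TileD = std::array<float, kM * kN>;

// Hypothetical CPU reference (not part of the commit): computes
// D = A * B + C one output element at a time.
inline TileD reference_mma_16x16x16(const TileA& a, const TileB& b,
                                    const TileD& c) {
  TileD d{};
  for (std::size_t i = 0; i < kM; ++i) {
    for (std::size_t j = 0; j < kN; ++j) {
      float acc = c[i * kN + j];  // start from the accumulator input
      for (std::size_t k = 0; k < kK; ++k) {
        acc += a[i * kK + k] * b[k * kN + j];
      }
      d[i * kN + j] = acc;
    }
  }
  return d;
}
```

A device-side result would typically be copied back to the host and compared element by element against this reference, with a tolerance chosen for fp16 inputs.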

To test the PR:

```
cd flashinfer/libflashinfer/tests/hip/
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include ..
make test_mfma_fp32_16x16x16fp16
./test_mfma_fp32_16x16x16fp16
```

The result should look like:
```
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from MfmaTest
[ RUN      ] MfmaTest.CorrectResults
[       OK ] MfmaTest.CorrectResults (2348 ms)
[----------] 1 test from MfmaTest (2348 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2348 ms total)
[  PASSED  ] 1 test.
```