
Conversation

@chengyupku
Contributor

  • Remove redundant `acc_s_0` fragment in flash attention kernel
  • Simplify memory copy and reduction operations
  • Reorder memory copy and scaling steps for improved performance
  • Add Hopper-specific synchronization method in CUDA reduce template (see the sketch after this list)
  • Update reduce operation to use architecture-specific synchronization
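
For reference, here is a minimal CUDA sketch of what an architecture-dispatched barrier inside a shared-memory reduction could look like. The names (`reduce_sync`, `block_reduce_sum`, `kNumThreads`) are hypothetical and not taken from the actual tile-ai reduce template; the sketch only illustrates the dispatch pattern, where on Hopper (SM90) a named barrier can synchronize just the threads that take part in the reduction, which matters for warp-specialized kernels, while older architectures fall back to a block-wide `__syncthreads()`.

```cuda
// Hypothetical sketch: architecture-specific synchronization for a reduction.
#include <cuda_runtime.h>

template <int kNumThreads>
__device__ __forceinline__ void reduce_sync() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
  // Hopper path: named barrier 1 with an explicit arrival count, so only the
  // kNumThreads threads doing the reduction need to reach the barrier
  // (kNumThreads must be a multiple of the warp size).
  asm volatile("bar.sync %0, %1;" ::"r"(1), "r"(kNumThreads));
#else
  // Pre-Hopper fallback: synchronize the whole thread block
  // (assumes the reduction spans the entire block).
  __syncthreads();
#endif
}

// Usage: a plain tree reduction over shared memory that calls the
// architecture-dispatched barrier between steps.
template <int kNumThreads>
__device__ float block_reduce_sum(float val, float* smem) {
  const int tid = threadIdx.x;
  smem[tid] = val;
  reduce_sync<kNumThreads>();
  for (int stride = kNumThreads / 2; stride > 0; stride >>= 1) {
    if (tid < stride) smem[tid] += smem[tid + stride];
    reduce_sync<kNumThreads>();
  }
  return smem[0];
}
```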

@chengyupku merged commit 16b919b ("…cc_s from float to float16") into tile-ai:main on Mar 5, 2025.
3 checks passed