
Conversation

@chengyupku
Contributor

  • Remove redundant `acc_s_0` fragment in flash attention kernel
  • Simplify memory copy and reduction operations
  • Reorder memory copy and scaling steps for improved performance
  • Add Hopper-specific synchronization method in CUDA reduce template (see the sketch after this list)
  • Update reduce operation to use architecture-specific synchronization
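
For reference, here is a minimal CUDA sketch of what an architecture-dispatched barrier inside a shared-memory reduction could look like. The names (`reduce_sync`, `block_reduce_sum`, `kNumThreads`) are hypothetical and not taken from the actual tile-ai reduce template; the sketch only illustrates the dispatch pattern, where on Hopper (SM90) a named barrier can synchronize just the threads that take part in the reduction, which matters for warp-specialized kernels, while older architectures fall back to a block-wide `__syncthreads()`.

```cuda
// Hypothetical sketch: architecture-specific synchronization for a reduction.
#include <cuda_runtime.h>

template <int kNumThreads>
__device__ __forceinline__ void reduce_sync() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
  // Hopper path: named barrier 1 with an explicit arrival count, so only the
  // kNumThreads threads doing the reduction need to reach the barrier
  // (kNumThreads must be a multiple of the warp size).
  asm volatile("bar.sync %0, %1;" ::"r"(1), "r"(kNumThreads));
#else
  // Pre-Hopper fallback: synchronize the whole thread block
  // (assumes the reduction spans the entire block).
  __syncthreads();
#endif
}

// Usage: a plain tree reduction over shared memory that calls the
// architecture-dispatched barrier between steps.
template <int kNumThreads>
__device__ float block_reduce_sum(float val, float* smem) {
  const int tid = threadIdx.x;
  smem[tid] = val;
  reduce_sync<kNumThreads>();
  for (int stride = kNumThreads / 2; stride > 0; stride >>= 1) {
    if (tid < stride) smem[tid] += smem[tid + stride];
    reduce_sync<kNumThreads>();
  }
  return smem[0];
}
```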

@chengyupku merged commit 16b919b ("…cc_s from float to float16") into tile-ai:main on Mar 5, 2025.
3 checks passed