You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Dev][Doc] Enhance Flash Attention Implementation in GQA Decoding Example and Fix Typo (#139)
- Add non-split flash attention macro for more flexible kernel generation
- Implement `main_no_split` function to handle single-split scenarios
- Modify kernel selection logic to dynamically choose between split and non-split implementations
Copy file name to clipboardExpand all lines: examples/deepseek_mla/README.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -116,7 +116,7 @@ Here, `T.annotate_layout` allows users to specify any desired layout for a buffe
116
116
117
117
### Warp-Specialization
118
118
119
-
The Hopper architecture commonly employs warp specialization for performance optimization. A typical approach is to designate one warpgroup as a producer that handles data movement using TMA (Tensor Memory Access), while the remaining warpgroups serve as consumers performing computations. However, this programming pattern is complex, requiring developers to manually manage the execution logic for producers and consumers, including synchronization through the `mbarrier` objects.
119
+
The Hopper architecture commonly employs warp specialization for performance optimization. A typical approach is to designate one warpgroup as a producer that handles data movement using TMA (Tensor Memory Accelerator), while the remaining warpgroups serve as consumers performing computations. However, this programming pattern is complex, requiring developers to manually manage the execution logic for producers and consumers, including synchronization through the `mbarrier` objects.
120
120
121
121
In TileLang, users are completely shielded from these implementation details. The frontend script is automatically transformed into a warp-specialized form, where TileLang handles all producer-consumer synchronization automatically, enabling efficient computation.
0 commit comments