Commit e5135a9
fix bug & add AMD examples (tile-ai#966)
* [Enhancement] Refactor buffer index handling for improved precision and clarity (tile-ai#668)
- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.
* Remove obsolete test script for AMD example, streamlining the examples directory.
* Remove unused dtype_size variable in AMD example script to streamline code.
* Add input configuration file and update AMD example script for enhanced flexibility
- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.
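As a rough illustration of the configuration space these options open up, here is a minimal sketch of how such an autotuner sweep could be generated. The function name `get_configs` and the exact value lists are assumptions for exposition, not the script's actual settings.

```python
# Hypothetical sketch of an autotuner configuration sweep; parameter names mirror
# the options mentioned above (num_stages, enable_rasterization, k_pack) but the
# value lists are illustrative, not the repository's actual defaults.
from itertools import product

def get_configs():
    block_M = [32, 64, 128]
    block_N = [32, 64, 128]
    num_stages = [0, 1, 2]
    threads = [128, 256]
    enable_rasterization = [True, False]
    k_pack = [1, 2]
    return [
        dict(block_M=m, block_N=n, num_stages=s, threads=t,
             enable_rasterization=r, k_pack=k)
        for m, n, s, t, r, k in product(block_M, block_N, num_stages,
                                        threads, enable_rasterization, k_pack)
    ]

print(len(get_configs()))  # 3 * 3 * 3 * 2 * 2 * 2 = 216 candidate configs
```

The autotuner can then benchmark each candidate dictionary and keep the fastest kernel.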
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.
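For readers unfamiliar with swizzling, the sketch below shows the underlying idea with a plain-Python XOR permutation. It is a conceptual illustration only and does not use TileLang's layout-annotation API.

```python
# Conceptual illustration of an XOR-based swizzle, the idea behind shared-memory
# layout annotations for bank-conflict avoidance; plain Python, for exposition only.
def swizzle_column(row: int, col: int, banks: int = 32) -> int:
    """XOR the row into the column index so same-column accesses spread across banks."""
    return (col ^ row) % banks

# Naive layout: column 0 of all 32 rows hits bank 0 (a 32-way conflict).
# Swizzled layout: the same accesses land in 32 distinct banks.
naive = {0 for _ in range(32)}
swizzled = {swizzle_column(row, 0) for row in range(32)}
print(len(naive), len(swizzled))  # 1 32
```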
* Refactor AMD example script for FlashAttention-2
- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.
* Refactor formatting in AMD FlashAttention example script
- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.
* Update example_amd_flash_attn_fwd.py
* Enhance AMD example script and update CI workflows
- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.
* Remove redundant tool cache cleanup step in AMD CI workflow
* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
* Add new AMD FlashAttention example and test script
- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.
* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.
* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
* Add example for AMD Flash Attention backward pass implementation
- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.
This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
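A minimal sketch of the kind of PyTorch reference check described here, assuming `(batch, heads, seq, dim)` layouts; names and tolerances are illustrative rather than copied from the example.

```python
# Reference attention and autograd gradients to validate a custom kernel against.
import torch

def ref_attention(q, k, v, sm_scale):
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * sm_scale
    p = torch.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", p, v)

B, H, S, D = 1, 4, 128, 64
q, k, v = (torch.randn(B, H, S, D, requires_grad=True) for _ in range(3))
sm_scale = D ** -0.5

out = ref_attention(q, k, v, sm_scale)
dout = torch.randn_like(out)
dq, dk, dv = torch.autograd.grad(out, (q, k, v), dout)
# A TileLang kernel's outputs and gradients would be compared against out, dq, dk, dv,
# e.g. with torch.testing.assert_close(..., rtol=1e-2, atol=1e-2).
```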
* Enhance AMD Flash Attention example with additional testing capabilities
- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
* Refactor HIP intrinsic rules to CUDA
- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.
* Update AMD CI workflow to uninstall specific PyTorch packages before installation
- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
* Remove unused shared memory allocations in AMD Flash Attention backward example
- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
* Remove unnecessary pip uninstall command from AMD CI workflow
- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.
* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
* Refactor formatting of HIP intrinsic rule registrations
- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.
* Update file name and documentation for HIP intrinsic rules
- Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
- Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
* Enhance DispatchHIPShuffle function with clang-analyzer comments
- Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
- This change improves code clarity and maintains compliance with static analysis tools.
* lint fix
* fix
* Enhance autotuner configurations in example_amd_flash_attn_fwd.py by adding new block sizes, stages, and panel sizes. Update test script to use relative Python path and adjust parameters for consistency.
* Add backward attention example to test script
- Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
- Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
- Updated the command for the backward tile example to match the new configuration.
* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py
- Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
- Updated the FlashAttention forward and backward functions to improve performance and maintainability.
- Adjusted test script parameters for consistency and clarity, including the addition of group handling.
- Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
- Updated the main function to reflect changes in parameter names and types for better usability.
* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
- Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
- Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
- Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
- Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
- Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.
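The role of the LSE output can be seen in a few lines of PyTorch: saving the per-row log-sum-exp lets the backward pass rebuild the softmax probabilities directly from recomputed scores. This is a conceptual sketch with made-up shapes, not the kernel's actual buffers.

```python
# Why the forward pass saves log-sum-exp (LSE): the backward pass reconstructs
# the softmax probabilities as exp(scores - lse) without a second full softmax.
import torch

B, H, S, D = 1, 2, 64, 32
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
sm_scale = D ** -0.5

scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * sm_scale
lse = torch.logsumexp(scores, dim=-1, keepdim=True)   # saved by the forward pass
p = torch.exp(scores - lse)                            # rebuilt in the backward pass

assert torch.allclose(p, torch.softmax(scores, dim=-1), atol=1e-6)
```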
* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
- Introduced a scaling factor for improved numerical stability in gradient calculations.
- Optimized shared memory usage by adding new shared buffers for intermediate calculations.
- Refined the handling of tensor fragments to improve performance and maintainability.
- Updated the main function to ensure compatibility with the new output parameters for backward operations.
- Removed unnecessary parameters from the test script to streamline execution.
* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py
- Updated the forward and backward functions to improve numerical stability and performance.
- Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
- Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
- Added debugging and benchmarking functions for improved correctness verification and performance analysis.
- Updated the main function to reflect changes in parameter handling and ensure consistency across examples.
* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
- Updated scaling factor application for improved numerical stability in gradient calculations.
- Refined tensor handling to ensure consistency with forward pass operations.
- Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision.
- Adjusted comments for clarity and alignment with standard implementation practices.
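The precision motivation for accumulating dK/dV in fp32 can be shown with a toy example; it illustrates rounding error in low-precision accumulation, not the atomic operations themselves.

```python
# Repeatedly adding small fp16 partial results stalls once the running sum grows,
# while an fp32 accumulator stays close to the true value (100.0 here).
import torch

parts = torch.full((10000,), 0.01, dtype=torch.float16)
acc16 = torch.zeros((), dtype=torch.float16)
acc32 = torch.zeros((), dtype=torch.float32)
for x in parts:
    acc16 += x
    acc32 += x.float()
print(float(acc16), float(acc32))  # fp16 sum drifts well below 100.0
```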
* Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh
- Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
- Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
- Improved comments for clarity and alignment with the updated configurations.
* Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py
- Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost.
- Modified benchmarking functions to evaluate the complete forward and backward performance of both the reference and TileLang implementations.
- Improved comments for better understanding of the performance metrics and implementation details.
- Removed unnecessary parameter from test.sh to streamline execution.
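For reference, a common FLOPs accounting for attention (assumed here, not necessarily the exact formula used in the script) counts two S×S×D matmuls in the forward pass and five in the backward pass:

```python
# One common convention: backward costs 2.5x the forward, so fwd + bwd = 3.5x fwd.
def attention_flops(batch, heads, seq_len, dim, causal=False):
    fwd = 2 * 2 * batch * heads * seq_len * seq_len * dim  # QK^T and PV
    bwd = 2.5 * fwd                                        # dV, dP, dQ, dK, recomputed P
    total = fwd + bwd
    if causal:
        total *= 0.5  # roughly half the score matrix is masked out
    return total

print(attention_flops(batch=1, heads=16, seq_len=4096, dim=128) / 1e12, "TFLOPs")
```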
* Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.
* Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py
- Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
- Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
- Refined scaling factor calculations for numerical stability in both forward and backward passes.
- Improved comments and documentation for clarity and consistency across implementations.
- Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.
* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py
- Removed outdated comments and improved clarity in the code.
- Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
- Updated autotuner configurations to include new parameters for better performance tuning.
- Refined tensor handling and scaling factor calculations for improved numerical stability.
- Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.
* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
- Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
- Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns.
- Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
- Enhanced GEMM operations with k_pack for improved computational efficiency.
- Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.
* Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py
- Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
- Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
- Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
- Adjusted GEMM operations to improve computational efficiency without the use of k_pack.
* Enhance HIP code generation and FP8 type support
- Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
- Updated error logging to include unsupported FP8 type details for better debugging.
- Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
- Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
- Added overloads for AtomicAdd in common.h to support both pointer and value arguments.
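To get a feel for the ranges these float8 formats cover, PyTorch's matching dtypes can be queried; this assumes a PyTorch build with float8 support and is only an analogy for the HIP-side types (the e8m0 scale format is omitted).

```python
# Inspect the numeric range of the float8 variants named above via torch.finfo.
import torch

for dt in (torch.float8_e4m3fn, torch.float8_e4m3fnuz,
           torch.float8_e5m2, torch.float8_e5m2fnuz):
    info = torch.finfo(dt)
    print(f"{dt}: max={info.max}, smallest normal={info.tiny}")
```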
* Enhance FP8 type support and clarify accumulator handling in HIP
- Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
- Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
- Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.
* Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py
* Update print statement formatting for clarity in example_amd_flash_attn_bwd.py
* Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.
* Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.
* Refactor and enhance HIP code generation for improved FP8 support
- Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
- Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
- Updated AtomicAdd function in common.h to streamline its implementation.
- Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
- Added a function to manage the addition of new functions in the code generation process.
* Fix formatting issue in HIP code generation for MFMA call
- Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.
* Refactor HIP code generation and enhance FP8 type handling
- Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
- Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
- Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
- Refined the AddFunction method to streamline function addition in the code generation process.
* Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.
* Refactor backward attention implementation in example_amd_flash_attn_bwd.py
- Updated the GEMM operation to use shared memory for improved performance.
- Adjusted parallelization parameters to enhance efficiency in the backward pass.
* Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.
* Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations
* Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`
---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
File tree
9 files changed: +625 -325 lines
- examples/amd/flash_attention
- src/target
- src/tl_templates/hip
- testing/python/amd