[Documentation]: The overall tuning idea of aotriton #38
Yes. In fact, to avoid the problems caused by using extreme configs, we developed a more sophisticated tuning system (currently under refactoring due to design problems). However, either the old
The optimal kernel lookup in AOTriton is done in pure C++ by generated code. If you built AOTriton from source, this process can be found in the following file(s) (using
As I described above, the tuning is not done during the build for obvious reasons: you need the corresponding GPUs installed and a clean environment to tune any Triton kernel, which is impractical for a build node. The build process only uses the results stored in the tuning database.
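For intuition, here is a minimal sketch of what such a generated lookup might look like: runtime extents such as seqlen_q/seqlen_k are binned, and the bin indices select a pre-compiled kernel image whose tile sizes were chosen offline. All names, the binning scheme, and the table contents below are assumptions for illustration, not AOTriton's actual generated code.

```cpp
// Hypothetical sketch of a tuning-table-driven lookup in generated C++.
#include <cstdint>

namespace sketch {

struct KernelImage {
    const void* hsaco_blob;   // pre-compiled GPU code object baked into the library
    int block_m;              // tile sizes chosen by offline tuning
    int block_n;
    int num_warps;
};

// One entry per (seqlen_q bin, seqlen_k bin), filled from the tuning database
// at build time. Placeholder contents here; the real tables have one entry per
// functional signature as well (dtype, causal, bias type, ...).
constexpr int kNumBins = 2;
constexpr KernelImage kFwdTable[kNumBins][kNumBins] = {
    { {nullptr, 128, 64, 4}, {nullptr, 128, 64, 4} },
    { {nullptr,  64, 32, 4}, {nullptr, 128, 64, 4} },
};

// Map a runtime sequence length to a bin index (power-of-two binning assumed).
constexpr int bin_of(int64_t seqlen) {
    int bin = 0;
    for (int64_t t = 1024; t < seqlen && bin < kNumBins - 1; t *= 2) ++bin;
    return bin;
}

// The dispatcher only reads the baked-in table; no tuning happens at run time.
constexpr const KernelImage& select_fwd_kernel(int64_t seqlen_q, int64_t seqlen_k) {
    return kFwdTable[bin_of(seqlen_q)][bin_of(seqlen_k)];
}

} // namespace sketch
```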
It's expected to have worse performance on backward kernels. An upcoming release will fix the problem and match the Math backend's performance. Further improvements require additional work on Triton compiler optimizations.
If the optimal kernel for a given configuration stays the same, the tuning database will certainly not change, even if
Just use it as a conventional C++ library. You don't need to do anything to specify the optimal kernel (otherwise there would be no point in including a tuning database during the build). The upcoming release will have options to let you select kernels manually, but that is reserved for generating the tuning database. Normally you don't need to care about which kernel you eventually call, since the dispatcher will select the optimal one according to the tuning database used during compilation.
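To complement the table sketch above from the caller's side, the rough mental model is: user code calls one public entry point, the generated dispatcher picks the pre-compiled launcher from the baked-in table, and launches it; there is nothing for the caller to select. None of the names below are AOTriton's real API; this is only an illustration of the flow.

```cpp
// Hypothetical call path: the public entry point hides kernel selection.
#include <cstdint>

struct LaunchParams {
    // grid dimensions, tensor pointers, strides, scalar arguments, ...
};

// Stubs standing in for generated per-configuration launchers; the real ones
// would launch the corresponding .hsaco image.
static int launch_fwd_cfg0(const LaunchParams&) { return 0; }  // e.g. BLOCK_M=128, BLOCK_N=64
static int launch_fwd_cfg1(const LaunchParams&) { return 0; }  // e.g. BLOCK_M=64,  BLOCK_N=32

using Launcher = int (*)(const LaunchParams&);

// Baked at build time from tuning_database.sqlite3: problem-size bin -> launcher.
static const Launcher kFwdDispatch[] = { launch_fwd_cfg0, launch_fwd_cfg1 };

// The only thing user code calls; kernel selection is invisible to the caller.
int attn_fwd_public(const LaunchParams& p, int64_t seqlen_q, int64_t seqlen_k) {
    // Bin seqlen_q/seqlen_k the same way the generated autotune sources do
    // (placeholder: always pick bin 0 here).
    (void)seqlen_q; (void)seqlen_k;
    int bin = 0;
    return kFwdDispatch[bin](p);
}
```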
Thank you very much for your reply! I have the following questions and look forward to your response:

1: When I generate the static library libaotriton_v2.a using only the default tuning configuration 'BLOCK_M': 128, 'BLOCK_N': 64 in your code, and call it to test forward and backward performance against Flash V2 on the torch side, the forward performance of libaotriton_v2.a is slightly higher than Flash V2's, but the backward performance is much worse. So I added several sets of BLOCK_M/BLOCK_N configurations to try to find the best one for the backward pass. Using aotriton/tritonsrc/performance_forward.py I verified that the added 'BLOCK_M': 64, 'BLOCK_N': 32 configuration performs better, and I confirmed through tritonsrc/tune_flash.py that it was written into the database. After compilation succeeded, I also found in build/v2src/flash/autotune.bwd_kernel_dk_dv that 'BLOCK_M': 64, 'BLOCK_N': 32 corresponds to /public/home/zhangqha/test_code/aotriton/aotriton/build/v2src/flash/gpu_kernel_image.bwd_kernel_dk_dv/bwd_kernel_dk_dv-Sig-F__^bf16@16_16_False_False_False_1__P__64_32__CO__warp4_stg1_wave0-Gpu-K100_AI.hsaco, so this configuration does exist. Yet when testing the backward performance of the libaotriton_v2.a static library, it still did not improve. Can you explain in detail why?

2: I found that no matter how I adjust BLOCK_M and BLOCK_N, the backward results of the libaotriton_v2.a static library cannot surpass Flash V2. Can you explain the reason in detail? After testing, the results were stable at 3.48 TFLOPs/s for libaotriton_v2.a bwd and 4.72 TFLOPs/s for Flash V2 bwd. Are these results normal? Is there any other way to improve the backward performance of libaotriton_v2.a?

3: How can I find out which BLOCK_M/BLOCK_N kernel is called when I test libaotriton_v2.a with causal=False, nheads=64, headdim=64, batch_size=2, seqlen=2048? No matter whether I added better-performing BLOCK_M/BLOCK_N configurations to the database, and even though those configurations were present in the hsaco files after a successful compile, the backward operator performance basically did not change when I tested the static library. I suspect that libaotriton_v2.a did not actually use the kernel corresponding to the better configurations I added.

4: When I write my customized n_heads, seqlen_q and seqlen_k into the database through tritonsrc/tune_flash.py, for example n_heads 5 8 10 20 32 64 --d_head 16 32 64 128 --seqlen_q 64 128 256 512 1024 2048 4096 --seqlen_k 64 128 256 512 1024 2048 4096 --causal 0 1 --dropout_p 0.0 --dtype float16 bfloat16 --bias_type 0 1, I found that the .cc files generated in build/v2src/flash/autotune.attn_fwd after a successful compile only select the corresponding kernel through a length index over seqlen_q and seqlen_k. How are the n_heads and d_head parameters selected? When the kernel is selected through seqlen_q and seqlen_k, how does it determine which n_heads and d_head values should be used?

5: How do you judge that the kernel finally selected by the generated static library libaotriton_v2.a is the best one? Where is the specific implementation code?

I would be grateful if you could take the time to help answer these questions.
The backward performance is a known problem. The best performance is just on par with the Math backend, and we have already enumerated
The reason is complicated, but the short answer is that the Triton compiler lacks some optimizations employed by cutlass/CK.
Compile AOTriton with
D_HEAD is determined in Step 1. N_HEADS is not considered for tuning due to its low impact on performance: the number of heads directly translates to the number of GPU blocks, and each GPU block is processed independently.
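A minimal sketch of why N_HEADS is cheap to ignore for tuning: in the usual Flash-Attention launch pattern, each additional head only adds more independent GPU blocks to the grid; it does not change the per-block tile shape that BLOCK_M/BLOCK_N control. The grid formula below is the typical Triton Flash-Attention one and is assumed for illustration; AOTriton's exact launch code may differ.

```cpp
// Illustrative grid computation: head count only scales the number of blocks.
#include <cstdint>

struct Dim3 { uint32_t x, y, z; };

inline uint32_t cdiv(int64_t a, int64_t b) {
    return static_cast<uint32_t>((a + b - 1) / b);
}

// Typical Flash-Attention forward grid: one block per BLOCK_M-sized query tile,
// replicated across (batch * n_heads) independent (batch, head) pairs.
inline Dim3 fwd_grid(int64_t seqlen_q, int64_t batch, int64_t n_heads, int block_m) {
    return Dim3{ cdiv(seqlen_q, block_m),
                 static_cast<uint32_t>(batch * n_heads),
                 1u };
}
// e.g. seqlen_q=2048, batch=2, n_heads=64, BLOCK_M=128 -> grid = (16, 128, 1)
```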
For now it's determined by experimental results; see aotriton/test/mptune/core/cpp_autotune.py, line 12 at commit b5f8997 (largely copied from Triton, but with a validation step added).
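The actual logic lives in that Python tuning script; as a conceptual sketch only (all names here are made up), the "best kernel" judgment boils down to benchmarking every candidate configuration, validating its output against a reference, and keeping the fastest candidate that validates:

```cpp
// Conceptual "benchmark each candidate, validate, keep the fastest" sketch.
#include <chrono>
#include <functional>
#include <limits>
#include <string>
#include <vector>

struct Candidate {
    std::string name;                 // e.g. "BLOCK_M=64, BLOCK_N=32"
    std::function<void()> run;        // launches the kernel once
    std::function<bool()> validate;   // compares output against a reference
};

// Returns the index of the fastest candidate whose output also validates;
// candidates that produce wrong results are skipped entirely.
inline int pick_best(const std::vector<Candidate>& candidates, int reps = 20) {
    int best = -1;
    double best_ms = std::numeric_limits<double>::infinity();
    for (int i = 0; i < static_cast<int>(candidates.size()); ++i) {
        const auto& c = candidates[i];
        c.run();                      // warm-up launch
        if (!c.validate()) continue;  // the validation step added on top of plain Triton autotune
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r) c.run();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
        if (ms < best_ms) { best_ms = ms; best = i; }
    }
    return best;
}
```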
Thank you very much for your reply! I have the following questions and look forward to your reply: 1: The commit version I use is 04b5df8. 2: Below are the steps I took to use the aotriton library you made; I always feel that some steps are missing. Have I missed some important steps, or are some of them wrong? I look forward to your corrections, and I would be grateful if you could take the time to help answer these questions.
Description of errors
Hello, author. How should I call the best kernel after aotriton is compiled successfully? I see that you set the autotune parameter to False by default in the implementation of attn_torch_function.py. Does that mean the tuning process for the block_m and block_n parameters is cancelled? If it is cancelled, what is the meaning of the generated kernels, and what does the libaotriton_v2.a library generated at the end contain? Can you describe it in detail? I call the libaotriton_v2.a library on the PyTorch side to test the performance of the FA operator, and I set autotune to True, but there seems to be no process of finding the optimal kernel on the PyTorch side. Where is the process of finding and calling the optimal kernel implemented, and how? I also found that the content of the tuning_database.sqlite3 database did not change before and after compiling aotriton. What role does it play in the overall tuning process? Thank you very much for your answer!
Attach any links, screenshots, or additional evidence you think will be helpful.
No response