[Compilation] Optimization for compilation process #426
Conversation
(force-pushed from 529030c to c5fc4db)
Did you measure compilation speedup?
(force-pushed from c5fc4db to 403a108)
Optimize the compilation behavior for each core. Instead of generating a CUDA source file for each core, combine them into a single source file to reduce compilation overhead.
(force-pushed from 403a108 to 51fb21a)
32-core i9-13900K + 4090
Didn't you notice how many candidates there are in matmul and bert?
https://docs.google.com/spreadsheets/d/1uVBh1hCafwhOEeRdrFnuoaz4HOFECyr5cl7WnuMm7Ns/edit?usp=sharing
a. 'lower' refers to the time spent on IR lowering (kernel-level optimization).
b. 'compile_opt' is the optimized branch.
Useful data, thx @hunbssfy
Thanks for the clarification. How many CPU threads did you have on the machine where you measured? BTW, from your spreadsheet it is clear that the more candidates we have, the bigger the improvement. matmul, conv, and attn have a similar number of candidates. For matmul and conv the improvement is 8.05x and 7.22x, but attn gets only 3.83x. Maybe it's worth looking into. Maybe the reason is just that attn has much more code in its .cu file, but maybe there is a misprint or we didn't take something into account.
@vadiklyutiy I used 32 cores, so that is 32 threads. I also noticed this unbalanced acceleration percentage. I'll post new discoveries if I get any.
Thanks @hunbssfy @vadiklyutiy! What if different IRModules have functions with the same name but different implementations? For example, we might have a function in the schedule template which has different implementations for different hyper-parameters. When we merge the generated code of multiple IRModules, there might be a name conflict issue.
But as I understand it, they are finally linked into lib.o. Does this issue exist in the current implementation?
@yaoyaoding could you respond?
They are private functions in each …
@yaoyaoding I went through the generated .cu file during compilation. Here is one of them:
As you can see, all candidates have a …
They look normal, so I don't think there will be a function name conflict issue.
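As a quick sanity check, one can scan a merged source for duplicated kernel names; a minimal sketch (the path is hypothetical, and only the `__global__` declaration pattern is assumed):

```python
import re
from collections import Counter

# Hypothetical path to one of the merged candidate source files.
source_path = './outs/candidates/source.cu'

with open(source_path) as f:
    code = f.read()

# Collect every __global__ kernel name and report any duplicates.
kernel_names = re.findall(r'__global__\s+\w+\s+(\w+)\s*\(', code)
duplicates = [name for name, count in Counter(kernel_names).items() if count > 1]
print('kernels found:', len(kernel_names))
print('duplicated names:', duplicates or 'none')
```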
@yaoyaoding I inspected the final lib.so file. There are two types of APIs which are GLOBAL. The first one is the …
The second one is these hidet_launch functions: …
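For reference, that inspection can be scripted; a sketch using binutils' `nm` (the `lib.so` path is illustrative):

```python
import subprocess

# List dynamically visible, defined symbols in the compiled library
# and keep the hidet_launch entry points.
result = subprocess.run(
    ['nm', '-D', '--defined-only', 'lib.so'],
    capture_output=True, text=True, check=True,
)
launchers = [line for line in result.stdout.splitlines() if 'hidet_launch' in line]
print('\n'.join(launchers))
```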
I read all the comments again and it seems there is some misunderstanding. So this …
cannot happen. Does it make sense?
Even when the candidates are for the same operator, they are represented by different IRModules. The problem I previously worried about will not happen when all the batch-compiled IRModules come from the same operator, as we set a different namespace for each IRModule. I used namespaces to avoid name conflicts in the shared library when I grouped all the kernels into the same shared library (previously, they were in different shared libraries). This alleviates the potential function name conflict problem. There is still a potential name conflict if the batch compile API (i.e., …
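A minimal sketch of that namespacing scheme, assuming merging happens at the generated-source level (the helper name and the `candidate_{i}` naming are illustrative, not hidet's actual internals):

```python
def merge_candidate_sources(candidate_sources):
    """Wrap each candidate's generated CUDA/C++ code in its own
    namespace before concatenation, so identically named functions
    from different IRModules cannot collide in one translation unit."""
    parts = []
    for idx, source in enumerate(candidate_sources):
        parts.append('namespace candidate_{} {{\n{}\n}}  // namespace candidate_{}'
                     .format(idx, source, idx))
    return '\n\n'.join(parts)
```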
Hi @yaoyaoding, I added a uniqueness check on function names after the regrouping of IRModules.
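Something along these lines (a sketch; it assumes each IRModule exposes a name-to-function mapping like hidet's `functions` dict):

```python
from collections import Counter

def check_unique_function_names(ir_modules):
    """Raise if two IRModules that will be batch compiled would emit
    a function with the same name (names drawn from each module's
    `functions` mapping; assumed structure)."""
    counts = Counter(name for module in ir_modules for name in module.functions)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError('duplicate function names across IRModules: {}'.format(duplicates))
```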
Allow prologue for fp32 `reduce`. `reduce` uses vectorized calculations that don't allow fusing (it is possible but not implemented yet). For fp32 there are no vectors, so we can enable fusion (with a small modification to the `reduce` kernels themselves). Motivation: in llama2, part of the calculation is fp32, including `pow`+`reduce`. Performance improvement on llama2-7B: +0.241%.
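The rule reduces to a dtype check on the reduce task; a hedged sketch (the method name follows hidet's `allow_prologue` task convention, but the exact attribute access is an assumption):

```python
def allow_prologue(self) -> bool:
    # Vectorized (e.g. fp16x2) reduce kernels cannot fuse a prologue
    # yet; plain fp32 kernels are scalar, so fusion is safe to enable.
    return self.inputs[0].type.dtype.name == 'float32'
```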
Optimize the compilation behavior for each core. Instead of generating a CUDA source file for each core, combine them into a single source file to reduce compilation overhead.
Initial tests reveal a 20% decrease in the total compilation+run time of the bert model (222s -> 177s), and a similar percentage decrease in operator tests.
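Conceptually, the change amounts to something like the following per-worker step (paths and nvcc flags are illustrative, not hidet's actual build pipeline):

```python
import os
import subprocess

def compile_batch(candidate_sources, out_dir='./outs'):
    """Write all of a worker's candidate sources into one .cu file and
    invoke nvcc once, instead of once per candidate."""
    os.makedirs(out_dir, exist_ok=True)
    src = os.path.join(out_dir, 'candidates.cu')
    with open(src, 'w') as f:
        f.write('\n\n'.join(candidate_sources))
    subprocess.run(
        ['nvcc', '-O3', '--shared', '-Xcompiler', '-fPIC',
         '-o', os.path.join(out_dir, 'lib.so'), src],
        check=True,
    )
```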