[Compilation] Optimization for compilation process #426
Conversation
(force-pushed from 529030c to c5fc4db)
Did you measure compilation speedup?
(force-pushed from c5fc4db to 403a108)
Optimize the compilation behavior for each core. Instead of generating a CUDA source file for each core, combine them into a single source file to reduce compilation overhead.
(force-pushed from 403a108 to 51fb21a)
32-core i9-13900K + 4090
Didn't you notice how many candidates there are in matmul and bert?
https://docs.google.com/spreadsheets/d/1uVBh1hCafwhOEeRdrFnuoaz4HOFECyr5cl7WnuMm7Ns/edit?usp=sharing
a. 'lower' refers to the time spent on IR lowering (kernel-level optimization).
b. 'compile_opt' is the optimized branch.
Useful data, thx @hunbssfy
Thanks for the clarification. How many CPU threads did you have on the machine where you measured? BTW, from your spreadsheet it is clear that the more candidates we have, the bigger the improvement. matmul, conv, and attn have a similar number of candidates. For matmul and conv the improvement is 8.05x and 7.22x, but attn gets only 3.83x. Maybe it's worth looking into. Maybe the reason is just that attn has much more code in its .cu file, but maybe there is a misprint or we didn't take something into account.
@vadiklyutiy I used 32 cores, so that is 32 threads. I also noticed this unbalanced acceleration percentage. I'll post new discoveries if I get any.
Thanks @hunbssfy @vadiklyutiy! What if different IRModules have functions with the same name but different implementations? For example, we might have a function in the schedule template which has different implementations for different hyper-parameters. When we merge the generated code of multiple IRModules, there might be a name conflict issue.
But as I understand it, they are finally linked into lib.o. Does this issue exist in the current implementation?
@yaoyaoding could you respond?
They are private functions in each …
@yaoyaoding I went through the generated .cu file during compilation. Here is one of them:
As you can see, all candidates have a …
They look normal, so I don't think there will be a function name conflict issue.
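As a quick sanity check, one can scan a merged source for duplicated kernel names; a minimal sketch (the path is hypothetical, and only the `__global__` declaration pattern is assumed):

```python
import re
from collections import Counter

# Hypothetical path to one of the merged candidate source files.
source_path = './outs/candidates/source.cu'

with open(source_path) as f:
    code = f.read()

# Collect every __global__ kernel name and report any duplicates.
kernel_names = re.findall(r'__global__\s+\w+\s+(\w+)\s*\(', code)
duplicates = [name for name, count in Counter(kernel_names).items() if count > 1]
print('kernels found:', len(kernel_names))
print('duplicated names:', duplicates or 'none')
```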
@yaoyaoding I inspected the final lib.so file. There are two types of APIs which are GLOBAL. The first one is the …
The second one is these hidet_launch functions: …
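For reference, that inspection can be scripted; a sketch using binutils' `nm` (the `lib.so` path is illustrative):

```python
import subprocess

# List dynamically visible, defined symbols in the compiled library
# and keep the hidet_launch entry points.
result = subprocess.run(
    ['nm', '-D', '--defined-only', 'lib.so'],
    capture_output=True, text=True, check=True,
)
launchers = [line for line in result.stdout.splitlines() if 'hidet_launch' in line]
print('\n'.join(launchers))
```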
I read all the comments again and it seems there is some misunderstanding. So this …
cannot happen. Does it make sense?
Even when the candidates are for the same operator, they are represented by different IRModules. The problem I previously worried about will not happen when all the batch-compiled IRModules come from the same operator, as we set a different namespace for each IRModule. I used namespaces to avoid name conflicts in the shared library when I grouped all the kernels into the same shared library (previously, they were in different shared libraries). This alleviates the potential function name conflict problem. There is still a potential name conflict if the batch compile API (i.e., …
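A minimal sketch of that namespacing scheme, assuming merging happens at the generated-source level (the helper name and the `candidate_{i}` naming are illustrative, not hidet's actual internals):

```python
def merge_candidate_sources(candidate_sources):
    """Wrap each candidate's generated CUDA/C++ code in its own
    namespace before concatenation, so identically named functions
    from different IRModules cannot collide in one translation unit."""
    parts = []
    for idx, source in enumerate(candidate_sources):
        parts.append('namespace candidate_{} {{\n{}\n}}  // namespace candidate_{}'
                     .format(idx, source, idx))
    return '\n\n'.join(parts)
```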
Hi @yaoyaoding, I added a uniqueness check on function names after the regrouping of IRModules.
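Something along these lines (a sketch; it assumes each IRModule exposes a name-to-function mapping like hidet's `functions` dict):

```python
from collections import Counter

def check_unique_function_names(ir_modules):
    """Raise if two IRModules that will be batch compiled would emit
    a function with the same name (names drawn from each module's
    `functions` mapping; assumed structure)."""
    counts = Counter(name for module in ir_modules for name in module.functions)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError('duplicate function names across IRModules: {}'.format(duplicates))
```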
Allow prologue for fp32 `reduce`. `reduce` uses vectorized calculations that don't allow fusing (it is possible but not implemented yet). For fp32 there are no vectors, so we can enable fusion (with a small modification to the `reduce` kernels themselves). Motivation: in llama2, part of the calculation is fp32, including `pow`+`reduce`. Performance improvement on llama2-7B: +0.241%.
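The rule reduces to a dtype check on the reduce task; a hedged sketch (the method name follows hidet's `allow_prologue` task convention, but the exact attribute access is an assumption):

```python
def allow_prologue(self) -> bool:
    # Vectorized (e.g. fp16x2) reduce kernels cannot fuse a prologue
    # yet; plain fp32 kernels are scalar, so fusion is safe to enable.
    return self.inputs[0].type.dtype.name == 'float32'
```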
Optimize the compilation behavior for each core. Instead of generating a CUDA source file for each core, combine them into a single source file to reduce compilation overhead.
Initial tests reveal a 20% decrease in the total compilation+run time of the bert model (222s -> 177s), and a similar percentage decrease in operator tests.
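Conceptually, the change amounts to something like the following per-worker step (paths and nvcc flags are illustrative, not hidet's actual build pipeline):

```python
import os
import subprocess

def compile_batch(candidate_sources, out_dir='./outs'):
    """Write all of a worker's candidate sources into one .cu file and
    invoke nvcc once, instead of once per candidate."""
    os.makedirs(out_dir, exist_ok=True)
    src = os.path.join(out_dir, 'candidates.cu')
    with open(src, 'w') as f:
        f.write('\n\n'.join(candidate_sources))
    subprocess.run(
        ['nvcc', '-O3', '--shared', '-Xcompiler', '-fPIC',
         '-o', os.path.join(out_dir, 'lib.so'), src],
        check=True,
    )
```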