[Opt] CFG optimization scales superlinearly #1785
Comments
Log during debug:
Sounds like a nice solution! We only need to traverse the |
Well... it's not that trivial.
When replacing |
OIC... traverse the |
After #1789:
#1789 significantly improved the monolithic kernel version: 6 min -> 3.6 min. I believe there is still some room for improvement :-) I did some profiling: Monolithic: (3.648 min)
Breakdown: (1.025 min)
Some performance analysis from
Compiled with
If we can add something like |
Exactly. It may be hard to maintain a DU/UD chain that stays consistent with the IR throughout the compilation process, given how Taichi is designed at this point. However, every time before we have a large number of invocations of |
Sounds good! But even if we only use the structure in |
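To make the DU/UD idea above concrete, here is a minimal Python sketch of building a use index once before a batch of replacements instead of rescanning the whole IR each time. This is not Taichi's actual IR or API: `Stmt`, `UseIndex`, `replace_usages`, and `replace_usages_by_scan` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so Stmt can be a dict key
class Stmt:
    """Hypothetical IR statement: a name plus the statements it reads."""
    name: str
    operands: list = field(default_factory=list)


def replace_usages_by_scan(stmts, old, new):
    """Baseline: every replacement walks the whole statement list."""
    for s in stmts:
        s.operands = [new if op is old else op for op in s.operands]


class UseIndex:
    """Def-use index built once in a single pass over the statements."""

    def __init__(self, stmts):
        # Maps a statement to the (user, operand slot) pairs that read it.
        self.users = defaultdict(list)
        for s in stmts:
            for slot, op in enumerate(s.operands):
                self.users[op].append((s, slot))

    def replace_usages(self, old, new):
        """Rewrite only the users of `old`, and keep the index consistent."""
        for user, slot in self.users.pop(old, []):
            user.operands[slot] = new
            self.users[new].append((user, slot))


a, b = Stmt("a"), Stmt("b")
c = Stmt("c", [a, a])
d = Stmt("d", [a, c])
index = UseIndex([a, b, c, d])
index.replace_usages(a, b)  # touches c and d only, not the whole list
```

With the index, each replacement costs time proportional to the number of users of `old`, while the scan costs time proportional to the total number of statements. The index is only valid as long as every IR mutation goes through it, which is the consistency concern raised above.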
Describe the issue
As we get more and more users, some crazy usages of Taichi emerge. For example, @squarefk reported a case that generates 100K statements after AST lowering and takes 6+ min to compile: https://github.com/yuanming-hu/taichi/blob/ipc/newton_ipc_whole.py#L147
More interestingly, if we break the kernel down into a few smaller kernels (https://github.com/yuanming-hu/taichi/blob/ipc/newton_ipc_breakdown.py#L147), compilation only takes ~1 min.
To Reproduce
Just download three files
Run the two executable files to see the compilation time difference.
Log/Screenshots
Monolithic kernel: https://github.com/yuanming-hu/taichi/blob/ipc/time_whole.txt
Break down the kernel into smaller pieces: https://github.com/yuanming-hu/taichi/blob/ipc/time_breakdown.txt
AST lowering also scales superlinearly (which is expected). Is there any possibility to make CFG optimization scale (close to) linearly?
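As a rough illustration of why splitting a kernel helps when a pass is superlinear, here is a toy cost model in Python. It assumes a pass costs n**p for n statements and that the kernel splits into k equal pieces; `split_speedup`, `k`, and `p` are hypothetical and are not measured from the logs above.

```python
def split_speedup(k: int, p: float) -> float:
    """Speedup from splitting one kernel into k equal kernels under a toy model.

    Assumed monolithic cost: n**p.
    Assumed split cost:      k * (n / k)**p.
    The ratio of the two is k**(p - 1), independent of n.
    """
    return k ** (p - 1)


# A linear pass (p = 1) gains nothing from splitting.
linear = split_speedup(4, 1.0)

# A quadratic pass (p = 2) split into 4 pieces is 4x cheaper in this model.
quadratic = split_speedup(4, 2.0)
```

Under this model any speedup from splitting points to p > 1, so a pass that scaled linearly would make the monolithic and split versions cost about the same.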