Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Opt] CFG optimization scales superlinearly #1785

Open
yuanming-hu opened this issue Aug 27, 2020 · 10 comments · Fixed by #1789
Open

[Opt] CFG optimization scales superlinearly #1785

yuanming-hu opened this issue Aug 27, 2020 · 10 comments · Fixed by #1789
Assignees
Labels
advanced optimization The issue or bug is related to advanced optimization

Comments

@yuanming-hu
Copy link
Member

Describe the issue
As we get more and more users, some crazy usages of Taichi emerge. For example, @squarefk reported a case that generates 100K statements after AST lower that takes 6 min+ to compile: https://github.com/yuanming-hu/taichi/blob/ipc/newton_ipc_whole.py#L147

More interestingly, if we break the kernel down into a few smaller kernels (https://github.com/yuanming-hu/taichi/blob/ipc/newton_ipc_breakdown.py#L147), compilation only takes ~1 min.

To Reproduce
Just download three files

Run the two executable files to see the compilation time difference.

Log/Screenshots

Monolithic kernel: https://github.com/yuanming-hu/taichi/blob/ipc/time_whole.txt
Break down the kernel into smaller pieces: https://github.com/yuanming-hu/taichi/blob/ipc/time_breakdown.txt

AST lowering also scales superlinearly (which is expected). Is there any possibility to make CFG optimization scale (close to) linearly?

@yuanming-hu yuanming-hu added the advanced optimization The issue or bug is related to advanced optimization label Aug 27, 2020
@xumingkuan
Copy link
Collaborator

Log during debug:

3.098  m 56.73%  taichi::lang::irpass::cfg_optimization [5 x  37.182  s]
    2.290  m 73.90%  taichi::lang::ControlFlowGraph::store_to_load_forwarding [8 x  17.173  s]
        0.371  m 16.22%  taichi::lang::ControlFlowGraph::reaching_definition_analysis [8 x   2.785  s]
        1.771  m 77.35%  taichi::lang::CFGNode::store_to_load_forwarding [552 x 192.506 ms]
            0.340  m 19.19%  taichi::lang::CFGNode::get_store_forwarding_data [38957 x 523.385 us]
                3.960  s 19.42%  taichi::lang::get_last_def_position [38957 x 101.650 us]
               16.430  s 80.58%  [unaccounted]
            1.429  m 80.70%  taichi::lang::CFGNode::store_to_load_forwarding::<lambda_545caafaedb0e029b916eca97c063686>::operator () [26745 x   3.206 ms]
                1.424  m 99.63%  taichi::lang::Stmt::replace_with [26745 x   3.194 ms]
        0.147  m  6.43%  taichi::lang::CFGNode::identical_store_elimination [493 x  17.928 ms]
            8.814  s 99.72%  taichi::lang::CFGNode::get_store_forwarding_data [19938 x 442.065 us]
                5.199  s 58.99%  taichi::lang::get_last_def_position [19938 x 260.779 us]
                3.614  s 41.01%  [unaccounted]
    0.798  m 25.74%  taichi::lang::ControlFlowGraph::dead_store_elimination [8 x   5.982  s]
       39.507  s 82.55%  taichi::lang::ControlFlowGraph::live_variable_analysis [8 x   4.938  s]
        6.130  s 12.81%  taichi::lang::Stmt::replace_with [2419 x   2.534 ms]
        2.222  s  4.64%  [unaccounted]
    0.007  m  0.23%  taichi::lang::irpass::die [5 x  83.714 ms]

1.424 m out of 3.098 m's time spends on Stmt::replace_with(Stmt *), which is a crucial operation upon elimination of a statement during store-to-load forwarding. Maybe we can make Stmt::replace_with(Stmt *) bottom-up instead of top-down to avoid traversing unrelated offloads?

@yuanming-hu
Copy link
Member Author

1.424 m out of 3.098 m's time spends on Stmt::replace_with(Stmt *), which is a crucial operation upon elimination of a statement during store-to-load forwarding. Maybe we can make Stmt::replace_with(Stmt *) bottom-up instead of top-down to avoid traversing unrelated offloads?

Sounds like a nice solution! We only need to traverse the Block that directly contains the statement, since the statement is not even visible to other parts of the IR :-)

@xumingkuan
Copy link
Collaborator

Well... it's not that trivial.

kernel {
$1 = arg[0]
$2 = arg[0]
$3 : if $1 {
  $4 = global ptr index [$2]
}
}

When replacing $2 with $1, we need to dive into $3.

@xumingkuan
Copy link
Collaborator

OIC... traverse the Block that directly contains the statement is enough.

@xumingkuan
Copy link
Collaborator

After #1789:

1.263  m 44.31%  taichi::lang::irpass::cfg_optimization [5 x  15.157  s]
0.720  m 56.97%  taichi::lang::ControlFlowGraph::store_to_load_forwarding [7 x   6.168  s]
   15.596  s 36.12%  taichi::lang::ControlFlowGraph::reaching_definition_analysis [7 x   2.228  s]
    5.711  s 13.23%  taichi::lang::irpass::replace_all_usages_with [26745 x 213.535 us]
   21.867  s 50.65%  [unaccounted]
0.534  m 42.30%  taichi::lang::ControlFlowGraph::dead_store_elimination [7 x   4.579  s]
   30.094  s 93.88%  taichi::lang::ControlFlowGraph::live_variable_analysis [7 x   4.299  s]
    0.294  s  0.92%  taichi::lang::irpass::replace_all_usages_with [2419 x 121.356 us]
    1.668  s  5.20%  [unaccounted]
0.006  m  0.44%  taichi::lang::irpass::die [5 x  66.122 ms]

@yuanming-hu
Copy link
Member Author

#1789 significantly improved the monolithic kernel version: 6m -> 3.6min. I believe there is still some space for improvement :-)

I did some profiling:

Monolithic: (3.648 min)

[Profiler thread 139902860150592]
    232.932 ms clone_runtime_module          [1 x 232.932 ms]
        221.446 ms 95.07%  compile_runtime_bitcode [1 x 221.446 ms]
          8.137 ms  3.49%  module_from_bitcode_file [1 x   8.137 ms]
          3.318 ms  1.42%  clone module          [1 x   3.318 ms]
    109.512 ms run                           [1 x 109.512 ms]
          0.058 ms  0.05%  generate_types        [79 x 736.381 ns]
          0.242 ms  0.22%  generate_child_accessors [1 x 241.995 us]
             13.113 us  5.42%  generate_refine_coordinates [1 x  13.113 us]
            227.928 us 94.19%  generate_child_accessors [25 x   9.117 us]
                129.700 us 56.90%  generate_refine_coordinates [25 x   5.188 us]
                 60.320 us 26.46%  generate_child_accessors [53 x   1.138 us]
                 37.909 us 16.63%  [unaccounted]
          3.121 ms  2.85%  clone_struct_module   [1 x   3.121 ms]
          0.680 ms  0.62%  eliminate_unused_functions [1 x 680.208 us]
        100.982 ms 92.21%  global_optimize_module_cpu [1 x 100.982 ms]
              2.074 ms  2.05%  llvm_function_pass    [1 x   2.074 ms]
             97.335 ms 96.39%  llvm_module_pass      [1 x  97.335 ms]
              1.573 ms  1.56%  [unaccounted]
          4.429 ms  4.04%  [unaccounted]
      3.648  m compile                       [1 x   3.648  m]
          3.624  m 99.32%  compile_to_executable [1 x   3.624  m]
              2.422  m 66.83%  compile_to_offloads   [1 x   2.422  m]
                  0.302  m 12.49%  lower_ast             [1 x  18.141  s]
                      3.926  s 21.64%  replace_all_usages_with [28649 x 137.045 us]
                     14.215  s 78.36%  [unaccounted]
                  0.002  m  0.07%  type_check            [1 x  98.246 ms]
                  0.001  m  0.05%  verify                [9 x   7.831 ms]
                  0.000  m  0.00%  loop_vectorize        [1 x   2.469 ms]
                  0.000  m  0.00%  vector_split          [1 x   1.483 ms]
                  2.035  m 84.05%  full_simplify         [3 x  40.706  s]
                      0.098  m  4.83%  extract_constant      [7 x 841.862 ms]
                      0.000  m  0.01%  unreachable_code_elimination [7 x   1.660 ms]
                      0.000  m  0.01%  binary_op_simplify    [7 x 989.641 us]
                      0.005  m  0.26%  constant_fold         [7 x  45.666 ms]
                        280.449 ms 87.73%  replace_all_usages_with [1007 x 278.500 us]
                         16.371 ms  5.12%  compile               [2 x   8.186 ms]
                              0.113 ms  0.69%  compile_to_executable [2 x  56.505 us]
                                 53.167 us 47.05%  compile_to_offloads   [2 x  26.584 us]
                                      3.338 us  6.28%  lower_ast             [2 x   1.669 us]
                                      3.815 us  7.17%  type_check            [2 x   1.907 us]
                                      3.815 us  7.17%  verify                [4 x 953.674 ns]
                                      1.192 us  2.24%  demote_operations     [2 x 596.046 ns]
                                     38.862 us 73.09%  offload               [2 x  19.431 us]
                                          2.146 us  5.52%  type_check            [4 x 536.442 ns]
                                         36.716 us 94.48%  [unaccounted]
                                      2.146 us  4.04%  [unaccounted]
                                 56.982 us 50.42%  offload_to_executable [2 x  28.491 us]
                                      1.907 us  3.35%  type_check            [6 x 317.891 ns]
                                      6.199 us 10.88%  verify                [12 x 516.574 ns]
                                      2.146 us  3.77%  make_thread_local     [2 x   1.073 us]
                                          0.000 us  0.00%  type_check            [2 x   0.000 ns]
                                          2.146 us 100.00%  [unaccounted]
                                      1.907 us  3.35%  remove_range_assumption [2 x 953.674 ns]
                                      1.907 us  3.35%  die                   [2 x 953.674 ns]
                                      0.000 us  0.00%  flag_access           [2 x   0.000 ns]
                                      1.907 us  3.35%  demote_atomics        [2 x 953.674 ns]
                                          0.954 us 50.00%  type_check            [2 x 476.837 ns]
                                          0.954 us 50.00%  [unaccounted]
                                      0.954 us  1.67%  demote_operations     [2 x 476.837 ns]
                                     38.147 us 66.95%  full_simplify         [2 x  19.073 us]
                                          0.000 us  0.00%  extract_constant      [2 x   0.000 ns]
                                          0.954 us  2.50%  unreachable_code_elimination [2 x 476.837 ns]
                                          0.000 us  0.00%  binary_op_simplify    [2 x   0.000 ns]
                                          1.192 us  3.12%  constant_fold         [2 x 596.046 ns]
                                          1.907 us  5.00%  die                   [6 x 317.891 ns]
                                          0.954 us  2.50%  alg_simp              [2 x 476.837 ns]
                                          0.954 us  2.50%  simplify              [2 x 476.837 ns]
                                          4.292 us 11.25%  whole_kernel_cse      [2 x   2.146 us]
                                         22.888 us 60.00%  cfg_optimization      [2 x  11.444 us]
                                              8.106 us 35.42%  store_to_load_forwarding [2 x   4.053 us]
                                                  4.053 us 50.00%  reaching_definition_analysis [2 x   2.027 us]
                                                  4.053 us 50.00%  [unaccounted]
                                              8.106 us 35.42%  dead_store_elimination [2 x   4.053 us]
                                                  4.053 us 50.00%  live_variable_analysis [2 x   2.027 us]
                                                  4.053 us 50.00%  [unaccounted]
                                              0.000 us  0.00%  die                   [2 x   0.000 ns]
                                              6.676 us 29.17%  [unaccounted]
                                          5.007 us 13.12%  [unaccounted]
                                      1.907 us  3.35%  [unaccounted]
                                  2.861 us  2.53%  [unaccounted]
                             16.248 ms 99.25%  compile               [2 x   8.124 ms]
                                 16.246 ms 99.99%  codegen               [2 x   8.123 ms]
                                      6.645 ms 40.90%  clone_struct_module   [2 x   3.322 ms]
                                      0.000 ms  0.00%  CodeGenLLVMCPU        [2 x   0.000 ns]
                                      0.051 ms  0.31%  emit_to_module        [2 x  25.511 us]
                                      9.533 ms 58.68%  compile_module_to_executable [2 x   4.766 ms]
                                          1.668 ms 17.50%  eliminate_unused_functions [2 x 834.107 us]
                                          6.247 ms 65.53%  global_optimize_module_cpu [2 x   3.123 ms]
                                              0.900 ms 14.40%  llvm_function_pass    [2 x 449.896 us]
                                              3.440 ms 55.07%  llvm_module_pass      [2 x   1.720 ms]
                                              1.907 ms 30.53%  [unaccounted]
                                          1.618 ms 16.97%  [unaccounted]
                         22.839 ms  7.14%  [unaccounted]
                      0.003  m  0.13%  die                   [21 x   7.609 ms]
                      0.016  m  0.76%  alg_simp              [7 x 133.346 ms]
                        884.548 ms 94.76%  replace_all_usages_with [3314 x 266.913 us]
                         48.871 ms  5.24%  [unaccounted]
                      0.000  m  0.02%  simplify              [7 x   3.305 ms]
                          5.263 ms 22.75%  replace_all_usages_with [32 x 164.457 us]
                         17.870 ms 77.25%  [unaccounted]
                      0.891  m 43.77%  whole_kernel_cse      [5 x  10.691  s]
                         20.715  s 38.75%  replace_all_usages_with [20129 x   1.029 ms]
                         32.740  s 61.25%  [unaccounted]
                      1.022  m 50.21%  cfg_optimization      [5 x  12.263  s]
                          0.561  m 54.87%  store_to_load_forwarding [7 x   4.807  s]
                             13.148  s 39.07%  reaching_definition_analysis [7 x   1.878  s]
                              4.213  s 12.52%  replace_all_usages_with [26745 x 157.541 us]
                             16.286  s 48.40%  [unaccounted]
                          0.456  m 44.60%  dead_store_elimination [7 x   3.907  s]
                             25.724  s 94.06%  live_variable_analysis [7 x   3.675  s]
                              0.206  s  0.75%  replace_all_usages_with [2419 x  85.032 us]
                              1.417  s  5.18%  [unaccounted]
                          0.003  m  0.30%  die                   [5 x  36.463 ms]
                  0.000  m  0.00%  flag_access           [2 x 365.496 us]
                  0.031  m  1.27%  offload               [1 x   1.843  s]
                      0.000  s  0.02%  replace_all_usages_with [7 x  63.385 us]
                      0.030  s  1.61%  type_check            [2 x  14.811 ms]
                      1.813  s 98.37%  [unaccounted]
                  0.050  m  2.08%  cfg_optimization      [1 x   3.016  s]
                      1.736  s 57.56%  store_to_load_forwarding [1 x   1.736  s]
                          1.161  s 66.88%  reaching_definition_analysis [1 x   1.161  s]
                          0.575  s 33.12%  [unaccounted]
                      1.248  s 41.39%  dead_store_elimination [1 x   1.248  s]
                          1.008  s 80.72%  live_variable_analysis [1 x   1.008  s]
                          0.241  s 19.28%  [unaccounted]
                      0.006  s  0.20%  die                   [1 x   5.953 ms]
                      0.026  s  0.86%  [unaccounted]
              1.202  m 33.17%  offload_to_executable [1 x   1.202  m]
                  0.001  m  0.06%  type_check            [3 x  13.612 ms]
                  0.001  m  0.06%  verify                [6 x   7.497 ms]
                  0.001  m  0.05%  make_thread_local     [1 x  38.603 ms]
                     13.164 ms 34.10%  type_check            [1 x  13.164 ms]
                     25.439 ms 65.90%  [unaccounted]
                  0.000  m  0.00%  remove_range_assumption [1 x 106.096 us]
                  0.000  m  0.03%  die                   [1 x  18.679 ms]
                  0.000  m  0.00%  flag_access           [1 x   1.185 ms]
                  0.001  m  0.12%  demote_atomics        [1 x  86.074 ms]
                     55.050 ms 63.96%  replace_all_usages_with [837 x  65.771 us]
                     18.401 ms 21.38%  type_check            [1 x  18.401 ms]
                     12.623 ms 14.67%  [unaccounted]
                  0.000  m  0.00%  demote_operations     [1 x 826.120 us]
                  1.198  m 99.63%  full_simplify         [1 x   1.198  m]
                      0.000  m  0.00%  extract_constant      [10 x 338.769 us]
                      0.000  m  0.01%  unreachable_code_elimination [10 x 718.093 us]
                      0.002  m  0.15%  binary_op_simplify    [10 x  10.692 ms]
                         82.454 ms 77.12%  replace_all_usages_with [2780 x  29.660 us]
                         24.468 ms 22.88%  [unaccounted]
                      0.002  m  0.18%  constant_fold         [10 x  12.625 ms]
                         87.482 ms 69.29%  replace_all_usages_with [2879 x  30.386 us]
                         17.913 ms 14.19%  compile               [2 x   8.956 ms]
                              0.129 ms  0.72%  compile_to_executable [2 x  64.492 us]
                                 55.075 us 42.70%  compile_to_offloads   [2 x  27.537 us]
                                      2.384 us  4.33%  lower_ast             [2 x   1.192 us]
                                      5.722 us 10.39%  type_check            [2 x   2.861 us]
                                      5.007 us  9.09%  verify                [4 x   1.252 us]
                                      0.954 us  1.73%  demote_operations     [2 x 476.837 ns]
                                     37.909 us 68.83%  offload               [2 x  18.954 us]
                                          4.053 us 10.69%  type_check            [4 x   1.013 us]
                                         33.855 us 89.31%  [unaccounted]
                                      3.099 us  5.63%  [unaccounted]
                                 72.002 us 55.82%  offload_to_executable [2 x  36.001 us]
                                      4.053 us  5.63%  type_check            [6 x 675.519 ns]
                                      4.768 us  6.62%  verify                [12 x 397.364 ns]
                                      0.954 us  1.32%  make_thread_local     [2 x 476.837 ns]
                                        953.674 ns type_check                    [2 x 476.837 ns]
                                      0.000 us  0.00%  remove_range_assumption [2 x   0.000 ns]
                                      5.245 us  7.28%  die                   [2 x   2.623 us]
                                      0.954 us  1.32%  flag_access           [2 x 476.837 ns]
                                      1.907 us  2.65%  demote_atomics        [2 x 953.674 ns]
                                          0.954 us 50.00%  type_check            [2 x 476.837 ns]
                                          0.954 us 50.00%  [unaccounted]
                                      1.192 us  1.66%  demote_operations     [2 x 596.046 ns]
                                     46.968 us 65.23%  full_simplify         [2 x  23.484 us]
                                          1.192 us  2.54%  extract_constant      [2 x 596.046 ns]
                                          0.954 us  2.03%  unreachable_code_elimination [2 x 476.837 ns]
                                          1.907 us  4.06%  binary_op_simplify    [2 x 953.674 ns]
                                          0.954 us  2.03%  constant_fold         [2 x 476.837 ns]
                                          4.053 us  8.63%  die                   [6 x 675.519 ns]
                                          0.000 us  0.00%  alg_simp              [2 x   0.000 ns]
                                          2.146 us  4.57%  simplify              [2 x   1.073 us]
                                          5.960 us 12.69%  whole_kernel_cse      [2 x   2.980 us]
                                         27.895 us 59.39%  cfg_optimization      [2 x  13.947 us]
                                              7.153 us 25.64%  store_to_load_forwarding [2 x   3.576 us]
                                                  6.199 us 86.67%  reaching_definition_analysis [2 x   3.099 us]
                                                  0.954 us 13.33%  [unaccounted]
                                             10.967 us 39.32%  dead_store_elimination [2 x   5.484 us]
                                                  5.960 us 54.35%  live_variable_analysis [2 x   2.980 us]
                                                  5.007 us 45.65%  [unaccounted]
                                              0.000 us  0.00%  die                   [2 x   0.000 ns]
                                              9.775 us 35.04%  [unaccounted]
                                          1.907 us  4.06%  [unaccounted]
                                      5.960 us  8.28%  [unaccounted]
                                  1.907 us  1.48%  [unaccounted]
                             17.774 ms 99.23%  compile               [2 x   8.887 ms]
                                 17.774 ms 100.00%  codegen              [2 x   8.887 ms]
                                      7.337 ms 41.28%  clone_struct_module   [2 x   3.669 ms]
                                      0.001 ms  0.01%  CodeGenLLVMCPU        [2 x 476.837 ns]
                                      0.055 ms  0.31%  emit_to_module        [2 x  27.537 us]
                                     10.369 ms 58.34%  compile_module_to_executable [2 x   5.185 ms]
                                          1.866 ms 17.99%  eliminate_unused_functions [2 x 932.932 us]
                                          6.814 ms 65.71%  global_optimize_module_cpu [2 x   3.407 ms]
                                              0.954 ms 14.00%  llvm_function_pass    [2 x 477.076 us]
                                              3.820 ms 56.06%  llvm_module_pass      [2 x   1.910 ms]
                                              2.040 ms 29.93%  [unaccounted]
                                          1.689 ms 16.29%  [unaccounted]
                         20.857 ms 16.52%  [unaccounted]
                      0.002  m  0.20%  die                   [30 x   4.715 ms]
                      0.002  m  0.14%  alg_simp              [10 x   9.963 ms]
                         88.913 ms 89.24%  replace_all_usages_with [2719 x  32.701 us]
                         10.716 ms 10.76%  [unaccounted]
                      0.287  m 24.00%  simplify              [10 x   1.724  s]
                          0.974  s  5.65%  replace_all_usages_with [13490 x  72.220 us]
                         15.122  s 87.69%  type_check            [5488 x   2.755 ms]
                          1.148  s  6.66%  [unaccounted]
                      0.897  m 74.93%  whole_kernel_cse      [9 x   5.983  s]
                          3.984  s  7.40%  replace_all_usages_with [31273 x 127.399 us]
                         49.859  s 92.60%  [unaccounted]
                      0.005  m  0.40%  cfg_optimization      [9 x  31.955 ms]
                        123.532 ms 42.95%  store_to_load_forwarding [10 x  12.353 ms]
                             28.977 ms 23.46%  reaching_definition_analysis [10 x   2.898 ms]
                             58.298 ms 47.19%  replace_all_usages_with [1660 x  35.119 us]
                             36.256 ms 29.35%  [unaccounted]
                        128.024 ms 44.52%  dead_store_elimination [10 x  12.802 ms]
                             61.522 ms 48.06%  live_variable_analysis [10 x   6.152 ms]
                             66.502 ms 51.94%  [unaccounted]
                         31.756 ms 11.04%  die                   [9 x   3.528 ms]
                          4.281 ms  1.49%  [unaccounted]
          0.025  m  0.68%  compile               [1 x   1.489  s]
              1.489  s 100.00%  codegen              [1 x   1.489  s]
                  0.003  s  0.23%  clone_struct_module   [1 x   3.453 ms]
                  0.000  s  0.00%  CodeGenLLVMCPU        [1 x 953.674 ns]
                  0.017  s  1.14%  emit_to_module        [1 x  17.038 ms]
                  1.468  s 98.57%  compile_module_to_executable [1 x   1.468  s]
                      0.003  s  0.23%  eliminate_unused_functions [1 x   3.390 ms]
                      1.420  s 96.74%  global_optimize_module_cpu [1 x   1.420  s]
                          0.021  s  1.47%  llvm_function_pass    [1 x  20.880 ms]
                          1.388  s 97.70%  llvm_module_pass      [1 x   1.388  s]
                          0.012  s  0.83%  [unaccounted]
                      0.045  s  3.03%  [unaccounted]

Breakdown: (1.025 min)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[Profiler thread 140333951227712]
    236.372 ms clone_runtime_module          [1 x 236.372 ms]
        224.851 ms 95.13%  compile_runtime_bitcode [1 x 224.851 ms]
          8.239 ms  3.49%  module_from_bitcode_file [1 x   8.239 ms]
          3.254 ms  1.38%  clone module          [1 x   3.254 ms]
    110.436 ms run                           [1 x 110.436 ms]
          0.063 ms  0.06%  generate_types        [79 x 796.741 ns]
          0.280 ms  0.25%  generate_child_accessors [1 x 280.142 us]
             13.113 us  4.68%  generate_refine_coordinates [1 x  13.113 us]
            265.837 us 94.89%  generate_child_accessors [25 x  10.633 us]
                145.197 us 54.62%  generate_refine_coordinates [25 x   5.808 us]
                 75.817 us 28.52%  generate_child_accessors [53 x   1.431 us]
                 44.823 us 16.86%  [unaccounted]
          3.074 ms  2.78%  clone_struct_module   [1 x   3.074 ms]
          0.743 ms  0.67%  eliminate_unused_functions [1 x 743.151 us]
        101.740 ms 92.13%  global_optimize_module_cpu [1 x 101.740 ms]
              2.159 ms  2.12%  llvm_function_pass    [1 x   2.159 ms]
             98.080 ms 96.40%  llvm_module_pass      [1 x  98.080 ms]
              1.501 ms  1.48%  [unaccounted]
          4.536 ms  4.11%  [unaccounted]
      1.025  m compile                       [7 x   8.783  s]
          0.998  m 97.40%  compile_to_executable [7 x   8.555  s]
             33.097  s 55.27%  compile_to_offloads   [7 x   4.728  s]
                  4.626  s 13.98%  lower_ast             [7 x 660.808 ms]
                      2.773  s 59.95%  replace_all_usages_with [28649 x  96.791 us]
                      1.853  s 40.05%  [unaccounted]
                  0.101  s  0.31%  type_check            [7 x  14.444 ms]
                  0.064  s  0.19%  verify                [63 x   1.015 ms]
                  0.002  s  0.01%  loop_vectorize        [7 x 290.837 us]
                  0.001  s  0.00%  vector_split          [7 x 125.715 us]
                 27.097  s 81.87%  full_simplify         [21 x   1.290  s]
                      0.843  s  3.11%  extract_constant      [49 x  17.194 ms]
                      0.006  s  0.02%  unreachable_code_elimination [49 x 116.227 us]
                      0.005  s  0.02%  binary_op_simplify    [49 x  92.195 us]
                      0.199  s  0.73%  constant_fold         [49 x   4.060 ms]
                        163.406 ms 82.14%  replace_all_usages_with [1007 x 162.270 us]
                         16.323 ms  8.21%  compile               [2 x   8.162 ms]
                              0.100 ms  0.61%  compile_to_executable [2 x  49.949 us]
                                 43.869 us 43.91%  compile_to_offloads   [2 x  21.935 us]
                                      1.907 us  4.35%  lower_ast             [2 x 953.674 ns]
                                      2.861 us  6.52%  type_check            [2 x   1.431 us]
                                      5.245 us 11.96%  verify                [4 x   1.311 us]
                                      0.000 us  0.00%  demote_operations     [2 x   0.000 ns]
                                     31.948 us 72.83%  offload               [2 x  15.974 us]
                                          1.907 us  5.97%  type_check            [4 x 476.837 ns]
                                         30.041 us 94.03%  [unaccounted]
                                      1.907 us  4.35%  [unaccounted]
                                 56.028 us 56.09%  offload_to_executable [2 x  28.014 us]
                                      3.815 us  6.81%  type_check            [6 x 635.783 ns]
                                      5.960 us 10.64%  verify                [12 x 496.705 ns]
                                      0.000 us  0.00%  make_thread_local     [2 x   0.000 ns]
                                          0.000 ns type_check                    [2 x   0.000 ns]
                                      1.907 us  3.40%  remove_range_assumption [2 x 953.674 ns]
                                      2.146 us  3.83%  die                   [2 x   1.073 us]
                                      0.000 us  0.00%  flag_access           [2 x   0.000 ns]
                                      0.000 us  0.00%  demote_atomics        [2 x   0.000 ns]
                                          0.000 ns type_check                    [2 x   0.000 ns]
                                      0.000 us  0.00%  demote_operations     [2 x   0.000 ns]
                                     36.955 us 65.96%  full_simplify         [2 x  18.477 us]
                                          1.907 us  5.16%  extract_constant      [2 x 953.674 ns]
                                          0.000 us  0.00%  unreachable_code_elimination [2 x   0.000 ns]
                                          1.192 us  3.23%  binary_op_simplify    [2 x 596.046 ns]
                                          0.954 us  2.58%  constant_fold         [2 x 476.837 ns]
                                          2.861 us  7.74%  die                   [6 x 476.837 ns]
                                          0.954 us  2.58%  alg_simp              [2 x 476.837 ns]
                                          1.192 us  3.23%  simplify              [2 x 596.046 ns]
                                          4.053 us 10.97%  whole_kernel_cse      [2 x   2.027 us]
                                         23.842 us 64.52%  cfg_optimization      [2 x  11.921 us]
                                              5.960 us 25.00%  store_to_load_forwarding [2 x   2.980 us]
                                                  4.053 us 68.00%  reaching_definition_analysis [2 x   2.027 us]
                                                  1.907 us 32.00%  [unaccounted]
                                              8.106 us 34.00%  dead_store_elimination [2 x   4.053 us]
                                                  4.053 us 50.00%  live_variable_analysis [2 x   2.027 us]
                                                  4.053 us 50.00%  [unaccounted]
                                              0.000 us  0.00%  die                   [2 x   0.000 ns]
                                              9.775 us 41.00%  [unaccounted]
                                      5.245 us  9.36%  [unaccounted]
                             16.215 ms 99.34%  compile               [2 x   8.108 ms]
                                 16.213 ms 99.99%  codegen               [2 x   8.106 ms]
                                      6.467 ms 39.89%  clone_struct_module   [2 x   3.233 ms]
                                      0.001 ms  0.01%  CodeGenLLVMCPU        [2 x 476.837 ns]
                                      0.054 ms  0.33%  emit_to_module        [2 x  26.941 us]
                                      9.680 ms 59.70%  compile_module_to_executable [2 x   4.840 ms]
                                          1.702 ms 17.58%  eliminate_unused_functions [2 x 851.035 us]
                                          6.386 ms 65.97%  global_optimize_module_cpu [2 x   3.193 ms]
                                              0.997 ms 15.61%  llvm_function_pass    [2 x 498.414 us]
                                              3.382 ms 52.96%  llvm_module_pass      [2 x   1.691 ms]
                                              2.007 ms 31.43%  [unaccounted]
                                          1.592 ms 16.45%  [unaccounted]
                         19.209 ms  9.66%  [unaccounted]
                      0.108  s  0.40%  die                   [147 x 733.492 us]
                      0.624  s  2.30%  alg_simp              [49 x  12.741 ms]
                        578.442 ms 92.66%  replace_all_usages_with [3314 x 174.545 us]
                         45.852 ms  7.34%  [unaccounted]
                      0.011  s  0.04%  simplify              [49 x 225.111 us]
                          0.944 ms  8.56%  replace_all_usages_with [26 x  36.313 us]
                         10.086 ms 91.44%  [unaccounted]
                      9.160  s 33.81%  whole_kernel_cse      [31 x 295.500 ms]
                          1.932  s 21.09%  replace_all_usages_with [20039 x  96.399 us]
                          7.229  s 78.91%  [unaccounted]
                     16.142  s 59.57%  cfg_optimization      [31 x 520.710 ms]
                         14.169  s 87.78%  store_to_load_forwarding [41 x 345.579 ms]
                              1.765  s 12.46%  reaching_definition_analysis [41 x  43.046 ms]
                              4.221  s 29.79%  replace_all_usages_with [26745 x 157.822 us]
                              8.183  s 57.75%  [unaccounted]
                          1.819  s 11.27%  dead_store_elimination [41 x  44.354 ms]
                              0.336  s 18.48%  live_variable_analysis [41 x   8.195 ms]
                              0.206  s 11.35%  replace_all_usages_with [2419 x  85.340 us]
                              1.276  s 70.17%  [unaccounted]
                          0.129  s  0.80%  die                   [31 x   4.151 ms]
                  0.000  s  0.00%  flag_access           [14 x  29.325 us]
                  0.317  s  0.96%  offload               [7 x  45.275 ms]
                      0.132 ms  0.04%  replace_all_usages_with [7 x  18.869 us]
                     27.595 ms  8.71%  type_check            [14 x   1.971 ms]
                    289.201 ms 91.25%  [unaccounted]
                  0.888  s  2.68%  cfg_optimization      [7 x 126.901 ms]
                    678.513 ms 76.38%  store_to_load_forwarding [7 x  96.930 ms]
                        182.341 ms 26.87%  reaching_definition_analysis [7 x  26.049 ms]
                        496.172 ms 73.13%  [unaccounted]
                    203.234 ms 22.88%  dead_store_elimination [7 x  29.033 ms]
                         16.619 ms  8.18%  live_variable_analysis [7 x   2.374 ms]
                        186.615 ms 91.82%  [unaccounted]
                      3.622 ms  0.41%  die                   [7 x 517.436 us]
             26.785  s 44.73%  offload_to_executable [7 x   3.826  s]
                  0.039  s  0.15%  type_check            [21 x   1.873 ms]
                  0.030  s  0.11%  verify                [42 x 706.599 us]
                  0.037  s  0.14%  make_thread_local     [7 x   5.341 ms]
                     12.461 ms 33.33%  type_check            [7 x   1.780 ms]
                     24.925 ms 66.67%  [unaccounted]
                  0.000  s  0.00%  remove_range_assumption [7 x  10.422 us]
                  0.011  s  0.04%  die                   [7 x   1.630 ms]
                  0.001  s  0.00%  flag_access           [7 x  75.408 us]
                  0.082  s  0.31%  demote_atomics        [7 x  11.694 ms]
                     52.477 ms 64.11%  replace_all_usages_with [837 x  62.696 us]
                     16.997 ms 20.76%  type_check            [7 x   2.428 ms]
                     12.383 ms 15.13%  [unaccounted]
                  0.001  s  0.00%  demote_operations     [7 x 105.756 us]
                 26.550  s 99.12%  full_simplify         [7 x   3.793  s]
                      0.004  s  0.01%  extract_constant      [67 x  57.587 us]
                      0.007  s  0.02%  unreachable_code_elimination [67 x  98.623 us]
                      0.098  s  0.37%  binary_op_simplify    [67 x   1.462 ms]
                         76.587 ms 78.19%  replace_all_usages_with [2780 x  27.549 us]
                         21.360 ms 21.81%  [unaccounted]
                      0.114  s  0.43%  constant_fold         [67 x   1.700 ms]
                         79.873 ms 70.11%  replace_all_usages_with [2879 x  27.743 us]
                         16.242 ms 14.26%  compile               [2 x   8.121 ms]
                              0.107 ms  0.66%  compile_to_executable [2 x  53.406 us]
                                 46.968 us 43.97%  compile_to_offloads   [2 x  23.484 us]
                                      1.907 us  4.06%  lower_ast             [2 x 953.674 ns]
                                      3.099 us  6.60%  type_check            [2 x   1.550 us]
                                      6.676 us 14.21%  verify                [4 x   1.669 us]
                                      0.954 us  2.03%  demote_operations     [2 x 476.837 ns]
                                     33.140 us 70.56%  offload               [2 x  16.570 us]
                                          3.099 us  9.35%  type_check            [4 x 774.860 ns]
                                         30.041 us 90.65%  [unaccounted]
                                      1.192 us  2.54%  [unaccounted]
                                 57.936 us 54.24%  offload_to_executable [2 x  28.968 us]
                                      0.954 us  1.65%  type_check            [6 x 158.946 ns]
                                      5.722 us  9.88%  verify                [12 x 476.837 ns]
                                      1.192 us  2.06%  make_thread_local     [2 x 596.046 ns]
                                          0.000 us  0.00%  type_check            [2 x   0.000 ns]
                                          1.192 us 100.00%  [unaccounted]
                                      1.192 us  2.06%  remove_range_assumption [2 x 596.046 ns]
                                      1.907 us  3.29%  die                   [2 x 953.674 ns]
                                      0.000 us  0.00%  flag_access           [2 x   0.000 ns]
                                      1.192 us  2.06%  demote_atomics        [2 x 596.046 ns]
                                          0.000 us  0.00%  type_check            [2 x   0.000 ns]
                                          1.192 us 100.00%  [unaccounted]
                                      0.000 us  0.00%  demote_operations     [2 x   0.000 ns]
                                     36.001 us 62.14%  full_simplify         [2 x  18.001 us]
                                          0.000 us  0.00%  extract_constant      [2 x   0.000 ns]
                                          0.954 us  2.65%  unreachable_code_elimination [2 x 476.837 ns]
                                          1.907 us  5.30%  binary_op_simplify    [2 x 953.674 ns]
                                          0.000 us  0.00%  constant_fold         [2 x   0.000 ns]
                                          5.245 us 14.57%  die                   [6 x 874.201 ns]
                                          0.000 us  0.00%  alg_simp              [2 x   0.000 ns]
                                          0.000 us  0.00%  simplify              [2 x   0.000 ns]
                                          4.768 us 13.25%  whole_kernel_cse      [2 x   2.384 us]
                                         21.219 us 58.94%  cfg_optimization      [2 x  10.610 us]
                                              7.153 us 33.71%  store_to_load_forwarding [2 x   3.576 us]
                                                  3.099 us 43.33%  reaching_definition_analysis [2 x   1.550 us]
                                                  4.053 us 56.67%  [unaccounted]
                                              8.106 us 38.20%  dead_store_elimination [2 x   4.053 us]
                                                  3.815 us 47.06%  live_variable_analysis [2 x   1.907 us]
                                                  4.292 us 52.94%  [unaccounted]
                                              0.954 us  4.49%  die                   [2 x 476.837 ns]
                                              5.007 us 23.60%  [unaccounted]
                                          1.907 us  5.30%  [unaccounted]
                                      9.775 us 16.87%  [unaccounted]
                                  1.907 us  1.79%  [unaccounted]
                             16.128 ms 99.30%  compile               [2 x   8.064 ms]
                                 16.128 ms 100.00%  codegen              [2 x   8.064 ms]
                                      6.652 ms 41.25%  clone_struct_module   [2 x   3.326 ms]
                                      0.000 ms  0.00%  CodeGenLLVMCPU        [2 x   0.000 ns]
                                      0.062 ms  0.38%  emit_to_module        [2 x  30.994 us]
                                      9.402 ms 58.30%  compile_module_to_executable [2 x   4.701 ms]
                                          1.605 ms 17.07%  eliminate_unused_functions [2 x 802.517 us]
                                          6.323 ms 67.25%  global_optimize_module_cpu [2 x   3.161 ms]
                                              0.877 ms 13.87%  llvm_function_pass    [2 x 438.452 us]
                                              3.532 ms 55.86%  llvm_module_pass      [2 x   1.766 ms]
                                              1.914 ms 30.27%  [unaccounted]
                                          1.474 ms 15.68%  [unaccounted]
                         17.807 ms 15.63%  [unaccounted]
                      0.104  s  0.39%  die                   [201 x 519.639 us]
                      0.099  s  0.37%  alg_simp              [67 x   1.482 ms]
                         88.015 ms 88.63%  replace_all_usages_with [2719 x  32.370 us]
                         11.287 ms 11.37%  [unaccounted]
                     16.650  s 62.71%  simplify              [67 x 248.512 ms]
                          1.069  s  6.42%  replace_all_usages_with [13508 x  79.171 us]
                         14.471  s 86.91%  type_check            [5500 x   2.631 ms]
                          1.110  s  6.67%  [unaccounted]
                      9.208  s 34.68%  whole_kernel_cse      [60 x 153.461 ms]
                          2.304  s 25.03%  replace_all_usages_with [31297 x  73.631 us]
                          6.903  s 74.97%  [unaccounted]
                      0.265  s  1.00%  cfg_optimization      [60 x   4.421 ms]
                        108.261 ms 40.81%  store_to_load_forwarding [67 x   1.616 ms]
                             21.595 ms 19.95%  reaching_definition_analysis [67 x 322.313 us]
                             54.887 ms 50.70%  replace_all_usages_with [1660 x  33.064 us]
                             31.779 ms 29.35%  [unaccounted]
                        125.547 ms 47.33%  dead_store_elimination [67 x   1.874 ms]
                             60.096 ms 47.87%  live_variable_analysis [67 x 896.956 us]
                             65.451 ms 52.13%  [unaccounted]
                         28.903 ms 10.90%  die                   [60 x 481.725 us]
                          2.537 ms  0.96%  [unaccounted]
          0.027  m  2.60%  compile               [7 x 228.216 ms]
              1.598  s 100.00%  codegen              [7 x 228.215 ms]
                  0.025  s  1.59%  clone_struct_module   [7 x   3.625 ms]
                  0.000  s  0.00%  CodeGenLLVMCPU        [7 x 613.076 ns]
                  0.017  s  1.06%  emit_to_module        [7 x   2.418 ms]
                  1.555  s 97.33%  compile_module_to_executable [7 x 222.122 ms]
                      0.008  s  0.50%  eliminate_unused_functions [7 x   1.115 ms]
                      1.492  s 95.94%  global_optimize_module_cpu [7 x 213.096 ms]
                          0.026  s  1.74%  llvm_function_pass    [7 x   3.711 ms]
                          1.448  s 97.05%  llvm_module_pass      [7 x 206.812 ms]
                          0.018  s  1.21%  [unaccounted]
                      0.055  s  3.56%  [unaccounted]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

@yuanming-hu yuanming-hu reopened this Aug 29, 2020
@yuanming-hu
Copy link
Member Author

yuanming-hu commented Sep 3, 2020

Some performance analysis from perf report --no-children

Samples: 297K of event 'cycles:ppp', Event count (approx.): 322121848389
  Overhead  Command  Shared Object                                      Symbol                                                                   ◆
+   14.76%  python3  taichi_core.so                                     [.] taichi::lang::Stmt::replace_operand_with                             ▒
+    5.49%  python3  libstdc++.so.6.0.26                                [.] __cxxabiv1::__si_class_type_info::__do_dyncast                       ▒
+    4.93%  python3  libstdc++.so.6.0.26                                [.] __dynamic_cast                                                       ▒
+    4.81%  python3  libc-2.27.so                                       [.] cfree@GLIBC_2.2.5                                                    ▒
+    4.49%  python3  taichi_core.so                                     [.] taichi::lang::Stmt::has_operand                                      ▒
+    4.07%  python3  libc-2.27.so                                       [.] _int_malloc                                                          ▒
+    4.00%  python3  libc-2.27.so                                       [.] __strcmp_sse2_unaligned                                              ▒
+    3.95%  python3  taichi_core.so                                     [.] taichi::lang::promoted_type                                          ▒
+    3.54%  python3  taichi_core.so                                     [.] taichi::lang::StatementUsageReplace::visit                           ▒
+    2.71%  python3  libc-2.27.so                                       [.] malloc                                                               ▒
+    2.61%  python3  taichi_core.so                                     [.] taichi::lang::BasicStmtVisitor::visit                                ▒
+    2.49%  python3  taichi_core.so                                     [.] taichi::lang::IRNodeComparator::basic_check                          ▒
+    2.30%  python3  libstdc++.so.6.0.26                                [.] std::type_info::operator==                                           ▒
+    1.98%  python3  libstdc++.so.6.0.26                                [.] std::_Rb_tree_insert_and_rebalance                                   ▒
+    1.36%  python3  taichi_core.so                                     [.] taichi::lang::LowerAST::visit                                        ▒
+    1.14%  python3  taichi_core.so                                     [.] std::_Rb_tree<std::pair<taichi::lang::DataType, taichi::lang::DataTyp▒
+    1.08%  python3  libstdc++.so.6.0.26                                [.] __cxxabiv1::__class_type_info::__do_dyncast                          ▒
+    0.94%  python3  libstdc++.so.6.0.26                                [.] strcmp@plt                                                           ▒
+    0.87%  python3  taichi_core.so                                     [.] std::_Rb_tree<std::pair<taichi::lang::DataType, taichi::lang::DataTyp▒
+    0.79%  python3  taichi_core.so                                     [.] taichi::lang::IRVisitor::visit                                       ▒
+    0.79%  python3  python3.6                                          [.] _PyEval_EvalFrameDefault                                             ▒
+    0.74%  python3  taichi_core.so                                     [.] taichi::lang::WholeKernelCSE::visit                                  ▒
+    0.71%  python3  taichi_core.so                                     [.] taichi::lang::IRNodeComparator::get_other_id                         ▒
+    0.67%  python3  taichi_core.so                                     [.] std::_Rb_tree<std::pair<taichi::lang::DataType, taichi::lang::DataTyp▒
+    0.66%  python3  taichi_core.so                                     [.] taichi::lang::irpass::analysis::alias_analysis                       ▒
+    0.65%  python3  taichi_core.so                                     [.] taichi::lang::IRVisitor::visit                                       ▒
+    0.61%  python3  libstdc++.so.6.0.26                                [.] std::__detail::_Prime_rehash_policy::_M_next_bkt                     ▒
+    0.58%  python3  taichi_core.so                                     [.] taichi::lang::irpass::analysis::get_store_destination                ▒
+    0.57%  python3  libstdc++.so.6.0.26                                [.] operator new                                                         ▒
+    0.55%  python3  taichi_core.so                                     [.] taichi::lang::MarkUndone::visit                                      ▒
+    0.54%  python3  taichi_core.so                                     [.] std::_Hashtable<int, int, std::allocator<int>, std::__detail::_Identi▒
+    0.53%  python3  libstdc++.so.6.0.26                                [.] std::local_Rb_tree_rotate_left                                       ▒
+    0.53%  python3  taichi_core.so                                     [.] taichi::lang::CFGNode::get_store_forwarding_data                     ▒
+    0.52%  python3  taichi_core.so                                     [.] taichi::lang::BasicBlockSimplify::visit                              ▒
     0.48%  python3  taichi_core.so                                     [.] taichi::lang::StatementUsageReplace::visit                           ▒
     0.47%  python3  taichi_core.so                                     [.] taichi::lang::Simplify::visit                                        ▒
     0.46%  python3  libstdc++.so.6.0.26                                [.] std::__detail::_Prime_rehash_policy::_M_need_rehash                  ▒
     0.46%  python3  taichi_core.so                                     [.] taichi::lang::irpass::analysis::same_statements                      ▒
     0.45%  python3  taichi_core.so                                     [.] taichi::lang::BinaryOpStmt::accept                                   ▒
     0.44%  python3  taichi_core.so                                     [.] taichi::lang::Block::erase                                           ▒
     0.43%  python3  taichi_core.so                                     [.] taichi::lang::LocalLoadStmt::accept                                  

Compiled with

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb -g3 -fno-omit-frame-pointer")

@xumingkuan
Copy link
Collaborator

xumingkuan commented Sep 4, 2020

If Stmt::replace_operand_with takes too much time, maybe we want to keep track of every usage of every statement, so that we only need to traverse the statements having the statement to be deleted as an operand/operands.

We can add something like std::unordered_(multi)map<Stmt *, int> Stmt::usages. int is for the location of the operand or the number of the same operand. If we record neither the location nor the number (i.e., simply use std::unordered_set<Stmt *>), a statement changing from binary_add($1, $2) to binary_add($1, $1) and then to binary_add($1, $2) again will cause problems: we shouldn't remove this statement from $1->usages in this case.

@yuanming-hu
Copy link
Member Author

yuanming-hu commented Sep 7, 2020

Exactly.

It may be hard to maintain a DU/UD chain that is consistent with the IR through the compilation process, given how Taichi is designed at this point.

However, every time before we have a large number of invocation of replace_operand_with, we can build the structure you mentioned temporarily, and discard it after the transform pass is over so that we don't need to maintain it in the future.

@xumingkuan
Copy link
Collaborator

However, every time before we have a large number of invocation of replace_operand_with, we can build the structure you mentioned temporarily, and discard it after the transform pass is over so that we don't need to maintain it in the future.

Sounds good! But even if we only use the structure in cfg_optimization, we may need to maintain it in the cfg_optimization pass, i.e., in CFGNode::erase, insert, replace_with. I wonder if later other passes happen to have a large number of invocation of replace_operand_with, will we need to rewrite the code to maintain the structure?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
advanced optimization The issue or bug is related to advanced optimization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants