
[Opt] Dead store and stack-related operation elimination by control-flow graph #1324

Merged · 9 commits · Jun 26, 2020

Conversation

@xumingkuan (Collaborator) commented Jun 25, 2020

Related issues: #926, #656
Related PRs: #932, #1248
Replaces: #859, #857

Change in build_cfg:

  • Add an empty final_node, and assume all global pointers are read after it (just as all global pointers are assumed to be written before the start node). This makes the analysis of global pointers more convenient.
  • Record for each CFGNode whether it is executed in parallel (i.e., whether it is in the body of an offloaded range_for/struct_for); see the sketch after this list.
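
A minimal sketch of these two changes, using simplified stand-ins for Taichi's real CFGNode and ControlFlowGraph classes (push_node and append_final_node are hypothetical helpers introduced for illustration):

#include <memory>
#include <vector>

// Simplified stand-in for Taichi's CFGNode.
struct CFGNode {
  // True when the node is in the body of an offloaded range_for/struct_for,
  // i.e., it may be executed by many threads at once.
  bool parallel_executed = false;
  std::vector<CFGNode *> prev, next;
};

// Simplified stand-in for Taichi's ControlFlowGraph.
struct ControlFlowGraph {
  std::vector<std::unique_ptr<CFGNode>> nodes;
  CFGNode *start_node = nullptr;
  CFGNode *final_node = nullptr;

  CFGNode *push_node(bool parallel_executed) {
    nodes.push_back(std::make_unique<CFGNode>());
    nodes.back()->parallel_executed = parallel_executed;
    return nodes.back().get();
  }
};

// Append an empty final node after all real nodes. The analysis then
// treats every global pointer as if it were read after this node,
// mirroring the assumed writes to all global pointers before start_node,
// so stores to global pointers that reach the end are never seen as dead.
void append_final_node(ControlFlowGraph &cfg, CFGNode *last_real_node) {
  cfg.final_node = cfg.push_node(/*parallel_executed=*/false);
  last_real_node->next.push_back(cfg.final_node);
  cfg.final_node->prev.push_back(last_real_node);
}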

Change in CFGNode:

Change in compile_to_offloads:

  • Remove the variable_optimization pass. (Faster compilation!)

Benchmark:

Compilation time (#926 (comment)):
6.872 min -> 1.164 min
(codegen_kernel_statements: 29087 -> 28573)

Number of statements:

[chart: benchmark20200625_2 — statement counts before/after this PR]

The only case that becomes worse is:

@ti.all_archs
def test_offload_with_cross_block_locals3():
    ret = ti.var(ti.f32, shape=())

    @ti.kernel
    def ker():
        s = 1
        t = s
        for i in range(10):
            s += i
        ret[None] = t

    ker()

    assert ret[None] == 1

Previously, we didn't treat global temporaries as global pointers, so we could simplify the whole kernel to ret[None] = 1...



@xumingkuan (Collaborator, Author) commented:

We can weaken an AtomicOpStmt to a load if its destination is never loaded afterwards:

$1 = alloca
$2 = local store $1 <- const[10]
$3 = atomic add($1, const[5]) (can be replaced with local load $1)
$4 = print $3

We can do this optimization not only when atomic->dest is an alloca but also when it's a global temp -- as long as there are no loads of the destination between the atomic and the next store.

However, we can't do this optimization if the node is executed in parallel:

$0 = offloaded range_for(0, 8) {
  $1 = global temp [offset=0]
  $2 = atomic add($1, const[5]) (cannot be replaced!)
  $3 = global ptr a
  $4 = global store $3 <- $2
}
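
A sketch of this rule as a predicate (not the actual pass in this PR; the types and the two query functions are hypothetical stand-ins assumed to be implemented elsewhere):

// Hypothetical stand-ins for the IR classes; the real Taichi types differ.
struct Stmt {};
struct AtomicOpStmt : Stmt {
  Stmt *dest = nullptr;
};

// Assumed queries, implemented elsewhere in a real pass.
bool is_alloca_or_global_temp(const Stmt *dest);
bool dest_loaded_before_next_store(const AtomicOpStmt *atomic);

// Whether `atomic` may be weakened to a plain load of its destination.
bool can_weaken_atomic_to_load(const AtomicOpStmt *atomic,
                               bool in_parallel_executed_node) {
  // In an offloaded range_for/struct_for body, other iterations may
  // observe the atomic's side effect, so it must stay atomic.
  if (in_parallel_executed_node)
    return false;
  // The rule only applies to allocas and global temps.
  if (!is_alloca_or_global_temp(atomic->dest))
    return false;
  // Safe only if nothing loads the destination between this atomic and
  // the next store that overwrites it.
  return !dest_loaded_before_next_store(atomic);
}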

@xumingkuan (Collaborator, Author) commented:

How should we deal with the implicit "load" in an offloaded range_for?

kernel {
  $0 = offloaded
  body {
    <i32*x1> $1 = global ptr [S4place_i32], index [] activate=false
    <i32 x1> $2 = global load $1
    <i32*x1> $3 = global tmp var (offset = 0 B)
    <i32*x1> $4 : global store [$3 <- $2]
    <i32*x1> $5 = global ptr [S6place_i32], index [] activate=false
    <i32 x1> $6 = global load $5
    <i32*x1> $7 = global tmp var (offset = 4 B)
    <i32*x1> $8 : global store [$7 <- $6]
  }
  $9 = offloaded range_for(tmp(offset=0B), tmp(offset=4B)) block_dim=adaptive
  body {
    <i32 x1> $10 = loop $9 index 0
    <i32*x1> $11 = global ptr [S2place_i32], index [] activate=true
    <i32 x1> $12 = atomic add($11, $10)
  }
}

If we don't handle it, the global stores ($4 and $8) will be incorrectly eliminated as dead.

We need something like this:

void visit(OffloadedStmt *stmt) override {
  if (stmt->task_type == stmt->range_for) {
    TI_ASSERT(!maybe_run);
    // A range_for with non-constant bounds implicitly loads its begin/end
    // from global temporaries, so mark those offsets as loaded.
    if (!stmt->const_begin) {
      TI_ASSERT(state_machines.find(stmt->begin_offset) !=
                state_machines.end());
      state_machines[stmt->begin_offset].load();
    }
    if (!stmt->const_end) {
      TI_ASSERT(state_machines.find(stmt->end_offset) !=
                state_machines.end());
      state_machines[stmt->end_offset].load();
    }
  }
}

@yuanming-hu Do you have any ideas? I think we probably need to traverse the IR to find a global temp with the same offset as offload->begin_offset, and add a std::vector to each CFGNode denoting the loads that are not representable within the node...
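
A rough sketch of that idea (all names here are illustrative, not the real API): record the implicit loads on the node itself so the dead-store analysis can see them.

#include <vector>

struct Stmt {};  // stand-in for Taichi's Stmt

struct CFGNode {
  // Hypothetical field: loads that have no load statement inside this
  // node, e.g. the implicit reads of an offloaded range_for's begin/end
  // bounds stored in global temporaries.
  std::vector<Stmt *> implicit_loads;
};

// During CFG construction, look up (by traversing the IR) the global temp
// statements whose offsets match offload->begin_offset / end_offset, and
// register them here so stores like $4 and $8 above stay live.
void register_implicit_bound_loads(CFGNode &node, Stmt *begin_temp,
                                   Stmt *end_temp) {
  if (begin_temp)
    node.implicit_loads.push_back(begin_temp);
  if (end_temp)
    node.implicit_loads.push_back(end_temp);
}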

@xumingkuan xumingkuan added the large changeset The PR contains a large changeset and reviewing may take times label Jun 25, 2020
@xumingkuan (Collaborator, Author) commented:

Are there better ways to support offloaded range_fors without doing implicit loads...?

codecov bot commented Jun 25, 2020

Codecov Report

Merging #1324 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1324   +/-   ##
=======================================
  Coverage   85.48%   85.48%           
=======================================
  Files          19       19           
  Lines        3375     3375           
  Branches      630      630           
=======================================
  Hits         2885     2885           
  Misses        358      358           
  Partials      132      132           


@xumingkuan xumingkuan changed the title [Opt] Dead store and dead stack-operation elimination by control-flow graph [Opt] Dead store and stack-related operation elimination by control-flow graph Jun 25, 2020
@xumingkuan xumingkuan marked this pull request as ready for review June 25, 2020 20:58
@xumingkuan xumingkuan mentioned this pull request Jun 25, 2020
@yuanming-hu (Member) left a comment:

Awesome! LGTM. It's a lot of work. Thank you so much!
