Some optimizations (cont) #395

QuarticCat · 2022-09-30T04:16:04Z

This time I tried some radical optimizations.

Benchmark Approach

To make the results more accurate, I updated my benchmark approach. Here's the command:

cargo build --release &&
with-bench hyperfine --warmup=3 "$(echo ~/.cargo/target/release/difft sample_files/slow_{before,after}.rs)" &&
with-bench perf stat ~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null &&
time ~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null

where with-bench is a simple Zsh function that fixes CPU frequency and disables boost:

_bench-start() {
    sudo cpupower frequency-set -u 3.6G -d 3.6G >/dev/null
    sudo sh -c 'echo 0 > /sys/devices/system/cpu/cpufreq/boost'
    echo '>>>>> BENCH START' >&2
}
_bench-end() {
    sudo cpupower frequency-set -u 10G -d 0.1G >/dev/null
    sudo sh -c 'echo 1 > /sys/devices/system/cpu/cpufreq/boost'
    echo '>>>>> BENCH END' >&2
}
with-bench() {
    _bench-start
    trap '_bench-end' EXIT INT
    "$@"
}

Note that this time difft is invoked directly rather than through cargo run, so the speedup percentages are higher (cargo run adds a fixed extra cost to every run).

Benchmark Results

Before my first PR:

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):      1.120 s ±  0.032 s    [User: 1.080 s, System: 0.039 s]
  Range (min … max):    1.058 s …  1.169 s    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

          1,113.87 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,423      page-faults:u             #    1.278 K/sec                  
     3,844,769,039      cycles:u                  #    3.452 GHz                    
       319,285,519      stalled-cycles-frontend:u #    8.30% frontend cycles idle   
     1,172,227,367      stalled-cycles-backend:u  #   30.49% backend cycles idle    
     4,498,772,345      instructions:u            #    1.17  insn per cycle         
                                                  #    0.26  stalled cycles per insn
       887,538,102      branches:u                #  796.804 M/sec                  
        19,835,182      branch-misses:u           #    2.23% of all branches        

       1.116052704 seconds time elapsed

       1.062544000 seconds user
       0.049941000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   1.03s  user 0.05s system 99% cpu 1.077 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                549 KB
page faults from disk:     0
other page faults:         1490

Before my second PR:

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     829.9 ms ±  15.8 ms    [User: 798.4 ms, System: 30.5 ms]
  Range (min … max):   811.7 ms … 864.5 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            824.10 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,361      page-faults:u             #    1.652 K/sec                  
     2,838,969,066      cycles:u                  #    3.445 GHz                    
        49,902,292      stalled-cycles-frontend:u #    1.76% frontend cycles idle   
     1,171,930,117      stalled-cycles-backend:u  #   41.28% backend cycles idle    
     3,511,062,625      instructions:u            #    1.24  insn per cycle         
                                                  #    0.33  stalled cycles per insn
       663,773,868      branches:u                #  805.457 M/sec                  
        18,248,650      branch-misses:u           #    2.75% of all branches        

       0.824474162 seconds time elapsed

       0.787025000 seconds user
       0.036682000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.68s  user 0.04s system 99% cpu 0.726 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                435 KB
page faults from disk:     0
other page faults:         1430

Now:

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     636.9 ms ±   7.4 ms    [User: 596.3 ms, System: 38.8 ms]
  Range (min … max):   629.2 ms … 655.3 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            623.62 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,348      page-faults:u             #    2.162 K/sec                  
     2,122,551,842      cycles:u                  #    3.404 GHz                    
        38,713,265      stalled-cycles-frontend:u #    1.82% frontend cycles idle   
       776,217,583      stalled-cycles-backend:u  #   36.57% backend cycles idle    
     2,669,241,946      instructions:u            #    1.26  insn per cycle         
                                                  #    0.29  stalled cycles per insn
       510,784,798      branches:u                #  819.070 M/sec                  
        15,456,464      branch-misses:u           #    3.03% of all branches        

       0.623958630 seconds time elapsed

       0.596642000 seconds user
       0.026663000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.51s  user 0.05s system 99% cpu 0.553 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                409 KB
page faults from disk:     0
other page faults:         1417

Conclusion

Speed: 100% -> 135% -> 176% (according to hyperfine)

Memory: 100% -> 79% -> 74%
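
For reference, the percentages above can be reproduced from the raw numbers in the logs. This is a minimal sketch, assuming that speed is the baseline hyperfine mean divided by the current mean and that memory is the current max memory reading divided by the baseline reading. The function names are illustrative and not part of difftastic.

```rust
/// Speed relative to a baseline, as a whole percentage.
/// A run that takes half the baseline time scores 200.
fn relative_speed_percent(baseline_ms: f64, current_ms: f64) -> u32 {
    (baseline_ms / current_ms * 100.0).round() as u32
}

/// Peak memory relative to a baseline, as a whole percentage.
/// A run that needs half the baseline memory scores 50.
fn relative_memory_percent(baseline: f64, current: f64) -> u32 {
    (current / baseline * 100.0).round() as u32
}
```

With the hyperfine means above, relative_speed_percent(1120.0, 829.9) and relative_speed_percent(1120.0, 636.9) give the two speed figures, and relative_memory_percent(549.0, 435.0) and relative_memory_percent(549.0, 409.0) give the two memory figures.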

Caveats

~~In commit Eliminate some vec clones, the memory usage abnormally increased, which was not in line with my expectation. I haven't figured out why.~~
In commit Change a RefCell in Vertex to UnsafeCell, a lot of unsafe code is introduced, and it clearly spreads beyond the boundary it should stay within. I don't know how to design abstractions for it.
In commit Refactor seen map, I don't understand why your original code was written this way. I just faithfully converted it into a faster version.
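
One common way to keep unsafe code within a boundary is to hide the UnsafeCell behind a small type whose methods contain the only unsafe blocks. The following is a minimal sketch, not the code from this PR. It assumes the stored value is Copy, and the name CopyCell is hypothetical. For Copy payloads the standard library's Cell already offers the same API, so this only illustrates the pattern.

```rust
use std::cell::UnsafeCell;

/// Hypothetical wrapper: interior mutability without RefCell's runtime
/// borrow tracking. All unsafe code lives inside these two methods.
pub struct CopyCell<T: Copy> {
    inner: UnsafeCell<T>,
}

impl<T: Copy> CopyCell<T> {
    pub fn new(value: T) -> Self {
        Self { inner: UnsafeCell::new(value) }
    }

    /// Returns a copy of the stored value.
    pub fn get(&self) -> T {
        // SAFETY: no method returns a reference into the cell, and the
        // type is not Sync, so no other access can overlap with this read.
        unsafe { *self.inner.get() }
    }

    /// Overwrites the stored value through a shared reference.
    pub fn set(&self, value: T) {
        // SAFETY: same argument as in `get`; T is Copy, so overwriting
        // the old value runs no destructor.
        unsafe { *self.inner.get() = value }
    }
}
```

Callers then only see get and set, for example CopyCell::new(Some(3u32)) followed by set(None), and never touch a raw pointer themselves.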

QuarticCat · 2022-09-30T11:22:19Z

Refactor parents' representation.

Speed: 187%

Memory: 73%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     600.1 ms ±   3.8 ms    [User: 567.1 ms, System: 31.5 ms]
  Range (min … max):   594.9 ms … 606.2 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            582.59 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,344      page-faults:u             #    2.307 K/sec                  
     1,976,757,963      cycles:u                  #    3.393 GHz                    
        35,702,044      stalled-cycles-frontend:u #    1.81% frontend cycles idle   
       760,550,837      stalled-cycles-backend:u  #   38.47% backend cycles idle    
     2,545,504,360      instructions:u            #    1.29  insn per cycle         
                                                  #    0.30  stalled cycles per insn
       472,339,368      branches:u                #  810.760 M/sec                  
        13,293,176      branch-misses:u           #    2.81% of all branches        

       0.582992026 seconds time elapsed

       0.552362000 seconds user
       0.030102000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.50s  user 0.03s system 99% cpu 0.528 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                401 KB
page faults from disk:     0
other page faults:         1418

QuarticCat · 2022-09-30T12:10:13Z

Compress EnteredDelimiter.

Speed: 189%

Memory: 71%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     591.4 ms ±  11.1 ms    [User: 559.8 ms, System: 30.9 ms]
  Range (min … max):   582.6 ms … 618.6 ms    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            576.92 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,339      page-faults:u             #    2.321 K/sec                  
     1,960,597,413      cycles:u                  #    3.398 GHz                    
        47,556,040      stalled-cycles-frontend:u #    2.43% frontend cycles idle   
       711,186,359      stalled-cycles-backend:u  #   36.27% backend cycles idle    
     2,530,607,392      instructions:u            #    1.29  insn per cycle         
                                                  #    0.28  stalled cycles per insn
       467,326,782      branches:u                #  810.038 M/sec                  
        13,318,875      branch-misses:u           #    2.85% of all branches        

       0.577286658 seconds time elapsed

       0.529902000 seconds user
       0.046657000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.51s  user 0.02s system 99% cpu 0.530 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                391 KB
page faults from disk:     0
other page faults:         1408

QuarticCat · 2022-09-30T12:31:25Z

Reserve vec capacity.

Speed: 202%

Memory: 71%
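
As a general illustration of why reserving capacity helps (this is not the code from the commit), the sketch below counts how often a Vec changes its capacity while being filled. The function names are made up for the example.

```rust
/// Pushes `n` items and counts how often the vector's capacity changed,
/// i.e. how often it had to move to a larger buffer.
fn count_reallocations(mut items: Vec<u64>, n: usize) -> usize {
    let mut reallocations = 0;
    let mut capacity = items.capacity();
    for i in 0..n {
        items.push(i as u64);
        if items.capacity() != capacity {
            reallocations += 1;
            capacity = items.capacity();
        }
    }
    reallocations
}

/// Starts from an empty vector, so the buffer grows on demand.
fn reallocations_without_reserve(n: usize) -> usize {
    count_reallocations(Vec::new(), n)
}

/// Allocates room for all `n` items up front.
fn reallocations_with_reserve(n: usize) -> usize {
    count_reallocations(Vec::with_capacity(n), n)
}
```

With the capacity reserved up front, pushing n items never reallocates. Starting from Vec::new(), the buffer has to grow at least once.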

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     555.5 ms ±   5.2 ms    [User: 521.3 ms, System: 33.5 ms]
  Range (min … max):   549.2 ms … 564.5 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            560.13 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,339      page-faults:u             #    2.391 K/sec                  
     1,898,772,574      cycles:u                  #    3.390 GHz                    
        39,751,378      stalled-cycles-frontend:u #    2.09% frontend cycles idle   
       677,732,625      stalled-cycles-backend:u  #   35.69% backend cycles idle    
     2,385,056,239      instructions:u            #    1.26  insn per cycle         
                                                  #    0.28  stalled cycles per insn
       439,507,344      branches:u                #  784.657 M/sec                  
        13,799,727      branch-misses:u           #    3.14% of all branches        

       0.560540216 seconds time elapsed

       0.526526000 seconds user
       0.033323000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.48s  user 0.03s system 99% cpu 0.511 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                391 KB
page faults from disk:     0
other page faults:         1404

QuarticCat · 2022-09-30T16:47:54Z

Compress seen map.

Here hashbrown is introduced since it has a get_key_value_mut method.

Speed: 207%

Memory: 68%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     540.0 ms ±  10.0 ms    [User: 509.0 ms, System: 30.0 ms]
  Range (min … max):   529.4 ms … 557.1 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            520.83 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,334      page-faults:u             #    2.561 K/sec                  
     1,761,757,697      cycles:u                  #    3.383 GHz                    
        35,457,864      stalled-cycles-frontend:u #    2.01% frontend cycles idle   
       601,687,452      stalled-cycles-backend:u  #   34.15% backend cycles idle    
     2,244,942,485      instructions:u            #    1.27  insn per cycle         
                                                  #    0.27  stalled cycles per insn
       418,526,690      branches:u                #  803.569 M/sec                  
        13,789,606      branch-misses:u           #    3.29% of all branches        

       0.521212130 seconds time elapsed

       0.483899000 seconds user
       0.036709000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.43s  user 0.04s system 99% cpu 0.469 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                375 KB
page faults from disk:     0
other page faults:         1396

QuarticCat · 2022-09-30T17:13:49Z

Skip visited vertices.

Speed: 212%

Memory: 68%
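
The idea of skipping visited vertices can be sketched on a generic Dijkstra-style search. This is an illustration under assumptions, not difftastic's implementation: the adjacency-list graph, the function name, and the separate visited vector are all made up for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Dijkstra over an adjacency list where `graph[v]` holds
/// `(neighbour, edge_cost)` pairs. Returns the cheapest cost from
/// `start` to every vertex, or None for unreachable vertices.
fn shortest_costs(graph: &[Vec<(usize, u64)>], start: usize) -> Vec<Option<u64>> {
    let mut best: Vec<Option<u64>> = vec![None; graph.len()];
    let mut visited = vec![false; graph.len()];
    let mut heap = BinaryHeap::new();

    best[start] = Some(0);
    heap.push(Reverse((0u64, start)));

    while let Some(Reverse((cost, v))) = heap.pop() {
        // A vertex can sit in the heap several times. Only its first
        // (cheapest) pop needs its neighbours expanded.
        if visited[v] {
            continue;
        }
        visited[v] = true;

        for &(next, edge_cost) in &graph[v] {
            let next_cost = cost + edge_cost;
            if best[next].map_or(true, |known| next_cost < known) {
                best[next] = Some(next_cost);
                heap.push(Reverse((next_cost, next)));
            }
        }
    }
    best
}
```

The early continue is the whole optimization: stale heap entries for an already-expanded vertex are dropped without walking its neighbours again.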

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     529.0 ms ±   6.1 ms    [User: 494.6 ms, System: 33.7 ms]
  Range (min … max):   520.8 ms … 537.4 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            523.72 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,331      page-faults:u             #    2.541 K/sec                  
     1,774,187,486      cycles:u                  #    3.388 GHz                    
        44,683,168      stalled-cycles-frontend:u #    2.52% frontend cycles idle   
       588,601,309      stalled-cycles-backend:u  #   33.18% backend cycles idle    
     2,238,969,768      instructions:u            #    1.26  insn per cycle         
                                                  #    0.26  stalled cycles per insn
       419,261,132      branches:u                #  800.545 M/sec                  
        13,633,920      branch-misses:u           #    3.25% of all branches        

       0.523955753 seconds time elapsed

       0.490112000 seconds user
       0.033338000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.44s  user 0.02s system 99% cpu 0.465 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                375 KB
page faults from disk:     0
other page faults:         1398

QuarticCat · 2022-09-30T18:23:08Z

Refactor shortest path algorithm. This commit also removes some unsafe code.

Speed: 222%

Memory: 41%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     504.0 ms ±   6.5 ms    [User: 485.0 ms, System: 18.3 ms]
  Range (min … max):   498.3 ms … 519.4 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            496.49 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,257      page-faults:u             #    2.532 K/sec                  
     1,717,408,575      cycles:u                  #    3.459 GHz                    
        25,289,472      stalled-cycles-frontend:u #    1.47% frontend cycles idle   
       633,182,189      stalled-cycles-backend:u  #   36.87% backend cycles idle    
     2,244,678,474      instructions:u            #    1.31  insn per cycle         
                                                  #    0.28  stalled cycles per insn
       424,478,871      branches:u                #  854.952 M/sec                  
        13,665,200      branch-misses:u           #    3.22% of all branches        

       0.496761488 seconds time elapsed

       0.482851000 seconds user
       0.013317000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.42s  user 0.03s system 99% cpu 0.447 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                227 KB
page faults from disk:     0
other page faults:         1327

QuarticCat · 2022-09-30T18:31:35Z

Done. You can merge this PR now. If I have other optimizations I will open another PR.

The next big optimization opportunity might be parallelizing the for ... in possibly_changed loop. But it involves so many Cells that I don't have the energy to refactor it right now.

Wilfred · 2022-10-01T13:04:50Z

Wow, this is really cool. I'm not sure about the changes to SeenMap: I deliberately wanted a vec so I could easily experiment with different sizes. The rest looks good at first glance, I think the use of a visited flag on graph nodes is particularly nice.

I'm a little busy at the moment, but I will do a proper merge and review as soon as I can :)

QuarticCat · 2022-10-05T13:45:05Z

The third, and maybe the last, big performance improvement PR is ready. Given that you haven't merged this one yet, would you prefer that I add those changes to this PR?

QuarticCat · 2022-10-06T00:36:35Z

A small correction: I was using Zsh's built-in time command to measure the memory usage, and the format was set to

TIMEFMT="\
%J   %U  user %S system %P cpu %*E total
avg shared (code):         %X KB
avg unshared (data/stack): %D KB
total (sum):               %K KB
max memory:                %M KB
page faults from disk:     %F
other page faults:         %R"

According to the Zsh documentation, %M is in KB, but the values were actually in MB. Either way, that doesn't affect the percentages.

Wilfred · 2022-10-07T05:39:12Z

OK, I've cherry-picked the first three commits and I'll follow up on the rest when I can :)

If you have further awesome improvements, perhaps it would be clearer as a separate PR? I don't feel strongly though.

QuarticCat · 2022-10-12T08:37:28Z

Any update? If you have any questions, feel free to ask me. I'd be happy to explain my optimizations.

QuarticCat added 11 commits September 30, 2022 02:38

Simplify push_{lhs,rhs}_delimiter

7cd3b10

Remove field can_pop_either from Vertex

97f9a52

Reduce number of branches of Vertex::eq

9f1a0ab

Compress &Syntax and SyntaxId into a usize

d2f5e99

Simplify get_set_neighbours using Default::default

b635df4

Simplify get_set_neighbours by extracting functions

69b9214

Eliminate some vec clones

827e26d

Split predecessor to two parts

1631eb2

Change a RefCell in Vertex to UnsafeCell

9c551ab

Refactor seen map

7d36c5f

Minor fixes

2293886

QuarticCat force-pushed the master branch from 80fd488 to 2293886 Compare September 30, 2022 04:55

Refactor parents representation

2c6b706

Compress EnteredDelimiter

cb1c3e0

Simplify set_neighbours by extracting a lambda

4e450cb

QuarticCat force-pushed the master branch from 1722228 to 4e450cb Compare September 30, 2022 12:27

Reserve vec capacity

b0ab6c8

Remove transmute_copy

5043858

QuarticCat force-pushed the master branch from e5516b1 to 5043858 Compare September 30, 2022 14:34

Compress seen map

8327c99

QuarticCat added 2 commits October 1, 2022 01:14

Skip visited vertices

3612d08

Refactor shortest path algorithm

9e11a22

QuarticCat mentioned this pull request Oct 3, 2022

Use successors #398

Closed

QuarticCat mentioned this pull request Oct 7, 2022

Some optimizations (3) #401

Open

Wilfred added a commit that referenced this pull request Oct 14, 2022

Mention perf improvements from #393 and #395

6b0009c

Wilfred mentioned this pull request Nov 23, 2022

Show function context for diff hunks. #402

Closed

hugo-vrijswijk pushed a commit to hugo-vrijswijk/difftastic that referenced this pull request Jul 23, 2024

Merge pull request Wilfred#395 from tree-sitter/generation

b76db43

chore: generate and sync latest changes
