
Investigate performance regression from tokio channel upgrade #6043

Closed
jszwedko opened this issue Jan 14, 2021 · 6 comments · Fixed by #6283
Comments

@jszwedko
Member

It appears that #5868 may have introduced a performance regression. We should investigate and/or decide if it is worth the trade-off.

Specifically, I saw a performance regression of around 7% in the test-harness real_world_1_performance/default benchmark:

--------------------------------------------------------------------------------
Test Comparison
Test name: real_world_1_performance
Test configuration: default
Subject: vector
Versions: dev-43325a343d4cf0f5156ec14575d00fa2c28bb50c dev-401a9a3d7eeae4891df17eecd920b533b2ee76a1
--------------------------------------------------------------------------------
Metric          | dev-401a9a3d7eeae4891df17ee... | dev-43325a343d4cf0f5156ec14...
----------------|--------------------------------|-------------------------------
Test count      | 1                              | 1                             
Duration (avg)  | 62s                            | 62s                           
Duration (max)  | 62s                            | 62s                           
CPU sys (max)   | 3 (+20%)                       | 2.5 W                         
CPU usr (max)   | 99.5 W                         | 99.5 W                        
Load 1m (avg)   | 1.6 W                          | 1.8 (+13%)                    
Mem used (max)  | 313.1 MiB (+2%)                | 305.1 MiB W                   
Disk read (avg) | 809.7 kib/s W                  | 806 kib/s (0%)                
Disk read (sum) | 49 MiB (+0%)                   | 48.8 MiB W                    
Disk writ (sum) | 80.8 MiB (+0%)                 | 80.7 MiB W                    
Net recv (avg)  | 6.4 MiB/s W                    | 5.9 MiB/s (-7%)               
Net recv (sum)  | 398.3 MiB W                    | 366.9 MiB (-7%)               
Net send (sum)  | 257.2 MiB                      | 238.3 MiB                     
TCP estab (avg) | 599                            | 599                           
TCP syn (avg)   | 0                              | 0                             
TCP close (avg) | 0                              | 0                             
--------------------------------------------------------------------------------
W = winner
vector = dev-401a9a3d7eeae4891df17eecd920b533b2ee76a1
vector = dev-43325a343d4cf0f5156ec14575d00fa2c28bb50c

In criterion, I see a regression in some of the topology benchmarks, but an improvement in others:

group                            before                                 after
-----                            ------                                 ---
complex/complex                  1.00       2.1±0.02s        ? B/sec    1.13       2.4±0.02s        ? B/sec
interconnected/interconnected    1.23    76.6±22.69ms    24.9 MB/sec    1.00    62.5±17.22ms    30.5 MB/sec
pipe/pipe_big_lines              1.07    164.4±2.59ms   116.0 MB/sec    1.00    153.4±1.95ms   124.3 MB/sec
pipe/pipe_multiple_writers       1.00     37.9±0.21ms     2.5 MB/sec    1.04     39.3±0.28ms     2.4 MB/sec
pipe/pipe_simple                 1.00     35.7±0.05ms    26.7 MB/sec    1.00     35.8±0.19ms    26.7 MB/sec
pipe/pipe_small_lines            1.01     27.9±0.03ms   350.1 KB/sec    1.00     27.6±0.17ms   354.3 KB/sec
transforms/transforms            1.00     60.7±0.16ms    17.3 MB/sec    1.05     63.9±0.07ms    16.4 MB/sec

Here complex/complex, transforms/transforms, pipe/pipe_simple, and pipe/pipe_multiple_writers show a regression, but the others show an improvement.

@jszwedko jszwedko added the type: task Generic non-code related tasks label Jan 14, 2021
@binarylogic
Contributor

cc @ktff and @lukesteensen. I'm curious if we should work to optimize the existing code or consider a revert.

@jszwedko
Member Author

Ran the rest of the test-harness tests against this commit; there are a few that show regressions:
compare.txt

@ktff
Contributor

ktff commented Jan 15, 2021

I think it's better to optimize. Even if these channels are indeed slower, migrating from them to a different futures 0.3/async channel will be easier than migrating from what we had before, and it lets #5839 move forward.

An alternative would be to revert, but still investigate and optimize the current implementation, since we will need to transition to it at some point.

Regarding the regressions, it seems most of the slowdown comes from greater writes to disk and the resulting I/O contention, hence the usual increase in CPU sys (max). Beyond that, the high_concurrency tests have regressed, so it seems both the multiple-producer side and the single-consumer side have slowed down, or the buffer is full because the consumer is slower, which impacts the producers by increasing contention; see the sketch below.
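
As a rough illustration of that hypothesis (not code from this issue or from Vector), the sketch below runs several producers against one bounded futures-0.3 channel with a deliberately slow single consumer: once the buffer fills, every send has to wait, which is the producer-side contention described above. The producer count, event counts, sleep interval, and the ~1000 capacity are all made up for the example; it assumes only the futures crate and tokio 1.x with the macros, rt-multi-thread, and time features.

use std::time::{Duration, Instant};

use futures::{SinkExt, StreamExt};

#[tokio::main]
async fn main() {
    // Bounded channel with roughly the ~1000 capacity discussed in the issue.
    let (tx, mut rx) = futures::channel::mpsc::channel::<u64>(1000);

    // Several producers share clones of the same bounded sender.
    let mut producers = Vec::new();
    for id in 0..4u64 {
        let mut tx = tx.clone();
        producers.push(tokio::spawn(async move {
            for i in 0..100_000u64 {
                // `send` waits whenever the buffer is full; this is where
                // producer-side contention shows up.
                tx.send(id * 1_000_000 + i).await.unwrap();
            }
        }));
    }
    drop(tx); // drop our copy so the stream ends once all producers finish

    // Deliberately slow consumer: the buffer fills and producers back up.
    let start = Instant::now();
    let mut received = 0u64;
    while let Some(_event) = rx.next().await {
        received += 1;
        if received % 10_000 == 0 {
            tokio::time::sleep(Duration::from_millis(1)).await;
        }
    }

    for p in producers {
        p.await.unwrap();
    }
    println!("drained {} events in {:?}", received, start.elapsed());
}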

@binarylogic
Contributor

Sounds good; I've assigned this to you. @jszwedko has been working on our benchmarking suite, and we should use that to test for this.

@ktff
Contributor

ktff commented Jan 29, 2021

It turned out that tokio-0.2 channels are indeed slower for our use case, so I tried out futures-0.3 and tokio-1.1 channels.

tokio-1.1 channels are a bit faster than tokio-0.2 channels, but still slower than the original futures-0.1 channels.

futures-0.3 channels are faster than futures-0.1 channels in a few tests, and on par in the rest:

  • interconnected/interconnected is better by 6%
  • pipe/pipe_big_lines is better by 11%

That's on top of fixing this regression.

So let's go with futures-0.3 channels.
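
For anyone reproducing the comparison, a rough head-to-head might look like the sketch below. This is not the benchmark harness used above; the 1,000,000-event count and 1000 capacity are assumptions, and it only measures a single-producer/single-consumer pass through a bounded futures-0.3 channel versus a bounded tokio-1.x channel. It needs the same futures and tokio (macros, rt-multi-thread) dependencies as the earlier sketch, and its numbers will differ from the topology benchmarks, which exercise multiple producers and full pipeline stages.

use std::time::{Duration, Instant};

use futures::{SinkExt, StreamExt};

const EVENTS: u64 = 1_000_000;
const CAPACITY: usize = 1000;

async fn bench_futures_mpsc() -> Duration {
    let (mut tx, mut rx) = futures::channel::mpsc::channel::<u64>(CAPACITY);
    let start = Instant::now();
    let producer = tokio::spawn(async move {
        for i in 0..EVENTS {
            // `SinkExt::send` waits whenever the bounded buffer is full.
            tx.send(i).await.unwrap();
        }
    });
    // Drain until the producer finishes and drops its sender.
    while rx.next().await.is_some() {}
    producer.await.unwrap();
    start.elapsed()
}

async fn bench_tokio_mpsc() -> Duration {
    let (tx, mut rx) = tokio::sync::mpsc::channel::<u64>(CAPACITY);
    let start = Instant::now();
    let producer = tokio::spawn(async move {
        for i in 0..EVENTS {
            tx.send(i).await.unwrap();
        }
    });
    while rx.recv().await.is_some() {}
    producer.await.unwrap();
    start.elapsed()
}

#[tokio::main]
async fn main() {
    println!("futures-0.3 mpsc: {:?}", bench_futures_mpsc().await);
    println!("tokio-1.x   mpsc: {:?}", bench_tokio_mpsc().await);
}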

Notes

  • In theory, tokio channels should be a bit faster since they batch allocations, but I'm guessing the futures channel approach better suits our use case.
  • A once-allocated mpsc ring buffer should be the best approach for our use case of a bounded channel with ~1000 capacity, but it's possible that futures channels would still be faster if the allocator and our use case have some kind of synergy.
  • Of the other alternatives, flume looked promising, but it uses a Mutex underneath on Windows and a SpinLock on Unix, so I don't think it's a good fit for us.
  • Didn't have luck running our test-harness benches; I'll open an issue there.

@binarylogic
Contributor

cc @LucioFranco since he once said nothing was faster than Tokio 😁
