
Investigate performance regression from tokio channel upgrade #6043

Closed
jszwedko opened this issue Jan 14, 2021 · 6 comments · Fixed by #6283
Comments

@jszwedko
Member

It appears that #5868 may have introduced a performance regression. We should investigate and/or decide if it is worth the trade-off.

Specifically, I saw a performance regression of around 7% in the test-harness real_world_1_performance/default benchmark:

--------------------------------------------------------------------------------
Test Comparison
Test name: real_world_1_performance
Test configuration: default
Subject: vector
Versions: dev-43325a343d4cf0f5156ec14575d00fa2c28bb50c dev-401a9a3d7eeae4891df17eecd920b533b2ee76a1
--------------------------------------------------------------------------------
Metric          | dev-401a9a3d7eeae4891df17ee... | dev-43325a343d4cf0f5156ec14...
----------------|--------------------------------|-------------------------------
Test count      | 1                              | 1                             
Duration (avg)  | 62s                            | 62s                           
Duration (max)  | 62s                            | 62s                           
CPU sys (max)   | 3 (+20%)                       | 2.5 W                         
CPU usr (max)   | 99.5 W                         | 99.5 W                        
Load 1m (avg)   | 1.6 W                          | 1.8 (+13%)                    
Mem used (max)  | 313.1 MiB (+2%)                | 305.1 MiB W                   
Disk read (avg) | 809.7 kib/s W                  | 806 kib/s (0%)                
Disk read (sum) | 49 MiB (+0%)                   | 48.8 MiB W                    
Disk writ (sum) | 80.8 MiB (+0%)                 | 80.7 MiB W                    
Net recv (avg)  | 6.4 MiB/s W                    | 5.9 MiB/s (-7%)               
Net recv (sum)  | 398.3 MiB W                    | 366.9 MiB (-7%)               
Net send (sum)  | 257.2 MiB                      | 238.3 MiB                     
TCP estab (avg) | 599                            | 599                           
TCP syn (avg)   | 0                              | 0                             
TCP close (avg) | 0                              | 0                             
--------------------------------------------------------------------------------
W = winner
vector = dev-401a9a3d7eeae4891df17eecd920b533b2ee76a1
vector = dev-43325a343d4cf0f5156ec14575d00fa2c28bb50c

In criterion, I see a regression in some of the topology benchmarks, but an improvement in others:

group                            before                                 after
-----                            ------                                 ---
complex/complex                  1.00       2.1±0.02s        ? B/sec    1.13       2.4±0.02s        ? B/sec
interconnected/interconnected    1.23    76.6±22.69ms    24.9 MB/sec    1.00    62.5±17.22ms    30.5 MB/sec
pipe/pipe_big_lines              1.07    164.4±2.59ms   116.0 MB/sec    1.00    153.4±1.95ms   124.3 MB/sec
pipe/pipe_multiple_writers       1.00     37.9±0.21ms     2.5 MB/sec    1.04     39.3±0.28ms     2.4 MB/sec
pipe/pipe_simple                 1.00     35.7±0.05ms    26.7 MB/sec    1.00     35.8±0.19ms    26.7 MB/sec
pipe/pipe_small_lines            1.01     27.9±0.03ms   350.1 KB/sec    1.00     27.6±0.17ms   354.3 KB/sec
transforms/transforms            1.00     60.7±0.16ms    17.3 MB/sec    1.05     63.9±0.07ms    16.4 MB/sec

Here complex/complex, transforms/transforms, pipe/pipe_simple, and pipe/pipe_multiple_writers show a regression, but the others show an improvement.

@jszwedko jszwedko added the type: task Generic non-code related tasks label Jan 14, 2021
@binarylogic
Contributor

cc @ktff and @lukesteensen. I'm curious if we should work to optimize the existing code or consider a revert.

@jszwedko
Member Author

Ran the rest of the test-harness tests against this commit; there are a few that show regressions:
compare.txt

@ktff
Contributor

ktff commented Jan 15, 2021

I think it's better to optimize. Even if these channels are indeed slower, migrating from them to a different futures 0.3/async channel will be easier than migrating from what we had before, and it lets #5839 move forward.

An alternative would be to revert, but still investigate and optimize the current implementation, since we will need to transition to it at some point.

Regarding the regressions, it seems most of the slowdown comes from greater writes to disk and the resulting I/O contention, hence the usual increase in CPU sys (max). Beyond that, the high_concurrency tests have regressed, so it seems both the multiple-producer side and the single-consumer side have slowed down, or the buffer is full because the consumer is slower, which impacts the producers by increasing contention; see the sketch below.
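
As a rough illustration of that hypothesis (not code from this issue or from Vector), the sketch below runs several producers against one bounded futures-0.3 channel with a deliberately slow single consumer: once the buffer fills, every send has to wait, which is the producer-side contention described above. The producer count, event counts, sleep interval, and the ~1000 capacity are all made up for the example; it assumes only the futures crate and tokio 1.x with the macros, rt-multi-thread, and time features.

use std::time::{Duration, Instant};

use futures::{SinkExt, StreamExt};

#[tokio::main]
async fn main() {
    // Bounded channel with roughly the ~1000 capacity discussed in the issue.
    let (tx, mut rx) = futures::channel::mpsc::channel::<u64>(1000);

    // Several producers share clones of the same bounded sender.
    let mut producers = Vec::new();
    for id in 0..4u64 {
        let mut tx = tx.clone();
        producers.push(tokio::spawn(async move {
            for i in 0..100_000u64 {
                // `send` waits whenever the buffer is full; this is where
                // producer-side contention shows up.
                tx.send(id * 1_000_000 + i).await.unwrap();
            }
        }));
    }
    drop(tx); // drop our copy so the stream ends once all producers finish

    // Deliberately slow consumer: the buffer fills and producers back up.
    let start = Instant::now();
    let mut received = 0u64;
    while let Some(_event) = rx.next().await {
        received += 1;
        if received % 10_000 == 0 {
            tokio::time::sleep(Duration::from_millis(1)).await;
        }
    }

    for p in producers {
        p.await.unwrap();
    }
    println!("drained {} events in {:?}", received, start.elapsed());
}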

@binarylogic
Contributor

Sounds good; I've assigned this to you. @jszwedko has been working on our benchmarking suite, and we should use that to test for this.

@ktff
Contributor

ktff commented Jan 29, 2021

It turned out that tokio-0.2 channels are indeed slower for our use case, so I tried out futures-0.3 and tokio-1.1 channels.

tokio-1.1 channels are a bit faster than tokio-0.2 channels, but still slower than the original futures-0.1 channels.

futures-0.3 channels are faster than futures-0.1 channels in a few tests, and on par in the rest:

  • interconnected/interconnected is better by 6%
  • pipe/pipe_big_lines is better by 11%

That's on top of fixing this regression.

So let's go with futures-0.3 channels.
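
For anyone reproducing the comparison, a rough head-to-head might look like the sketch below. This is not the benchmark harness used above; the 1,000,000-event count and 1000 capacity are assumptions, and it only measures a single-producer/single-consumer pass through a bounded futures-0.3 channel versus a bounded tokio-1.x channel. It needs the same futures and tokio (macros, rt-multi-thread) dependencies as the earlier sketch, and its numbers will differ from the topology benchmarks, which exercise multiple producers and full pipeline stages.

use std::time::{Duration, Instant};

use futures::{SinkExt, StreamExt};

const EVENTS: u64 = 1_000_000;
const CAPACITY: usize = 1000;

async fn bench_futures_mpsc() -> Duration {
    let (mut tx, mut rx) = futures::channel::mpsc::channel::<u64>(CAPACITY);
    let start = Instant::now();
    let producer = tokio::spawn(async move {
        for i in 0..EVENTS {
            // `SinkExt::send` waits whenever the bounded buffer is full.
            tx.send(i).await.unwrap();
        }
    });
    // Drain until the producer finishes and drops its sender.
    while rx.next().await.is_some() {}
    producer.await.unwrap();
    start.elapsed()
}

async fn bench_tokio_mpsc() -> Duration {
    let (tx, mut rx) = tokio::sync::mpsc::channel::<u64>(CAPACITY);
    let start = Instant::now();
    let producer = tokio::spawn(async move {
        for i in 0..EVENTS {
            tx.send(i).await.unwrap();
        }
    });
    while rx.recv().await.is_some() {}
    producer.await.unwrap();
    start.elapsed()
}

#[tokio::main]
async fn main() {
    println!("futures-0.3 mpsc: {:?}", bench_futures_mpsc().await);
    println!("tokio-1.x   mpsc: {:?}", bench_tokio_mpsc().await);
}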

Notes

  • In theory, tokio channels should be a bit faster since they batch allocations, but I'm guessing the futures channel approach better suits our use case.
  • A once-allocated mpsc ring buffer should be the best approach for our use case of a bounded channel with ~1000 capacity, but it's possible that futures channels would still be faster if the allocator and our use case have some kind of synergy.
  • Of the other alternatives, flume looked promising, but it uses a Mutex underneath on Windows and a SpinLock on Unix, so I don't think it's a good fit for us.
  • Didn't have luck running our test-harness benches; I'll open an issue there.

@binarylogic
Contributor

cc @LucioFranco since he once said nothing was faster than Tokio 😁
