
Use multiple single threaded runtimes #23

Open · programatik29 opened this issue Nov 11, 2021 · 9 comments

@programatik29 (Contributor)

With recent updates to tokio, the CPU can't be utilized to 100% when connections are driven via tokio::spawn with hyper. rewrk's performance can be increased by using multiple single-threaded runtimes instead.

I think this is a good approach because all connection tasks spawned by rewrk are identical, so the benefits of work stealing are minimal.
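For context, the idea is to spawn one OS thread per core and give each thread its own current-thread tokio runtime, rather than running everything on a single multi-threaded, work-stealing runtime. Below is a minimal sketch of that shape; it is not the code from the linked branch, and the worker body is a placeholder where rewrk would drive its share of the connections:

```rust
use std::thread;

fn main() {
    // One OS thread per core, each driving its own single-threaded runtime,
    // so no work-stealing scheduler is involved.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    let handles: Vec<_> = (0..workers)
        .map(|id| {
            thread::spawn(move || {
                let rt = tokio::runtime::Builder::new_current_thread()
                    .enable_all()
                    .build()
                    .expect("failed to build current-thread runtime");

                rt.block_on(async move {
                    // Placeholder: in rewrk this is where this worker's share of
                    // the benchmark connections would be opened and driven.
                    println!("worker {id}: running on its own runtime");
                });
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```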

@programatik29 (Contributor, Author)

This branch is using multiple single-threaded runtimes.

Results on my computer:

Multiple single threaded runtimes:

```
Beginning round 1...
Benchmarking 500 connections @ http://localhost:3000 for 10 second(s)
  Latencies:
    Avg      Stdev    Min      Max
    1.71ms   1.73ms   0.03ms   68.96ms
  Requests:
    Total: 2857417 Req/Sec: 286210.95
  Transfer:
    Total: 313.38 MB Transfer Rate: 31.39 MB/Sec
```

Current approach, regular tokio::spawn:

```
Beginning round 1...
Benchmarking 500 connections @ http://localhost:3000 for 10 second(s)
  Latencies:
    Avg      Stdev    Min      Max
    2.21ms   1.25ms   0.04ms   51.12ms
  Requests:
    Total: 2247341 Req/Sec: 225330.90
  Transfer:
    Total: 246.47 MB Transfer Rate: 24.71 MB/Sec
```

@jschwe commented Dec 14, 2022

When using tokio with our proposed BWoS work-stealing queue, I measured the following improvements in hyper's throughput by changing rewrk's tokio queue backend. The following throughput measurements only modified rewrk and left hyper untouched (using rust-web-benchmarks).

500 connections

Rewrk 0.3.2 with BWoS queue:

| Framework Name | Latency.Avg | Latency.Stdev | Latency.Min | Latency.Max | Request.Total | Request.Req/Sec | Transfer.Total | Transfer.Rate | Max. Memory Usage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hyper | 0.89ms | 0.40ms | 0.03ms | 212.56ms | 16401497 | 546698.77 | 1.36GB | 46.40MB/Sec | 16.5MB |

Rewrk 0.3.2 with original tokio (1.22):

| Framework Name | Latency.Avg | Latency.Stdev | Latency.Min | Latency.Max | Request.Total | Request.Req/Sec | Transfer.Total | Transfer.Rate | Max. Memory Usage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hyper | 1.04ms | 0.44ms | 0.03ms | 9.12ms | 14002501 | 466735.26 | 1.16GB | 39.62MB/Sec | 14.6MB |

1000 connections

Rewrk 0.3.2 with BWoS queue:

| Framework Name | Latency.Avg | Latency.Stdev | Latency.Min | Latency.Max | Request.Total | Request.Req/Sec | Transfer.Total | Transfer.Rate | Max. Memory Usage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hyper | 1.18ms | 0.69ms | 0.03ms | 20.85ms | 24647183 | 821524.47 | 2.04GB | 69.73MB/Sec | 26.5MB |

Rewrk 0.3.2 with original tokio (1.22):

| Framework Name | Latency.Avg | Latency.Stdev | Latency.Min | Latency.Max | Request.Total | Request.Req/Sec | Transfer.Total | Transfer.Rate | Max. Memory Usage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hyper | 1.28ms | 0.53ms | 0.05ms | 9.23ms | 23019349 | 767279.90 | 1.91GB | 65.12MB/Sec | 26.5MB |

The throughput increase (if only rewrk is changed) is not as large as what you measured, but I still thought this might be interesting.

@programatik29 (Contributor, Author)

@jschwe Interesting... How does that compare to rewrk in this pull request?

@jschwe commented Dec 14, 2022

Ah, I did not see there was a pull request related to this issue.

I'll try this out tomorrow.

Edit: Initial results do show that using multiple single-threaded runtimes still offers significant performance improvements. This is probably related to parking/unparking overhead. I'll post some more details tomorrow.

@ChillFish8 (Member)

It would be interesting to see. If the performance is close enough, I'm somewhat tempted to go with the approach you mentioned over the single-threaded runtimes, just for convenience.

@jschwe commented Dec 16, 2022

I compared the single-threaded approach with the original strategy and different BWoS strategies on an x86 machine with 2 NUMA nodes. rewrk was bound to NUMA node 1 (44 cores including hyperthreads) and the hyper benchmark was bound to NUMA node 0. This is the benchmarking script I used.

It seems that the single-threaded runtimes do offer clear advantages, especially with fewer connections. This could be due to overhead from parking/unparking cores when there is not enough work, but I haven't investigated this further.

500 connections

| Hyper tokio stealing strategy | rewrk strategy | rewrk throughput (MB/s) |
| --- | --- | --- |
| original | single_thread | 65.65 |
| original | original | 37.29 |
| original | bwos_steal_half | 44.96 |
| original | bwos_steal_block | 45.78 |
| original | bwos_steal_1 | 37.02 |
| bwos_steal_half | single_thread | 91.87 |
| bwos_steal_block | single_thread | 89.18 |
| bwos_steal_block | bwos_steal_block | 71.69 |

750 connections

| Hyper tokio stealing strategy | rewrk strategy | rewrk throughput (MB/s) |
| --- | --- | --- |
| original | single_thread | 71.81 |
| original | original | 52.08 |
| original | bwos_steal_half | 53.9 |
| original | bwos_steal_block | 55.41 |
| original | bwos_steal_1 | 46.66 |
| bwos_steal_half | single_thread | 82.23 |
| bwos_steal_block | single_thread | 83.96 |
| bwos_steal_block | bwos_steal_block | 77.51 |

1000 connections

| Hyper tokio stealing strategy | rewrk strategy | rewrk throughput (MB/s) |
| --- | --- | --- |
| original | single_thread | 75.4 |
| original | original | 64.33 |
| original | bwos_steal_half | 63.21 |
| original | bwos_steal_block | 63.36 |
| original | bwos_steal_1 | 56.37 |
| bwos_steal_half | single_thread | 81.54 |
| bwos_steal_block | single_thread | 82.48 |
| bwos_steal_block | bwos_steal_block | 79.37 |

@programatik29 (Contributor, Author)

@jschwe So is it free performance if this design gets merged into tokio?

@jschwe commented Dec 18, 2022

@programatik29 Our current proposal is for our queue to be integrated as an alternative backend, which could be selected via a runtime flavor. So this is not going to be a drop-in change for now, but would require the downstream user to select a different flavor to get the benefits of the new queue. That's a very minor change for downstream users, though.

We do think our queue should be better in basically all scenarios, but there are situations where the queue is not the bottleneck, so switching it out wouldn't change much.
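To make "select a different flavor" concrete: downstream users already pick a scheduler through tokio's runtime Builder, and under this proposal the BWoS-backed scheduler would be one more option chosen the same way. A minimal sketch of today's flavor selection (the BWoS flavor itself is only hinted at in a comment, since its exact API is not specified in this thread):

```rust
use tokio::runtime::Builder;

fn main() {
    // The default multi-threaded, work-stealing flavor.
    let multi = Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("failed to build multi-thread runtime");

    // The single-threaded (current-thread) flavor.
    let single = Builder::new_current_thread()
        .enable_all()
        .build()
        .expect("failed to build current-thread runtime");

    // Hypothetically, a BWoS-backed scheduler would be selected here as another
    // Builder constructor or option; the exact API is not decided in this thread.

    multi.block_on(async { println!("multi-threaded flavor") });
    single.block_on(async { println!("current-thread flavor") });
}
```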

@ChillFish8 (Member)

This behaviour will be implemented in rewrk-core; moving the CLI tool over to it will close this issue.
