
Replace tokio-timer with futures-timer for higher timer resolution #5

Merged · 1 commit · Nov 18, 2019

Conversation

@doyoubi
Contributor

doyoubi commented Oct 23, 2019

The current timer resolution is too low for some use cases: #4

This PR replaces tokio-timer with futures-timer for higher resolution.

I still need to compare their implementations to answer the following questions:

  • (1) Why doesn't tokio-timer support a higher-resolution timer?
  • (2) Are there any problems with using futures-timer for higher resolution than tokio-timer?
  • (3) What's the resolution of futures-timer?

Cargo.toml Outdated
@@ -9,7 +9,7 @@ edition = "2018"
 [dependencies]
 tokio = "0.1.14"
 futures = "0.1.25"
-tokio-timer = "0.2.8"
+futures-timer = "0.1"
Contributor Author

Any higher version requires changing from futures::Future to std::future::Future.
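For anyone following along, here is a simplified sketch of the two incompatible trait shapes in question (not the verbatim futures 0.1 or standard-library definitions, and the module names are just for illustration): tokio 0.1 and futures 0.1.25 build on the first, while newer futures-timer releases target the second.

#![allow(dead_code)]

// futures 0.1 shape: no Context/Waker argument, separate Item and Error types.
mod futures01_shape {
    pub enum Async<T> {
        Ready(T),
        NotReady,
    }
    pub type Poll<T, E> = Result<Async<T>, E>;

    pub trait Future {
        type Item;
        type Error;
        fn poll(&mut self) -> Poll<Self::Item, Self::Error>;
    }
}

// std::future shape: pinned self, a Context carrying the Waker, a single Output type.
mod std_future_shape {
    use std::pin::Pin;
    use std::task::{Context, Poll};

    pub trait Future {
        type Output;
        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
    }
}

fn main() {}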

@doyoubi
Contributor Author

doyoubi commented Oct 25, 2019

(1) Why doesn't tokio-timer support a higher-resolution timer?
https://tokio.rs/blog/2018-03-timers/

(2) Are there any problems with using futures-timer for higher resolution than tokio-timer?
For waiting, futures-timer uses thread::park_timeout while tokio-timer uses Park::park_timeout, and neither gives any guarantee on resolution. In practice, though, futures-timer is likely to have higher resolution.

(3) What's the resolution of futures-timer?
It should depend on the underlying OS. In my Linux environment, it has a granularity of about 10 microseconds.
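As a rough illustration of how one might sanity-check that granularity, here is a standalone probe of std::thread::park_timeout, the primitive mentioned above. It says nothing about futures-timer's own bookkeeping overhead and only bounds the best case of the underlying wait; the numbers will vary by OS and load.

use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let requested = Duration::from_micros(20);
    let iterations = 1_000u32;
    let mut total = Duration::new(0, 0);

    for _ in 0..iterations {
        let start = Instant::now();
        // park_timeout gives no resolution guarantee and may return early on
        // a spurious wakeup; a rough average is all we want here.
        thread::park_timeout(requested);
        total += start.elapsed();
    }

    println!("requested {:?}, observed average {:?}", requested, total / iterations);
}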

For my use case of batching syscalls, just replacing the timer with futures-timer does not work.
When the duration is about 20 microseconds, the Delay triggers lots of condition variable calls that ruin the performance.
My solution is to use two timers with different timeout parameters, which may be too specific to my use case.

@mre
Owner

mre commented Oct 25, 2019

Thanks for the summary.

When the duration is about 20 microseconds, the Delay triggers lots of condition variable calls that ruin the performance.

I'd be interested in a more detailed explanation of this.
Did you benchmark that? What are those condition variable calls exactly? Is there a code snippet I could look at? Are the maintainers aware, or shall we create an issue to discuss possible options on their side?

My solution is to use two timers with different timeout parameters, which may be too specific to my use case.

Glad you found a solution. If you'd like to continue working on your WIP, you'd be more than welcome to do so. A higher-resolution timer might be beneficial in many other cases.

@doyoubi
Contributor Author

doyoubi commented Oct 28, 2019

Later I will post more details on the performance problem of the high-resolution timer with a low timeout.

Did you benchmark that?
What are those condition variable calls exactly?

I used the modified tokio-batch in my Redis proxy, which is a layer 7 network protocol proxy, and benchmarked it. In the sampled flamegraph I can see a bunch of condition variable calls that probably come from the tokio runtime's machinery for waking up tasks.

Is there a code snippet I could look at?

I will write some sample code to reproduce it.

Are the maintainers aware, or shall we create an issue to discuss possible options on their side?

This may be a common problem for any coroutine runtime. We ran into it before in the Go runtime as well. A low timeout value triggers lots of wake-up operations, resulting in a bunch of pthread_cond_wait or pthread_cond_timedwait calls inside the runtime. I think it's more of a usage problem on the user's side.

@mre
Owner

mre commented Oct 28, 2019

You've been running the benchmarks in release mode, right? (Just to be sure. 😉)

Might be wrong, but I think futures-timer uses parking_lot under the hood, right?
Think this is the pthread_cond_wait you're talking about.
There's also a discussion about Futures support here.

Thanks for your investigation. Would be really curious about your writeup. You could also raise this on the user forum or the subreddit if you like.

@doyoubi
Contributor Author

doyoubi commented Oct 28, 2019

You've been running the benchmarks in release mode, right? (Just to be sure. 😉)

Yes, it's in release mode.

Might be wrong, but I think futures-timer uses parking_lot under the hood, right?

Seems like in 0.1 it does not depend on the parking_lot crate: Cargo.toml of 0.1.

@doyoubi
Contributor Author

doyoubi commented Nov 2, 2019

  • Remove tokio-timer so we don't need futures01 anymore.
  • Add an additional light-weight passive timer to optimize the case where the timeout duration needs to be very small (a rough sketch of the idea follows after the TODO list).

TODO:

  • More explanation
  • Benchmark
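As mentioned in the second bullet above, here is a rough sketch of what I mean by a light-weight passive timer. The names and structure are illustrative only, not the actual code in this branch: the idea is to just record a deadline and check it whenever the stream gets polled anyway because new items keep arriving, so no waker registration and no condition variable wakeups are needed; the futures-timer Delay then only serves as a backstop for idle periods.

use std::time::{Duration, Instant};

// Illustrative only: not the actual implementation in this branch.
struct PassiveTimer {
    deadline: Instant,
}

impl PassiveTimer {
    fn start(timeout: Duration) -> Self {
        PassiveTimer {
            deadline: Instant::now() + timeout,
        }
    }

    // True once the deadline has passed. No waker is registered, so this
    // only fires when something else (e.g. an incoming item) polls the stream.
    fn expired(&self) -> bool {
        Instant::now() >= self.deadline
    }
}

fn main() {
    let timer = PassiveTimer::start(Duration::from_micros(20));
    std::thread::sleep(Duration::from_micros(50));
    assert!(timer.expired());
    println!("deadline reached without any timer-thread wakeup");
}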

@doyoubi
Contributor Author

doyoubi commented Nov 9, 2019

It seems like in the new tokio runtime, the performance problem of pthread_cond_wait no longer exists!
All we need to do here is replace tokio-timer with futures-timer.

Here's an example: tokio-batch-benchmark.
I wrote a simple Redis server with my customized tokio-batch, so that we can easily use redis-benchmark to measure the result.

Build the example

$ git clone https://github.com/doyoubi/tokio-batch-benchmark.git
$ cd tokio-batch-benchmark
$ make

Benchmark without batching

# Run the server without batching
$ make nobatch

# In another terminal
$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 32

# Result
PING_INLINE: 230771.00 requests per second
PING_BULK: 232780.09 requests per second

Benchmark with just one timer driven by futures-timer

# Run the server with just a timer
$ make max

# In another terminal
$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 32

# Result
PING_INLINE: 1338329.75 requests per second
PING_BULK: 1335470.12 requests per second

Benchmark with two timers, a light-weight timer and a timer driven by futures-timer

# Run the server with the light-weight timer
$ make minmax

# In another terminal
$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 32

# Result
PING_INLINE: 1498351.88 requests per second
PING_BULK: 1368176.25 requests per second

Adding the light-weight timer makes no significant difference.

By the way:

  • I've found a performance problem in SplitSink of futures-rs which disables batching: issue. This example uses a customized SplitSink to work around it. (UPDATE: it has been fixed upstream, but not released yet.)
  • It's even about twice as fast as Redis on my Mac.

doyoubi changed the title from "[WIP] Replace tokio-timer with futures-timer for higher timer resolution" to "Replace tokio-timer with futures-timer for higher timer resolution" on Nov 9, 2019
@mre
Owner

mre commented Nov 9, 2019

Wow, that's great news! Looking forward to reviewing and merging this. I'm at RustFest at the moment, so it'll take some time until I get to it, but I'll try my best to get it in.

@mre
Owner

mre commented Nov 17, 2019

For comparison, here are my numbers:

# make nobatch
PING_INLINE: 192927.28 requests per second
PING_BULK: 193031.56 requests per second

# make max
PING_INLINE: 213867.14 requests per second
PING_BULK: 216104.08 requests per second

# make minmax
PING_INLINE: 249731.55 requests per second
PING_BULK: 250237.72 requests per second

(Macbook Pro Mid 2015, 2.2 GHz Intel Core i7, 16 GB RAM)

@doyoubi
Contributor Author

doyoubi commented Nov 17, 2019

How do you run redis-benchmark? It needs -P <pipeline number larger than 1> to enable pipelining:

$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 3

Have you pulled the latest code of tokio-batch-benchmark?

Try running this to check whether there is any error message:

$ redis-benchmark -e -n 10000000 -t ping -c 32 -P 3

I will also check it on other computers.

@mre
Owner

mre commented Nov 17, 2019

Yup, just like you described:

redis-benchmark -q -n 10000000 -t ping -c 32 -P 3

(In all three cases.)

@doyoubi
Contributor Author

doyoubi commented Nov 17, 2019

Oh, my fault, it's -P 32:

$ redis-benchmark -q -n 10000000 -t ping -c 32 -P 32

It needs higher throughput to show the difference.
I accidentally dropped the trailing 2 from those two -P 32 options when I pasted them.

@mre
Owner

mre commented Nov 17, 2019

Ah, right. That makes more sense now, although I'm still confused about nobatch.

# make nobatch
PING_INLINE: 250871.78 requests per second
PING_BULK: 246761.25 requests per second

# make max
PING_INLINE: 1704448.62 requests per second
PING_BULK: 1707941.88 requests per second

# make minmax
PING_INLINE: 1661405.50 requests per second
PING_BULK: 1570598.38 requests per second

@doyoubi
Contributor Author

doyoubi commented Nov 17, 2019

  • nobatch means not using tokio-batch at all.
  • max uses the same implementation as this pull request.
  • minmax adds another light-weight timer to reduce the number of pthread_cond_wait calls and try to optimize further, but the result is almost the same as max.

@mre
Owner

mre commented Nov 17, 2019

But why is nobatch so much slower than the others on my machine? Shouldn't it be faster or at least equal?

@doyoubi
Contributor Author

doyoubi commented Nov 18, 2019

Well, if we set the batching timeout to a large number, it will be slower. But here the timeout is 20 microseconds, which can even improve throughput.

Let's ignore the light-weight timer case first. Using tokio-batch around the sendto syscall (the max case) should achieve higher throughput than not using tokio-batch at all (the nobatch case).

The key point of batching syscalls is that sending more data in a single syscall does not significantly increase its latency: sending 16 bytes takes almost the same time as sending 8 bytes. So the more data we pack into every sendto syscall, the higher the throughput.
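To make that concrete, here is a rough, self-contained sketch of the batching idea, independent of tokio-batch's actual code; the struct and field names are made up for illustration. Frames accumulate in one buffer, and a single write (one sendto when the sink is a TcpStream) is issued once the batch is large enough or a ~20 microsecond deadline has passed.

use std::io::{self, Write};
use std::time::{Duration, Instant};

// Illustrative batching writer over any Write sink.
struct BatchedWriter<W: Write> {
    sink: W,
    buf: Vec<u8>,
    max_bytes: usize,
    max_wait: Duration,
    first_item_at: Option<Instant>,
}

impl<W: Write> BatchedWriter<W> {
    fn new(sink: W, max_bytes: usize, max_wait: Duration) -> Self {
        BatchedWriter { sink, buf: Vec::new(), max_bytes, max_wait, first_item_at: None }
    }

    fn push(&mut self, frame: &[u8]) -> io::Result<()> {
        if self.buf.is_empty() {
            self.first_item_at = Some(Instant::now());
        }
        self.buf.extend_from_slice(frame);

        let deadline_hit = self
            .first_item_at
            .map_or(false, |t| t.elapsed() >= self.max_wait);
        if self.buf.len() >= self.max_bytes || deadline_hit {
            self.flush_batch()?;
        }
        Ok(())
    }

    // One write call carries the whole batch; per-call latency barely grows
    // with size, so throughput scales with how much data we pack into each call.
    fn flush_batch(&mut self) -> io::Result<()> {
        if !self.buf.is_empty() {
            self.sink.write_all(&self.buf)?;
            self.buf.clear();
            self.first_item_at = None;
        }
        Ok(())
    }
}

fn main() -> io::Result<()> {
    // Demo against an in-memory sink; swap in a TcpStream and each flush
    // becomes a single sendto syscall.
    let mut writer = BatchedWriter::new(Vec::new(), 16 * 1024, Duration::from_micros(20));
    for _ in 0..32 {
        writer.push(b"+PONG\r\n")?;
    }
    writer.flush_batch()?;
    println!("sink now holds {} bytes", writer.sink.len());
    Ok(())
}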

On OSX, we can use dtrace to collect some statistics on the syscall behavior.

Nobatch case

We can collect the number of syscalls and the data sent by sendto while benchmarking:

# sudo dtrace -n 'syscall::sendto:entry { @num["count"] = count(); @bytes["bytes"] = sum(arg2); } tick-10s { exit(0); }' -p <pid of redis-ping-pong-server in tokio-batch-benchmark>

dtrace: description 'syscall::sendto:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  18373                        :tick-10s

  count                                                       1454801
  bytes                                                      10184009

Within 10 seconds, it triggered 1454801 sendto calls and sent 10184009 bytes in total, which is about 7 bytes per syscall on average.

We can also collect the syscall latency:

# sudo dtrace -n 'syscall::sendto:entry { self->ts = timestamp; } syscall::sendto:return /self->ts/ { @ns["latency"] = quantize(timestamp - self->ts); self->ts = 0; } tick-10s { exit(0); }' -p <pid of redis-ping-pong-server in tokio-batch-benchmark>

CPU     ID                    FUNCTION:NAME
  2  18373                        :tick-10s

  latency
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@                         33097
            4096 |@@@@@@@@@@@@@@@@@@@@@@@                  48365
            8192 |@                                        2284
           16384 |                                         731
           32768 |                                         221
           65536 |                                         42
          131072 |                                         13
          262144 |                                         12
          524288 |                                         5
         1048576 |                                         2
         2097152 |                                         1
         4194304 |                                         4
         8388608 |                                         0

Most sendto syscalls take around 2048 to 4096 nanoseconds inside the kernel.

Batching with timers

Now collect the same data for the case using tokio-batch.

# sudo dtrace -n 'syscall::sendto:entry { @num["count"] = count(); @bytes["bytes"] = sum(arg2); } tick-10s { exit(0); }' -p <pid of redis-ping-pong-server in tokio-batch-benchmark>

dtrace: description 'syscall::sendto:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  18373                        :tick-10s

  count                                                        341701
  bytes                                                      76533848

Within 10 seconds, we have fewer syscalls, 341701 compared to the previous 1454801, but more throughput, 76533848 bytes compared to 10184009. Since the Redis PING packet has a fixed length, this matches the roughly 7x QPS improvement we saw in redis-benchmark. On average, we now have 223.9 bytes per syscall.

And the latency of sendto didn't increase a lot.

# sudo dtrace -n 'syscall::sendto:entry { self->ts = timestamp; } syscall::sendto:return /self->ts/ { @ns["latency"] = quantize(timestamp - self->ts); self->ts = 0; } tick-10s { exit(0); }' -p <pid of redis-ping-pong-server in tokio-batch-benchmark>

CPU     ID                    FUNCTION:NAME
  2  18373                        :tick-10s

  latency
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@                                        1096
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      68887
            8192 |@@@@                                     8056
           16384 |@                                        1143
           32768 |                                         279
           65536 |                                         53
          131072 |                                         26
          262144 |                                         20
          524288 |                                         11
         1048576 |                                         3
         2097152 |                                         1
         4194304 |                                         0
         8388608 |                                         2
        16777216 |                                         0

The flamegraph I generated in my Linux test environment shows that 88% of the time in this fake Redis server is spent on syscalls; it keeps firing sendto. The more data we can push through each syscall, the higher the throughput.

@mre
Copy link
Owner

mre commented Nov 18, 2019

That makes a lot of sense. Thanks for the very detailed explanation and for creating this pull request in general. I'll merge it now. 🎉 Will publish a new version right after, but we should also bump the version once your fix for SplitSink has landed.

mre merged commit 661b5b4 into mre:master on Nov 18, 2019
@doyoubi
Contributor Author

doyoubi commented Nov 18, 2019

Thanks for creating this crate! It's very helpful for optimizing most network services. I'm glad I could work on it.
