Scalability issue visible when the number of concurrent requests > 128 #362
Interestingly, when I decrease the number of concurrent requests from 384 to only 64, scylla_db now outperforms cassandra_cpp on almost every metric (except min latency). I just wonder, maybe the internal driver queue / buffer is too small?
BTW: I'm not saying these latencies from scylla are worse - actually, in a real app I'd take lower tail latency over higher min latency every time. However, I'm just curious why I'm getting those numbers. Maybe there is an opportunity to improve something even more.
Looking at perf stats, one thing I notice is that the scylla branch spends a lot more time in the kernel and makes 2x more context switches:
Both runs do 1,000,000 iterations of the read.rn benchmark, single thread:
Although there is a huge improvement since 0.3.0, it looks like the scylla driver still makes significantly more syscalls than cassandra_cpp. It seems to split the data into smaller chunks, so the system overhead is higher. Can we have an option to configure the scylla driver to make bigger batches when high throughput is more important than latency, or simply when a high number of concurrent requests are scheduled (many pending calls)? I noticed there is a hardcoded buffer size of 8196 bytes in your code. I was wondering - does the size of that buffer affect the batching behaviour? Maybe it would make sense to expose its size in the options? BTW: On a fixed-rate test I'm getting much better latencies from the scylla driver, in particular p75-p99.9.
Indeed, increasing buffer sizes will increase throughput at the cost of latency, up to a point. The whole idea is to find a sweet spot for your workload and hardware, and you're right that it's not currently possible in scylla-rust-driver, because the sizes are hardcoded. I was planning to push a patch that makes the sizes configurable - I'll put it higher on my priority list and push a pull request early next week.
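To illustrate the knob being discussed - this is only a sketch of the general mechanism (a tokio BufWriter with a caller-chosen capacity wrapping the write half of the connection), not necessarily how the driver implements its buffering internally:

use tokio::io::{AsyncWriteExt, BufWriter};
use tokio::net::TcpStream;

// Sketch only: the buffer capacity is the tunable in question.
async fn connect_with_write_buffer(addr: &str, write_buf_size: usize) -> std::io::Result<()> {
    let stream = TcpStream::connect(addr).await?;
    let (_read_half, write_half) = stream.into_split();
    let mut writer = BufWriter::with_capacity(write_buf_size, write_half);
    writer.write_all(b"request frame 1").await?; // buffered, usually no syscall yet
    writer.write_all(b"request frame 2").await?; // still buffered
    writer.flush().await?; // a single write() syscall covers both frames
    Ok(())
}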
Cool, let me quickly fork it then and check if changing that buffer size makes any difference first ;) |
Sure. Btw, no need to fork the repo - once I send a pull request to a branch, you can specify a git branch as a dependency in Cargo.toml:
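Something along these lines - the branch name below is just a placeholder for whatever the pull request branch ends up being called:

[dependencies]
# branch name is a placeholder, to be replaced with the actual PR branch
scylla = { git = "https://github.com/scylladb/scylla-rust-driver", branch = "configurable-buffer-sizes" }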
I increased the buffer size to 32k and can't see any change in the numbers of syscalls.
What is the policy for flushing those buffers? Also, what controls the frequency of reading from the socket into the read buffer? Maybe it reads so frequently that there isn't enough data in the socket to read?
So I guess this is the main loop:

while let Some(mut task) = task_receiver.recv().await {
let mut num_requests = 0;
let mut total_sent = 0;
while let Some(stream_id) = Self::alloc_stream_id(handler_map, task.response_handler) {
let mut req = task.serialized_request;
req.set_stream(stream_id);
let req_data: &[u8] = req.get_data();
total_sent += req_data.len();
num_requests += 1;
write_half.write_all(req_data).await?;
task = match task_receiver.try_recv() {
Ok(t) => t,
Err(_) => break,
}
}
trace!("Sending {} requests; {} bytes", num_requests, total_sent);
write_half.flush().await?;
}

After enabling logging, I noticed this only initially sends a full buffer with requests, and then it quickly drops down to sending low numbers of requests:
The client app uses … Hence, the buffer size does not matter that much - the driver is very "eager": whatever comes in, it jumps to process it and doesn't let things accumulate. BTW: This problem was recognized by the creators of the DataStax C++ driver, and there is a configurable delay (default 200 µs) during which the driver waits for more requests in order to fill up the buffer - so if the buffer doesn't fill up, it adds at worst those 200 µs to the latency, but saves many syscalls by potentially batching more queries at once. If you wish, I can try to implement such a configurable delay if time permits this weekend (it could be turned off by default, so it wouldn't cause any regression for latency-sensitive workloads).
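A minimal sketch of that coalescing idea (this is not the DataStax or scylla driver code, and the channel/writer types are assumptions): keep pulling requests from the channel until either a byte threshold or a small deadline is hit, then flush once. Note that tokio's timer only has roughly millisecond granularity, so a 200 µs window is approximate at best:

use tokio::io::AsyncWriteExt;
use tokio::sync::mpsc;
use tokio::time::{timeout_at, Duration, Instant};

async fn write_loop<W: AsyncWriteExt + Unpin>(
    mut rx: mpsc::Receiver<Vec<u8>>,
    mut writer: W,
    max_batch_bytes: usize,
    coalesce_window: Duration,
) -> std::io::Result<()> {
    while let Some(first) = rx.recv().await {
        let deadline = Instant::now() + coalesce_window;
        let mut batch_bytes = first.len();
        writer.write_all(&first).await?;
        // Keep batching until the window elapses or the batch is large enough.
        while batch_bytes < max_batch_bytes {
            match timeout_at(deadline, rx.recv()).await {
                Ok(Some(req)) => {
                    batch_bytes += req.len();
                    writer.write_all(&req).await?;
                }
                _ => break, // channel closed or coalescing window elapsed
            }
        }
        writer.flush().await?; // one syscall per batch instead of one per request
    }
    Ok(())
}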
Of course, all contributors are welcome. We'll probably want to preserve the current behavior as the default, but allowing the driver to explicitly accumulate requests for X microseconds sounds like a very nice tuning option.
I've spent a bit of time today on this, and there is something weird going on, but I haven't figured it out yet. :( Because I couldn't find anything in tokio to do an async sleep with sub-millisecond timing, my first attempt was to just spin on try_recv, yielding between attempts:

while let Some(mut task) = task_receiver.recv().await {
let start_time = Instant::now();
let mut num_requests = 0;
let mut total_sent = 0;
'outer: while let Some(stream_id) =
Self::alloc_stream_id(handler_map, task.response_handler)
{
let mut req = task.serialized_request;
req.set_stream(stream_id);
let req_data: &[u8] = req.get_data();
total_sent += req_data.len();
num_requests += 1;
write_half.write_all(req_data).await?;
task = loop {
match task_receiver.try_recv() {
Ok(t) => break t,
Err(_) => {
if Instant::now() - start_time < tokio::time::Duration::from_micros(200) && total_sent < 8196
{
tokio::task::yield_now().await;
} else {
break 'outer;
}
}
}
}
}

Then I tried blocking instead of yielding. So I conclude, maybe those syscalls are a red herring and the real issue is somewhere else (or there is no problem at all and I'm nitpicking - 100k max throughput vs 120k with 1 thread is not such a big difference, and with 4 threads it's only a tad slower). However, I tried one more thing - what happens if I increase the number of concurrent queries to a crazy value on the latte side? It is easy to do so with the -p option:
Wait, what?! 24k req/s? 5x lower? Let's try that with master branch using cassandra_cpp:
Now Cassandra is also a lot more loaded - it gets about 450-500% CPU, and total idle is about 15-20% (still weird why, but it was always like that, so that might be a problem in Cassandra). So overall, it looks like there is a bottleneck somewhere when a crazy number of futures is outstanding. Can you try with -p 4096 (or even higher values) and check if you also reproduce that? Generally, the higher you go, the more the throughput drops, while with cassandra_cpp the throughput first goes up, then flattens, but does not drop - further increasing the number of concurrent requests only increases the latency.
Seems to be inversely linear. Going from 4096 to 8152 concurrent queries halves the throughput almost exactly. I'm using a single-threaded tokio runtime here, so it is likely not classic thread contention.
I made a flamegraph and it is hugely confusing to me. First, it doesn't show that the scylla request takes a huge amount of time, at least not directly. Also, when I change the workload to … I wonder, could the future returned from scylla execute be deeply nested - e.g. aren't we running into the problem described here: https://internals.rust-lang.org/t/questioning-the-design-of-the-future-trait-and-the-cost-implied-by-it/14217 ?
Ok, so I think I got it. Indeed, the driver must be running into a recursive / deeply nested future problem. Before (session.rs:206):

let rs = self.inner.execute(statement, params).await;

After:

let session = self.inner.clone();
let statement = statement.clone();
let rs =
    tokio::task::spawn(async move { session.execute(statement.as_ref(), params).await })
        .await
        .unwrap();

Such a tiny change and a 10x performance difference.
Very impressive detective work! I just ran a simple experiment and can confirm that increasing the parallelism causes reduced throughput after a sweet spot is reached. I wonder if this problem (https://internals.rust-lang.org/t/questioning-the-design-of-the-future-trait-and-the-cost-implied-by-it/14217) is even solvable without explicitly spawning a task; we'll definitely need to educate ourselves more on Rust and Tokio internals to try and bypass this. /cc @piodul
I mean, it obviously is solvable, but it would be great to find a way that doesn't involve rewriting the whole core of the driver. |
It's still quite interesting that adding a spawn in latte already leads to such a reduction of the problem - it would mean that the depth of futures in the driver itself is not problematic on its own, because otherwise a spawn wouldn't help (credit for this observation goes to @piodul), but once the driver is used from within an application that adds more depth - like latte - the problem gets amplified. I wonder if the cassandra-cpp backend wouldn't eventually experience the same issue, just much later, because the bindings make the depth of the resulting state machine much smaller. In any case, I did some quick experiments and it's evident that for large parallelism values, simplifying the state machine causes vast improvements - e.g. I just prepared a branch which ignores retry policy and speculative retries, and it immediately increases throughput by around 20% for parallelism ~2048 (I picked 2048 because my local machine already experiences a huge drop in throughput compared to lower values).
Obviously I can't rule out for sure that the O(n) complexity happens on the latte side. However, I double-checked: the only place where multiple queries are processed is the main loop, which is pretty simple and standard - it uses https://docs.rs/futures/0.3.18/futures/stream/futures_unordered/struct.FuturesUnordered.html. And the main loop is here:
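For reference, the general shape of that loop (not latte's exact code; do_query stands in for the per-request future) is roughly:

use futures::stream::{self, StreamExt};

async fn do_query(i: u64) -> u64 {
    // stand-in for building parameters and calling session.execute(...)
    i
}

async fn run(parallelism: usize, count: u64) {
    let mut results = stream::iter(0..count)
        .map(do_query)
        .buffer_unordered(parallelism); // at most `parallelism` futures in flight
    while let Some(_result) = results.next().await {
        // record latency / statistics here
    }
}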
Right - I find it confusing as well, especially having read latte's source code. The only candidate for a culprit is indeed FuturesUnordered.
The main question, though, is: are those state machines fixed-size, or do they somehow grow with the number of queries submitted to the driver? According to the Rust forums thread, you may run into an issue if there are recursive .await calls or if there is
I tried to inspect the code rather thoroughly and didn't find any obvious recursive calls - retry policies are candidates, but dropping this path did not mitigate the issue. Also, if recursive awaits in the driver were the root cause, I would assume that with sufficiently high parallelism you'd still hit the issue even after spawning a new task for each execute - since the driver-side state machines would still eventually grow big enough to cause noticeable overhead.
Switched from buffer_unordered to buffered. The actual parallelism level went down, because that introduced a head-of-line blocking problem. For -p 8152 I'm getting:
Which is obviously far below expectations. Later I'll try to write a test with an artificially large state machine and check if buffer_unordered shows the same weird O(N) behavior.
I haven't verified if my hastily crafted patch is correct, but I'll present it below anyway. I dropped the idea of using buffer_unordered.
Unless I made some grievous design mistake, this approach works quite similarly to your original code, with the exception that it makes the parallelism come in waves - a first batch must finish before the next one is scheduled. Still, this patch vastly speeds things up. I don't have any hard proof except for the observation above, but it kind of looks like buffer_unordered is involved.

diff --git a/src/count_down.rs b/src/count_down.rs
index 392325d..0ae836c 100644
--- a/src/count_down.rs
+++ b/src/count_down.rs
@@ -41,6 +41,10 @@ impl CountDown {
}
}
}
+
+ pub fn load(&self) -> u64 {
+ self.value.load(Ordering::Relaxed)
+ }
}
pub struct BatchedCountDown {
@@ -72,4 +76,8 @@ impl BatchedCountDown {
true
}
}
+
+ pub fn shared_count(&self) -> u64 {
+ self.shared.load()
+ }
}
diff --git a/src/main.rs b/src/main.rs
index 5b6f9bb..26aefcd 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -7,6 +7,7 @@ use std::time::{Duration, Instant};
use clap::Clap;
use futures::future::ready;
+use futures::stream::futures_unordered::FuturesUnordered;
use futures::{Future, SinkExt, Stream, StreamExt};
use scylla::Session;
use tokio::runtime::Builder;
@@ -92,34 +93,40 @@ where
let action = &action;
let mut remaining_count = BatchedCountDown::new(count, 64);
let pending_count = AtomicUsize::new(0);
- let mut req_stats = IntervalStream::new(interval(rate))
- .take_while(|_| ready(remaining_count.dec()))
- .enumerate()
- .map(|(i, _)| {
- let context = context.clone();
- let pending_count = &pending_count;
- async move {
- let queue_len = pending_count.fetch_add(1, Ordering::Relaxed);
- let start = Instant::now();
- let result = action(context, i as u64).await;
- let end = Instant::now();
- pending_count.fetch_sub(1, Ordering::Relaxed);
- RequestStats::from_result(result, end - start, queue_len)
- }
- })
- .buffer_unordered(parallelism);
let mut sample = Sample::new(Instant::now());
- while let Some(req) = req_stats.next().await {
- sample.record(req);
- let now = Instant::now();
- if now - sample.start_time > sampling_period {
- let start_time = sample.start_time;
- let elapsed_rounded = round(now - start_time, sampling_period);
- let end_time = start_time + elapsed_rounded;
- sample.finish(end_time);
- tx.send(sample).await.unwrap();
- sample = Sample::new(end_time);
+
+ while remaining_count.shared_count() > 0 {
+ let mut req_stats = IntervalStream::new(interval(rate))
+ .take_while(|_| ready(remaining_count.dec()))
+ .enumerate()
+ .map(|(i, _)| {
+ let context = context.clone();
+ let pending_count = &pending_count;
+ async move {
+ let queue_len = pending_count.fetch_add(1, Ordering::Relaxed);
+ let start = Instant::now();
+ let result = action(context, i as u64).await;
+ let end = Instant::now();
+ pending_count.fetch_sub(1, Ordering::Relaxed);
+ RequestStats::from_result(result, end - start, queue_len)
+ }
+ })
+ .take(parallelism)
+ .collect::<FuturesUnordered<_>>()
+ .await;
+
+ while let Some(req) = req_stats.next().await {
+ sample.record(req);
+ let now = Instant::now();
+ if now - sample.start_time > sampling_period {
+ let start_time = sample.start_time;
+ let elapsed_rounded = round(now - start_time, sampling_period);
+ let end_time = start_time + elapsed_rounded;
+ sample.finish(end_time);
+ tx.send(sample).await.unwrap();
+ sample = Sample::new(end_time);
+ }
}
}
}
if !sample.is_empty() {
I'll test it tomorrow, but beware that your patch lowers the average concurrency level, because the number of pending futures will drop to zero in each wave. So obviously it would improve throughput even if the original problem remains. However, if you're seeing a huge improvement from that, it is even more weird, because
It does lower the average concurrency, but it comes in bursts, so if the problem is that high concurrency doesn't mix well with the large state machines produced by compiling the driver, I think it would get amplified after the patch. The reasoning is that each time a burst of concurrency comes, it would suffer from reduced throughput, and the breaks between bursts only lower the throughput further, because nothing gets executed. But that's still just a guess. In any case, I keep pushing this theory because it's so confusing that adding a tokio::spawn in latte helps so much.
I did a test and it looks like
Here's the code:

use futures::stream::futures_unordered::FuturesUnordered;
use futures::stream::StreamExt;
use tokio::sync::Mutex;
use std::sync::Arc;
async fn work(mutex: Arc<Mutex<()>>) {
mutex.lock().await;
}
async fn do_test(n: usize) {
let start: std::time::Instant = std::time::Instant::now();
let mutex: Arc<Mutex<()>> = Arc::new(Mutex::new(()));
let mutex_guard = mutex.lock().await;
let mut futs = Vec::new();
for _ in 0..n {
futs.push(work(mutex.clone()));
}
let mut unorder: FuturesUnordered<_> = futs.into_iter().collect();
std::mem::drop(mutex_guard);
for _ in 0..n {
unorder.select_next_some().await;
}
let end: std::time::Instant = std::time::Instant::now();
println!("n: {}, time: {}ms", n, end.duration_since(start).as_millis());
}
#[tokio::main]
async fn main() {
for n in [10_000, 20_000, 40_000, 80_000, 160_000] {
do_test(n).await;
}
}
[dependencies]
tokio = { version = "1.12", features = ["full"] }
futures = "0.3.6"
Does it though? It has mechanisms for not polling its futures more than once, but I don't see any mechanism that prevents polling O(n) futures when the futures inside FuturesUnordered are awakened, until it finds the one that's actually ready.
It looks like it has two queues: one with all the futures it manages, and one with the futures that have been woken and are ready to be polled.
When polling a managed future it gives it a custom Waker which, when invoked, moves that future onto the ready queue. This way it calls poll only on the futures that have actually been woken. That should ensure that performance is fine o_0
Alright, I was looking at the source code for some ancient 0.1.x version... I keep doing that when browsing Rust docs. |
@cvybhu in any case, please post your reproducer and results on futures' bug tracker: https://github.com/rust-lang/futures-rs/issues and link to this issue; perhaps it's a known thing.
Oh, we had a race here. Never mind :) |
Opened rust-lang/futures-rs#2526 |
@pkolaczk @cvybhu following the advice from rust-lang/futures-rs#2526 (comment), the patch below indeed not only makes the problem disappear completely, but also significantly speeds up the execution even for lower concurrency values for me!

diff --git a/src/main.rs b/src/main.rs
index 5b6f9bb..cd80643 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -101,7 +101,7 @@ where
async move {
let queue_len = pending_count.fetch_add(1, Ordering::Relaxed);
let start = Instant::now();
- let result = action(context, i as u64).await;
+ let result = tokio::task::unconstrained(action(context, i as u64)).await;
let end = Instant::now();
pending_count.fetch_sub(1, Ordering::Relaxed);
RequestStats::from_result(result, end - start, queue_len)
Thanks for the investigation. Wow, I'd never have thought about a trap like that!
Wait, adding unconstrained means it can block the executor?
It will only block if the driver never yields, i.e. sockets are always ready and so on. For a benchmarking tool, turning off yielding might make sense, depending on how it influences latency.
@cvybhu I'm reading that differently - it doesn't yield only if the future is ready, but it still yields if the future is not ready. However, I fail to see how it is connected to the O(n^2) complexity of FuturesUnordered.
Here's my understanding of what's going on: tokio gives each task a cooperative budget of roughly 128 operations per poll. Once that budget is exhausted, every leaf resource (channel recv, socket read, mutex lock) returns Pending and immediately wakes itself. FuturesUnordered then sees all those self-wakes and re-polls the whole set on the next pass, so with n futures in flight each pass does O(n) work while completing only ~128 of them - which adds up to roughly O(n^2) overall. Wrapping the future in unconstrained disables the budget, so the futures are never artificially forced to return Pending.
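A tiny standalone sketch (unrelated to the driver) of what unconstrained changes - it opts a future out of that cooperative budget:

use tokio::task::unconstrained;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();
    for i in 0..10_000 {
        tx.send(i).unwrap();
    }
    drop(tx);
    // With the budget in place, draining the channel is forced to yield roughly
    // every 128 receives; wrapped in unconstrained it drains in a single poll.
    unconstrained(async {
        while rx.recv().await.is_some() {}
    })
    .await;
}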
I added unconstrained and gave it a test. The results are different but still unsatisfactory. The quadratic complexity issue is indeed gone: upping the parallelism level does not lower the throughput. The max throughput went up by about 10-15%. However, my max throughput is still lower than the one I can get from cassandra_cpp (another 15% needed). What I noticed is that now, regardless of the -p setting, the actual average number of pending futures is about 150-170 and hits some kind of a ceiling.
-p 256: actual concurrency 116.3, throughput 114k
-p 512: actual concurrency 131.9, throughput 114k
-p 1000: actual concurrency 166.3, throughput 112k
cassandra_cpp is doing about 125-130k on the same server, with 1 thread, and the actual parallelism level is much higher - it easily goes above 350. Note that these are extremely tiny statements on a link with extremely low latency, so parallelism must be high in order to fully load C* on the other side. If I increase the number of threads to 2, then I can almost match cassandra_cpp on 1 thread.
I've got 2 hypotheses about what's causing this:
1. Now that I've added unconstrained, the producer gets starved and simply can't deliver enough concurrent requests for execution; all power goes into the driver to get them processed as fast as possible.
2. Because cassandra_cpp has a separate libuv executor, it is not truly single-threaded, so it is not a true apples-to-apples comparison: the producer runs on a different thread than the I/O thread sending the requests. So it might simply be too much work for a single thread - scylla spends most of the time processing the requests and not much of the CPU core is left for producing them. However, CPU time seems to disprove that - CPU time is not higher for cassandra_cpp at 130k req/s.
BTW: if I remove the actual session.execute call and leave all the other things in place (running the script, including parameter conversion), it reaches over 610k req/s. So the producer is definitely fast enough for this test. Obviously I can work on that part more, but first I'd like to rule out problem 1.
Interesting. You can pin the process to a single core and see if the result changes.
That's a great idea!
cassandra_cpp:
scylla:
scylla 2x slower. Why? Something is wrong. I tried to swap the futures crate to that new branch with a fix instead of using unconstrained.
And I verified CPU frequency was properly locked at 3 GHz during both runs. |
It looks like it sends one request at a time:
I added |
Oh, man, another trap here:
So it looks like the producer is indeed starved - it can add only one new request per each write to the channel, and write buffering doesn't work properly (and we already know write buffering is essential to get good perf here). So I guess the way to go is to wait for
I managed to test that by temporarily dropping |
Well, one more thing I tried - wrapping the |
I tried to debug how buffer_unordered actually polls scylla futures and got a weird observation:
Something fishy must be going on on the latte side then, stay tuned. |
Looks like there is a huge difference just stemming from launching on another thread:

let rt = Builder::new_current_thread()
.enable_all()
.build()
.unwrap();
std::thread::spawn(move || {
rt.block_on(async move {
println!("Stream started");
let mut s = futures::stream::iter(0..10)
.map(|i| DebugFuture::new(format!("q {}", i), loader.run(i)) )
.buffer_unordered(5);
while let Some(result) = s.next().await {
println!("Got query result: {:?}", result);
}
println!("Stream ended");
});
});
tokio::time::sleep(Duration::from_secs(100)).await;

Output:
But the same code executed on the main (also single-threaded) runtime, without spawning an additional thread:

println!("Stream started");
let mut s = futures::stream::iter(0..10)
.map(|i| DebugFuture::new(format!("q {}", i), loader.run(i)) )
.buffer_unordered(5);
while let Some(result) = s.next().await {
println!("Got query result: {:?}", result);
}
println!("Stream ended");
Any ideas why this could be? I believe I should figure out how to avoid running my own threads and just use a single multithreaded runtime instead.
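For what it's worth, a single shared multi-threaded runtime (instead of dedicated threads each running their own current-thread runtime) looks roughly like this - a sketch, not latte's actual setup:

use tokio::runtime::Builder;

fn main() {
    let rt = Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .unwrap();
    rt.block_on(async {
        // build the query stream here and drive it with buffer_unordered;
        // work spawned with tokio::spawn is balanced across the worker threads
    });
}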
Not sure if this is the right issue to comment on, but I was quite surprised to see that cdrs_tokio is quicker than this crate in most scenarios according to the benchmarks provided in the readme: https://github.com/krojew/cdrs-tokio. Maybe it would be interesting to see how the maintainer managed to make it perform so quickly.
@Jasperav the benchmarks presented in the readme do not mention any of the following:
so it's hard to deduce anything from the output. For any reproducible benchmarks we'd be happy to rerun and verify them, and do our best to match and/or outperform other drivers, if they prove to be faster. |
After I switched to a single tokio runtime I'm getting way better results now, and buffering on Scylla 0.3.1 works properly in that scenario, even with …
Running on a single executor also has this nice property that now when I specify …
I have also found that adding …
To summarize:
Feel free to close the ticket. @Jasperav there are a lot of different things that can affect benchmark results. |
FYI, FuturesUnordered will likely soon be able to detect if a future merely wants to cooperatively yield: rust-lang/futures-rs#2551. The nice trick is checking whether it's the task that wants to wake itself, vs. something else wanting to wake it.
@pkolaczk if the problem persists in the newest version of the driver, please let us know and we'll reopen this.
So I ported the current master of latte to the scylladb driver, and I noticed a weird issue - minimum and average latencies are much higher than with cassandra_cpp, and throughput is still a bit behind in this test (even though this test is short, the phenomenon also exists when running it for longer). I've run it a few times, also in reverse order, and it is very repeatable. The difference is larger on a single thread.
Are there any tuning options for the buffering thing you've added recently?
BTW: not sure why the number of returned rows is only half the number of queries - that looks like a bug in the workload generation/definition; it's exactly the same code on both sides, so it shouldn't affect the validity of this test.