
perf: kv falls off a cliff at ~12 nodes #26178

Closed

benesch opened this issue May 29, 2018 · 22 comments

@benesch
Contributor

benesch commented May 29, 2018

[screenshot]

@petermattis, from Slack:

Yesterday I discovered that kv performance falls off a cliff somewhere between 12 and 16 nodes in a cluster. Doing some more testing overnight, I produced the chart above.

Running a similar test on a 6-node cluster produces a smooth throughput curve.

Similar increases in concurrency on smaller clusters don't show the same behavior, which implies this isn't a number-of-goroutines / conns problem. Possibly network.

(We expect the throughput to smoothly level off while the latency increases.)

@benesch benesch added the C-performance and A-kv-transactions labels May 29, 2018
@benesch benesch self-assigned this May 29, 2018
@benesch
Contributor Author

benesch commented Jun 4, 2018

Generation instructions:

$ roachprod test benesch-test-kv --binary ./cockroach kv_95 -c 1-64/64
$ roachprod web kv_95.cockroach*

For a smoother graph, turn the concurrency to 1-64/16 (the default).

I'm using normal roachprod VMs and have yet to repro the badness.

[screenshot]

@benesch
Contributor Author

benesch commented Jun 4, 2018

n1-highcpu-16 did the trick. Consider this reproduced.

[screenshot]

@nvanbenschoten nvanbenschoten added this to the 2.1 milestone Jul 22, 2018
@tbg tbg added the A-coreperf label and removed the A-disaster-recovery, A-kv-transactions, A-kv-distribution, A-kv-client, A-storage, and A-kv-replication labels Jul 31, 2018
@nvanbenschoten
Member

@m-schneider and I have our first lead. All of the graphs on the admin UI and the CPU and heap profiles looked fine. Everything was either equal or actually larger on the cluster at the concurrency level before the cliff (scaled exactly to the delta in throughput). The blocking profiles also looked similar until we ignored blocking in Go select blocks. Then we saw this:

600 concurrency

[screenshot]

700 concurrency

[screenshot]


On the very left of the 700-concurrency flame graph, we can see that 4.22% of the blocking is in StmtBuf.curCmd. The same method is responsible for only 0.14% of blocking in the 600-concurrency case. StmtBuf.curCmd is called by a SQL connection's connExecutor goroutine when it is ready for another statement; the method waits for the corresponding pgwire goroutine to populate the buffer with a new statement. I think this indicates some slowdown on the workload/gateway connection. It's possible that we're hitting a connection limit somewhere in the network stack that's causing a slowdown there. It's also possible that this is a slowdown in the kv workload itself. We're going to continue digging in there.
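For context, here's a minimal sketch of the handoff being described (a toy version, not the actual pgwire/connExecutor code; the real StmtBuf uses a shared slice guarded by a condition variable, as described further down in this thread):

```go
package main

import (
	"fmt"
	"sync"
)

// cmdBuf is a toy stand-in for StmtBuf: a producer goroutine (the pgwire
// connection) appends commands to a shared slice, and a consumer goroutine
// (the connExecutor) blocks in curCmd until the next command is available.
type cmdBuf struct {
	mu   sync.Mutex
	cond *sync.Cond
	cmds []string
	pos  int
}

func newCmdBuf() *cmdBuf {
	b := &cmdBuf{}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// push is the producer side: append a command and wake the consumer.
func (b *cmdBuf) push(cmd string) {
	b.mu.Lock()
	b.cmds = append(b.cmds, cmd)
	b.mu.Unlock()
	b.cond.Signal()
}

// curCmd is the consumer side: block until a command past pos exists.
// The Cond.Wait here is the kind of blocking the profile is pointing at.
func (b *cmdBuf) curCmd() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.pos >= len(b.cmds) {
		b.cond.Wait()
	}
	cmd := b.cmds[b.pos]
	b.pos++
	return cmd
}

func main() {
	b := newCmdBuf()
	go func() {
		b.push("SELECT 1")
		b.push("SELECT 2")
	}()
	fmt.Println(b.curCmd())
	fmt.Println(b.curCmd())
}
```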

@m-schneider is also going to run the same experiment on AWS to see if she's able to reproduce the results.

@m-schneider
Contributor

After taking another look, this looks very reproducible on a 16-node n1-highcpu-16 cluster. After seeing the same behavior as @benesch, I ran a version of the test that incremented concurrency a bit more slowly and found that the drop-off is generally around a concurrency of 650:
[image]

Then I tried the same test with a 24-node cluster and also saw a very similar drop-off, though marginally later:
[image]

@nvanbenschoten
Member

I think this indicates some slowdown on the workload/gateway connection. It's possible that we're hitting a connection limit somewhere in the network stack that's causing some slowdown there.

@a-robinson you have a lot of knowledge about networking. I'm interested in how you'd recommend investigating something like this.

@petermattis
Collaborator

Is the worker machine overloaded or at capacity? Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in StmtBuf.curCmd? We could add some instrumentation to check what that latency is.

I've used iftop in the past to look at network stats on the machine.
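A typical invocation might look something like this (the interface name is machine-specific and the flags are just one reasonable choice):

$ iftop -i eth0 -n -P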

@nvanbenschoten
Member

Is the worker machine overloaded or at capacity?

No, at the higher concurrency the CPU utilization actually drops because of the reduced throughput.

Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in StmtBuf.curCmd?

That's possible, although it would surprise me if such an issue reliably only showed up above a very specific concurrency threshold.

I've used iftop in the past to look at network stats on the machine.

Thanks for the tip!

@nvanbenschoten
Member

I spent a bit more time comparing CPU profiles at concurrency levels below the cliff with those above it. One thing that jumped out was the increase in time spent in runtime.schedule->runtime.findrunnable. Below the cliff this accounts for 5-6% of the profile; above it, that number doubles to around 11%. This is pretty significant, although I don't know exactly what to make of it right now. We expect the number of goroutines to grow proportionally with the number of SQL connections, but runtime.findrunnable doubling due to a ~20% increase in SQL connections seems awfully suspicious.

On a whim, I began tweaking StmtBuf to see if we were on to anything with the blocking in StmtBuf.curCmd. StmtBuf currently uses a slice that's shared between the producer (pgwire conn goroutine) and consumer (executor goroutine). It uses a condition variable to signal updates to the buffer and coordinate between the two goroutines. One theory I had was that this cond var may be sub-optimal in terms of quickly preempting the producer and scheduling the consumer whenever it is signalled. I ran an experiment where I switched the StmtBuf to use a buffered channel instead of the condition variable: nvanbenschoten@7408539. I like the change as it both simplifies the code and comes off as more idiomatic, but unfortunately it didn't actually have any impact on performance. The blocking contribution from StmtBuf.curCmd's Cond.Wait call was simply replaced by roughly the same contribution from StmtBuf.curCmd's new runtime.selectgo call. It's possible that making this an unbuffered channel would have an effect, but that wasn't as easy a change.
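A rough sketch of the shape of that experiment (not the actual patch in nvanbenschoten@7408539), to make the producer/consumer swap concrete:

```go
package main

import "fmt"

// chanCmdBuf replaces the slice-plus-condition-variable coordination with a
// buffered channel: push becomes a channel send and curCmd a channel receive,
// so the blocking formerly attributed to Cond.Wait simply moves into the
// channel operations instead.
type chanCmdBuf struct {
	cmds chan string
}

func (b *chanCmdBuf) push(cmd string) { b.cmds <- cmd }   // pgwire goroutine
func (b *chanCmdBuf) curCmd() string  { return <-b.cmds } // executor goroutine

func main() {
	// Buffered, as in the experiment; an unbuffered channel would force a
	// direct handoff between the two goroutines.
	b := &chanCmdBuf{cmds: make(chan string, 16)}
	go func() {
		b.push("SELECT 1")
		b.push("SELECT 2")
	}()
	fmt.Println(b.curCmd())
	fmt.Println(b.curCmd())
}
```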

I also tested with calling runtime.Gosched immediately after signalling the condition variable, but that again had no effect.

I'm interested to see the results of @m-schneider's investigation into whether this is new since the connExecutor refactor. That will give us some indication of whether this is a scheduling issue at the pgwire/connExecutor boundary or whether it might be a scheduling issue at the network/pgwire boundary.

@benesch
Contributor Author

benesch commented Aug 29, 2018 via email

@nvanbenschoten
Member

This seems somewhat related to golang/go#18237 and https://groups.google.com/forum/#!topic/golang-nuts/6zKXeCoT2LM. My main takeaway from this so far is that having goroutine coordination on the hot path of SQL statement execution is unfortunate. I wonder how hard it would be to rip out the second goroutine for the sake of experimentation.

cc. @andreimatei.

@nvanbenschoten
Member

One other interesting experiment would be short-circuiting everything. If
you hacked up your cockroach binary to return “no results” for every
external query instantly, you’d be able to zero in on whether this is a
network problem, load generator problem, or actual Cockroach problem.

That's an interesting idea which is worth exploring.

@andreimatei
Contributor

cc @andy-kimball as he was also looking at the impact of that goroutine bounce in the executor a while ago.

@petermattis
Collaborator

One other interesting experiment would be short-circuiting everything. If
you hacked up your cockroach binary to return “no results” for every
external query instantly, you’d be able to zero in on whether this is a
network problem, load generator problem, or actual Cockroach problem.

Short-circuiting everything can prevent the load generator from running (i.e. you need CREATE TABLE to succeed). In the past, what I had to do was add a simple textual recognition of the load generator query and short-circuit execution of only that query. It was super interesting to see how fast pgwire worked. Definitely worthwhile to do this again with the new conn executor infrastructure.
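A sketch of what that textual recognition might look like (the matched statement text is hypothetical; the real hack would match whatever the load generator actually sends and leave setup statements alone):

```go
package main

import (
	"fmt"
	"strings"
)

// shortCircuit reports whether a query should be answered immediately with an
// empty result instead of being executed. Setup statements such as CREATE
// TABLE fall through and run normally.
func shortCircuit(query string) bool {
	return strings.HasPrefix(query, "SELECT k, v FROM kv WHERE k IN")
}

func main() {
	fmt.Println(shortCircuit("SELECT k, v FROM kv WHERE k IN ($1)"))          // true: skip execution
	fmt.Println(shortCircuit("CREATE TABLE kv (k INT PRIMARY KEY, v BYTES)")) // false: must still succeed
}
```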

@m-schneider
Contributor

After running the test on older versions of cockroach, it looks like the cliff already existed as early as the Dec 18th alpha.
[image]
However, even though overall throughput was lower in the October alpha, there doesn't seem to be an actual cliff:
[image]

Will keep bisecting.

@m-schneider
Contributor

After git bisecting, the improvement in throughput can be attributed to bcbde02. We're currently adding metrics to see if the new data structure could also be causing the cliff.

@m-schneider
Contributor

After instrumenting the timestamp cache, there doesn't seem to be any increase in latency coming from the skiplist implementation with increased concurrency:

[image]

@nvanbenschoten
Member

I spent some time looking into this after observing poor throughput scaling in sysbench as concurrency grew.

I was able to reproduce the behavior observed here by spinning up a 16 node cluster with n1-highcpu-16 machines. Instead of using roachprod test, I ran roachprod run <cluster>:17 -- ./workload run kv --init --read-percent=100 --write-seq=1000000 --splits=1000 --concurrency=<concurrency> --duration=5m {pgurl:1-16} directly, and alternated between a concurrency of 400 and 4000. When concurrency was at 400, throughput hovered around 200k. When concurrency was at 4000, throughput dropped to 120k. Future experimentation here should use this approach, as it makes it much easier to switch back and forth between the good and bad performance regimes.

I started by looking at performance profiles. The CPU profile showed very little difference between the two runs. Neither did the heap profile. Blocking profiles showed extra blocking in StmtBuf.curCmd, as we observed above. Mutex profiles were inconsistent and I didn't get any real insight out of them.

I then turned to the Go execution tracer. This is where things got interesting. The first execution trace is with --concurrency=400. The second is with --concurrency=4000.

[screenshot]

concurrency=400

[screenshot]

concurrency=4000

There are a number of things that jump out from these traces:

  • processor utilization is severely degraded in the second trace
  • the GC is significantly longer in the first trace (because of the higher throughput)
  • there are more runnable goroutines and running threads in the first trace
  • there are more timer events and syscalls in the first trace
  • there are significantly more network events in the second trace

Out of these two traces, it's pretty clear that the qualities of the first are more conducive to high throughput. So what gives? Why the degraded processor utilization in the second trace?

I turned to the execution trace's built-in profiles to try to answer this. My first stop was the network blocking profile. This didn't provide too much insight. Both profiles showed 100% of network blocking in netFD.Read. There was one difference though. In the good profile (conc=400), 93% of this blocking was attributed to the pgwire connection while 7% was attributed to the inter-node gRPC connection. In the bad profile (conc=4000), 100% was attributed to the pgwire connection.

My next stop was the execution trace's scheduler latency profile. This was more interesting. The good profile attributed 66% of scheduler latency to Mutex unlocking. The two main callers of this were Clock.Now (37%) and Clock.PhysicalNow (20%, avoidable, see #30520). The rest of the latency was attributed to runtime.netpolldeadlineimpl's call to runtime.ready. Things were completely different in the bad profile. Over 50% of the latency was attributed to runtime.selectnbsend, whose primary caller was controlBuffer.executeAndPut. If we focus in on this function, we see that its main caller (~70% of calls) is http2Client.NewStream, which is called exclusively by roachpb/internalClient.Batch. I went back to the good profile and found that selectnbsend is attributed with only 0.032% of scheduler latency there.

The differences here are stark and I think it's safe to conclude that this is contributing to the reduced throughput with higher client concurrency. The question now is what to do about it. I still need to look more at this to understand why we only see the issue with a higher concurrency, but the first thing that comes to mind is that we create a separate gRPC stream for each internalClient.Batch call. This is something @a-robinson and @petermattis have considered in the past, so I'm curious if they have any insight here.

Another interesting thing to note is that we can see in controlBuffer.executeAndPut that it only sends on controlBuffer.ch if consumerWaiting == true. I think we might be hitting some inflection point where consumers start waiting on the buffer, which causes things to slow down even further. This would explain the cliff-like behavior we see in baseline throughput. That's just speculation, though.
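For reference, the signaling pattern being described looks roughly like this (a simplified sketch, not gRPC's actual controlBuffer code):

```go
package main

import (
	"fmt"
	"sync"
)

// wakeBuf mimics the pattern: the producer only touches the wake-up channel
// when a consumer has flagged itself as waiting, so the non-blocking send
// (the runtime.selectnbsend seen in the profiles) is skipped on the fast path.
type wakeBuf struct {
	mu              sync.Mutex
	items           []string
	consumerWaiting bool
	ch              chan struct{}
}

func newWakeBuf() *wakeBuf {
	return &wakeBuf{ch: make(chan struct{}, 1)}
}

// put is the executeAndPut-like producer side.
func (b *wakeBuf) put(item string) {
	b.mu.Lock()
	b.items = append(b.items, item)
	wakeup := b.consumerWaiting
	b.consumerWaiting = false
	b.mu.Unlock()
	if wakeup {
		select {
		case b.ch <- struct{}{}: // wake the sleeping consumer
		default:
		}
	}
}

// get is the consumer side: drain items, and only sleep on ch when empty.
func (b *wakeBuf) get() string {
	for {
		b.mu.Lock()
		if len(b.items) > 0 {
			item := b.items[0]
			b.items = b.items[1:]
			b.mu.Unlock()
			return item
		}
		b.consumerWaiting = true
		b.mu.Unlock()
		<-b.ch
	}
}

func main() {
	b := newWakeBuf()
	go b.put("HEADERS frame")
	fmt.Println(b.get())
}
```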

I've included the two scheduler profiles here: scheduler_profiles.zip.

@nvanbenschoten
Member

My next step is to learn how all this controlBuffer stuff works and how it interacts with gRPC's loopyWriter.

@petermattis
Collaborator

I believe @a-robinson investigated having a pool of streams for internalClient.Batch calls, but there is some difficulty in properly handling context cancellation or something like that. I think this is documented somewhere, though I can't find it right now.

The loopyWriter stuff is all new since the last time I looked at gRPC. The signaling in controlBuffer.executeAndPut and controlBuffer.get is interesting. They seem to be using controlBuffer.ch as a condition variable. I wonder if there would be a benefit to actually using a condition variable here. You re-did the benchmark recently showing a condition variable to be faster than channels, right?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 1, 2018
In cockroachdb#26178, we saw that throughput hit a cliff while running `kv` at
high concurrency levels. We spent a while debugging the issue, but
nothing stood out in the `cockroach` process. Eventually I installed
pprof http handlers in `workload` (cockroachdb#30810). The CPU and heap profiles
looked fine but the mutex profile revealed that **99.94%** of mutex
contention was in `sql.(*Rows).Next`.

It turns out that this method manipulates a lock that's scoped to
the same degree as its prepared statement. Since `readStmt` was
prepared on the `sql.DB`, all kvOps were contending on the same lock
in `sql.(*Rows).Next`.

The fix is to give each `kvOp` its own `sql.Conn` and prepare the
statement with a connection-level scope. There are probably other areas
in `workload` that could use the same kind of change.

Before this change, `kv100 --concurrency=400` in the configuration
discussed in cockroachdb#26178 topped out at around 80,000 qps. After this change,
it tops out at around 250,000 qps.

Release note: None
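For anyone who wants the essence of that change in plain database/sql terms, here's a minimal sketch (the connection string, table, query, and driver choice are all illustrative):

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any postgres-compatible driver works for this sketch
)

func main() {
	ctx := context.Background()

	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/kv?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Before: one statement prepared on the *sql.DB and shared by every worker.
	// Iterating its rows goes through state scoped to that shared statement, so
	// all workers end up contending in sql.(*Rows).Next.
	sharedStmt, err := db.PrepareContext(ctx, "SELECT v FROM kv WHERE k = $1")
	if err != nil {
		log.Fatal(err)
	}
	defer sharedStmt.Close()

	// After: each worker checks out its own *sql.Conn and prepares the
	// statement at connection scope, so there is nothing shared across
	// workers to contend on.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	perConnStmt, err := conn.PrepareContext(ctx, "SELECT v FROM kv WHERE k = $1")
	if err != nil {
		log.Fatal(err)
	}
	defer perConnStmt.Close()
}
```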
@nvanbenschoten
Member

They seem to be using controlBuffer.ch as a condition variable. I wonder if there would be a benefit to actually using a condition variable here. You re-did the benchmark recently showing a condition variable to be faster than channels, right?

I looked into this and found that they can't easily use a cond variable there because they also want to wait on a context cancellation. Without adjusting other layers to catch the context cancellation and without golang/go#16620, there's not an easy path to making the replacement.

But that doesn't matter now because... here's the fix: #30811.

@nvanbenschoten
Member

nvanbenschoten commented Oct 1, 2018

To wrap this all up, it turns out that CPU utilization was low because the client was bottlenecking itself. There wasn't anything going wrong in cockroach itself; it just wasn't getting a lot of traffic.

I began suspecting the client after running Cockroach with GODEBUG=schedtrace=1000 and seeing:

  • fewer threads with the higher concurrency
  • more idle threads with higher concurrency
  • a near-empty global run queue at both concurrency levels
  • empty local run queues at both concurrency levels

For instance:

concurrency=400

SCHED 58482ms: gomaxprocs=16 idleprocs=0 threads=47 spinningthreads=3 idlethreads=9 runqueue=1 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

concurrency=4000

SCHED 21150ms: gomaxprocs=16 idleprocs=13 threads=43 spinningthreads=0 idlethreads=17 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
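(These SCHED lines are the Go runtime's scheduler trace, which is printed to stderr once per interval; they can be captured with something like the following, with the actual start flags elided.)

$ GODEBUG=schedtrace=1000 ./cockroach start <start flags> 2> schedtrace.log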

Combined with the execution traces above, it became apparent that goroutines weren't taking a particularly long time to be scheduled; there just weren't many goroutines to schedule.

@tbg
Member

tbg commented Oct 1, 2018

Good job. Lots of prime lunch and learn material, too.

craig bot pushed a commit that referenced this issue Oct 1, 2018
30811: workload: give each kvOp a separate sql.Conn r=nvanbenschoten a=nvanbenschoten

In #26178, we saw that throughput hit a cliff while running `kv` at
high concurrency levels. We spent a while debugging the issue, but
nothing stood out in the `cockroach` process. Eventually, I installed
pprof http handlers in `workload` (#30810). The CPU and heap profiles
looked fine but the mutex profile revealed that **99.94%** of mutex
contention was in `sql.(*Rows).Next`.

It turns out that this method manipulates a lock that's scoped to
the same degree as its prepared statement. Since `readStmt` was
prepared on the `sql.DB`, all kvOps were contending on the same lock
in `sql.(*Rows).Next`.

The fix is to give each `kvOp` its own `sql.Conn` and prepare the
statement with a connection-level scope. There are probably other places
in `workload` that could use the same kind of change.

Before this change, `kv100 --concurrency=400` in the configuration
discussed in #26178 topped out at around 80,000 qps. After this change,
it tops out at around 250,000 qps.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>