
perf: kv falls off a cliff at ~12 nodes #26178

Closed

benesch opened this issue May 29, 2018 · 22 comments

@benesch
Contributor

benesch commented May 29, 2018

[screenshot]

@petermattis, from Slack:

Yesterday I discovered that kv performance falls off a cliff somewhere between 12 and 16 nodes in a cluster. Doing some more testing overnight, I produced the chart above.

Running a similar test on a 6-node cluster produces a smooth throughput curve.

Similar increases in concurrency on smaller clusters don't show the same behavior, which implies this isn't a number-of-goroutines / conns problem. Possibly network.

(We expect the throughput to smoothly level off while the latency increases.)

@benesch benesch added the C-performance and A-kv-transactions labels May 29, 2018
@benesch benesch self-assigned this May 29, 2018
@benesch
Contributor Author

benesch commented Jun 4, 2018

Generation instructions:

$ roachprod test benesch-test-kv --binary ./cockroach kv_95 -c 1-64/64
$ roachprod web kv_95.cockroach*

For a smoother graph, turn the concurrency to 1-64/16 (the default).

I'm using normal roachprod VMs and have yet to repro the badness.

[screenshot]

@benesch
Contributor Author

benesch commented Jun 4, 2018

n1-highcpu-16 did the trick. Consider this reproduced.

[screenshot]

@nvanbenschoten nvanbenschoten added this to the 2.1 milestone Jul 22, 2018
@tbg tbg added the A-coreperf label and removed the A-disaster-recovery, A-kv-transactions, A-kv-distribution, A-kv-client, A-storage, and A-kv-replication labels Jul 31, 2018
@nvanbenschoten
Member

@m-schneider and I have our first lead. All of the graphs on the admin UI and the CPU and heap profiles looked fine. Everything was either equal or actually larger on the cluster at the concurrency level before the cliff (scaled exactly to the delta in throughput). The blocking profiles also looked similar until we ignored blocking in Go select blocks. Then we saw this:

600 concurrency

[screenshot]

700 concurrency

[screenshot]


On the very left of the 700-concurrency flame graph, we can see that 4.22% of the blocking is in StmtBuf.curCmd. The same method is responsible for only 0.14% of blocking in the 600-concurrency case. StmtBuf.curCmd is called by a SQL connection's connExecutor goroutine when it is ready for another statement; the method waits for the corresponding pgwire goroutine to populate the buffer with a new statement. I think this indicates some slowdown on the workload/gateway connection. It's possible that we're hitting a connection limit somewhere in the network stack that's causing a slowdown there. It's also possible that this is a slowdown in the kv workload itself. We're going to continue digging in there.
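For context, here's a minimal sketch of the handoff being described (a toy version, not the actual pgwire/connExecutor code; the real StmtBuf uses a shared slice guarded by a condition variable, as described further down in this thread):

```go
package main

import (
	"fmt"
	"sync"
)

// cmdBuf is a toy stand-in for StmtBuf: a producer goroutine (the pgwire
// connection) appends commands to a shared slice, and a consumer goroutine
// (the connExecutor) blocks in curCmd until the next command is available.
type cmdBuf struct {
	mu   sync.Mutex
	cond *sync.Cond
	cmds []string
	pos  int
}

func newCmdBuf() *cmdBuf {
	b := &cmdBuf{}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// push is the producer side: append a command and wake the consumer.
func (b *cmdBuf) push(cmd string) {
	b.mu.Lock()
	b.cmds = append(b.cmds, cmd)
	b.mu.Unlock()
	b.cond.Signal()
}

// curCmd is the consumer side: block until a command past pos exists.
// The Cond.Wait here is the kind of blocking the profile is pointing at.
func (b *cmdBuf) curCmd() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.pos >= len(b.cmds) {
		b.cond.Wait()
	}
	cmd := b.cmds[b.pos]
	b.pos++
	return cmd
}

func main() {
	b := newCmdBuf()
	go func() {
		b.push("SELECT 1")
		b.push("SELECT 2")
	}()
	fmt.Println(b.curCmd())
	fmt.Println(b.curCmd())
}
```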

@m-schneider is also going to run the same experiment on AWS to see if she's able to reproduce the results.

@m-schneider
Contributor

After taking another look, this looks very reproducible on a 16-node n1-highcpu-16 cluster. After seeing the same behavior as @benesch, I ran a version of the test that incremented concurrency a bit more slowly and found that the drop-off is generally around a concurrency of 650:
[image]

Then I tried the same test with a 24-node cluster and also saw a very similar drop-off, though marginally later:
[image]

@nvanbenschoten
Member

I think this indicates some slowdown on the workload/gateway connection. It's possible that we're hitting a connection limit somewhere in the network stack that's causing some slowdown there.

@a-robinson you have a lot of knowledge about networking. I'm interested in how you'd recommend investigating something like this.

@petermattis
Collaborator

Is the worker machine overloaded or at capacity? Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in StmtBuf.curCmd? We could add some instrumentation to check what that latency is.

I've used iftop in the past to look at network stats on the machine.
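A typical invocation might look something like this (the interface name is machine-specific and the flags are just one reasonable choice):

$ iftop -i eth0 -n -P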

@nvanbenschoten
Member

Is the worker machine overloaded or at capacity?

No, at the higher concurrency the CPU utilization actually drops because of the reduced throughput.

Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in StmtBuf.curCmd?

That's possible, although it would surprise me if such an issue reliably only showed up above a very specific concurrency threshold.

I've used iftop in the past to look at network stats on the machine.

Thanks for the tip!

@nvanbenschoten
Member

I spent a bit more time comparing CPU profiles at concurrency levels below the cliff with those above it. One thing that jumped out was the increase in time spent in runtime.schedule->runtime.findrunnable. Below the cliff this accounts for 5-6% of the profile; above it, that number doubles to around 11%. This is pretty significant, although I don't know exactly what to make of it right now. We expect the number of goroutines to grow proportionally with the number of SQL connections, but runtime.findrunnable doubling due to a ~20% increase in SQL connections seems awfully suspicious.

On a whim, I began tweaking StmtBuf to see if we were on to anything with the blocking in StmtBuf.curCmd. StmtBuf currently uses a slice that's shared between the producer (pgwire conn goroutine) and consumer (executor goroutine). It uses a condition variable to signal updates to the buffer and coordinate between the two goroutines. One theory I had was that this cond var may be sub-optimal in terms of quickly preempting the producer and scheduling the consumer whenever it is signalled. I ran an experiment where I switched the StmtBuf to use a buffered channel instead of the condition variable: nvanbenschoten@7408539. I like the change as it both simplifies the code and comes off as more idiomatic, but unfortunately it didn't actually have any impact on performance. The blocking contribution from StmtBuf.curCmd's Cond.Wait call was simply replaced by roughly the same contribution from StmtBuf.curCmd's new runtime.selectgo call. It's possible that making this an unbuffered channel would have an effect, but that wasn't as easy a change.
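A rough sketch of the shape of that experiment (not the actual patch in nvanbenschoten@7408539), to make the producer/consumer swap concrete:

```go
package main

import "fmt"

// chanCmdBuf replaces the slice-plus-condition-variable coordination with a
// buffered channel: push becomes a channel send and curCmd a channel receive,
// so the blocking formerly attributed to Cond.Wait simply moves into the
// channel operations instead.
type chanCmdBuf struct {
	cmds chan string
}

func (b *chanCmdBuf) push(cmd string) { b.cmds <- cmd }   // pgwire goroutine
func (b *chanCmdBuf) curCmd() string  { return <-b.cmds } // executor goroutine

func main() {
	// Buffered, as in the experiment; an unbuffered channel would force a
	// direct handoff between the two goroutines.
	b := &chanCmdBuf{cmds: make(chan string, 16)}
	go func() {
		b.push("SELECT 1")
		b.push("SELECT 2")
	}()
	fmt.Println(b.curCmd())
	fmt.Println(b.curCmd())
}
```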

I also tested with calling runtime.Gosched immediately after signalling the condition variable, but that again had no effect.

I'm interested to see the results of @m-schneider's investigation into whether this is new since the connExecutor refactor. That will give us some indication of whether this is a scheduling issue at the pgwire/connExecutor boundary or whether it might be a scheduling issue at the network/pgwire boundary.

@benesch
Contributor Author

benesch commented Aug 29, 2018 via email

@nvanbenschoten
Member

This seems somewhat related to golang/go#18237 and https://groups.google.com/forum/#!topic/golang-nuts/6zKXeCoT2LM. My main takeaway from this so far is that having goroutine coordination on the hot path of SQL statement execution is unfortunate. I wonder how hard it would be to rip out the second goroutine for the sake of experimentation.

cc. @andreimatei.

@nvanbenschoten
Member

One other interesting experiment would be short-circuiting everything. If
you hacked up your cockroach binary to return “no results” for every
external query instantly, you’d be able to zero in on whether this is a
network problem, load generator problem, or actual Cockroach problem.

That's an interesting idea which is worth exploring.

@andreimatei
Contributor

cc @andy-kimball as he was also looking at the impact of that goroutine bounce in the executor a while ago.

@petermattis
Collaborator

One other interesting experiment would be short-circuiting everything. If
you hacked up your cockroach binary to return “no results” for every
external query instantly, you’d be able to zero in on whether this is a
network problem, load generator problem, or actual Cockroach problem.

Short-circuiting everything can prevent the load generator from running (i.e. you need CREATE TABLE to succeed). In the past, what I had to do was add a simple textual recognition of the load generator query and short-circuit execution of only that query. It was super interesting to see how fast pgwire worked. Definitely worthwhile to do this again with the new conn executor infrastructure.
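A sketch of what that textual recognition might look like (the matched statement text is hypothetical; the real hack would match whatever the load generator actually sends and leave setup statements alone):

```go
package main

import (
	"fmt"
	"strings"
)

// shortCircuit reports whether a query should be answered immediately with an
// empty result instead of being executed. Setup statements such as CREATE
// TABLE fall through and run normally.
func shortCircuit(query string) bool {
	return strings.HasPrefix(query, "SELECT k, v FROM kv WHERE k IN")
}

func main() {
	fmt.Println(shortCircuit("SELECT k, v FROM kv WHERE k IN ($1)"))          // true: skip execution
	fmt.Println(shortCircuit("CREATE TABLE kv (k INT PRIMARY KEY, v BYTES)")) // false: must still succeed
}
```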

@m-schneider
Contributor

After running the test on older versions of cockroach, it looks like the cliff already existed as early as the Dec 18th alpha.
[image]
However, even though overall throughput was lower in the October alpha, there doesn't seem to be an actual cliff:
[image]

Will keep bisecting.

@m-schneider
Contributor

After git bisecting, the improvement in throughput can be attributed to bcbde02. We're currently adding metrics to see if the new data structure could also be causing the cliff.

@m-schneider
Contributor

After instrumenting the timestamp cache, there doesn't seem to be any increase in latency coming from the skiplist implementation with increased concurrency:

[image]

@nvanbenschoten
Member

I spent some time looking into this after observing poor throughput scaling in sysbench as concurrency grew.

I was able to reproduce the behavior observed here by spinning up a 16 node cluster with n1-highcpu-16 machines. Instead of using roachprod test, I ran roachprod run <cluster>:17 -- ./workload run kv --init --read-percent=100 --write-seq=1000000 --splits=1000 --concurrency=<concurrency> --duration=5m {pgurl:1-16} directly, and alternated between a concurrency of 400 and 4000. When concurrency was at 400, throughput hovered around 200k. When concurrency was at 4000, throughput dropped to 120k. Future experimentation here should use this approach, as it makes it much easier to switch back and forth between the good and bad performance regimes.

I started by looking at performance profiles. The CPU profile showed very little difference between the two runs. Neither did the heap profile. Blocking profiles showed extra blocking in StmtBuf.curCmd, as we observed above. Mutex profiles were inconsistent and I didn't get any real insight out of them.

I then turned to the Go execution tracer. This is where things got interesting. The first execution trace is with --concurrency=400. The second is with --concurrency=4000.

[screenshot]

concurrency=400

[screenshot]

concurrency=4000

There are a number of things that jump out from these traces:

  • processor utilization is severely degraded in the second trace
  • the GC is significantly longer in the first trace (because of the higher throughput)
  • there are more runnable goroutines and running threads in the first trace
  • there are more timer events and syscalls in the first trace
  • there are significantly more network events in the second trace

Out of these two traces, it's pretty clear that the qualities of the first are more conducive to high throughput. So what gives? Why the degraded processor utilization in the second trace?

I turned to the execution trace's built-in profiles to try to answer this. My first stop was the network blocking profile. This didn't provide too much insight. Both profiles showed 100% of network blocking in netFD.Read. There was one difference though. In the good profile (conc=400), 93% of this blocking was attributed to the pgwire connection while 7% was attributed to the inter-node gRPC connection. In the bad profile (conc=4000), 100% was attributed to the pgwire connection.

My next stop was the execution trace's scheduler latency profile. This was more interesting. The good profile attributed 66% of scheduler latency to Mutex unlocking. The two main callers of this were Clock.Now (37%) and Clock.PhysicalNow (20%, avoidable, see #30520). The rest of the latency was attributed to runtime.netpolldeadlineimpl's call to runtime.ready. Things were completely different in the bad profile. Over 50% of the latency was attributed to runtime.selectnbsend, whose primary caller was controlBuffer.executeAndPut. If we focus in on this function, we see that its main caller (~70% of calls) is http2Client.NewStream, which is called exclusively by roachpb/internalClient.Batch. I went back to the good profile and found that selectnbsend is attributed with only 0.032% of scheduler latency there.

The differences here are stark and I think it's safe to conclude that this is contributing to the reduced throughput with higher client concurrency. The question now is what to do about it. I still need to look more at this to understand why we only see the issue with a higher concurrency, but the first thing that comes to mind is that we create a separate gRPC stream for each internalClient.Batch call. This is something @a-robinson and @petermattis have considered in the past, so I'm curious if they have any insight here.

Another interesting thing to note is that we can see in controlBuffer.executeAndPut that it only sends on controlBuffer.ch if consumerWaiting == true. I think we might be hitting some inflection point where consumers start waiting on the buffer, which causes things to slow down even further. This would explain the cliff-like behavior we see in baseline throughput. That's just speculation, though.
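For reference, the signaling pattern being described looks roughly like this (a simplified sketch, not gRPC's actual controlBuffer code):

```go
package main

import (
	"fmt"
	"sync"
)

// wakeBuf mimics the pattern: the producer only touches the wake-up channel
// when a consumer has flagged itself as waiting, so the non-blocking send
// (the runtime.selectnbsend seen in the profiles) is skipped on the fast path.
type wakeBuf struct {
	mu              sync.Mutex
	items           []string
	consumerWaiting bool
	ch              chan struct{}
}

func newWakeBuf() *wakeBuf {
	return &wakeBuf{ch: make(chan struct{}, 1)}
}

// put is the executeAndPut-like producer side.
func (b *wakeBuf) put(item string) {
	b.mu.Lock()
	b.items = append(b.items, item)
	wakeup := b.consumerWaiting
	b.consumerWaiting = false
	b.mu.Unlock()
	if wakeup {
		select {
		case b.ch <- struct{}{}: // wake the sleeping consumer
		default:
		}
	}
}

// get is the consumer side: drain items, and only sleep on ch when empty.
func (b *wakeBuf) get() string {
	for {
		b.mu.Lock()
		if len(b.items) > 0 {
			item := b.items[0]
			b.items = b.items[1:]
			b.mu.Unlock()
			return item
		}
		b.consumerWaiting = true
		b.mu.Unlock()
		<-b.ch
	}
}

func main() {
	b := newWakeBuf()
	go b.put("HEADERS frame")
	fmt.Println(b.get())
}
```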

I've included the two scheduler profiles here: scheduler_profiles.zip.

@nvanbenschoten
Member

My next step is to learn how all this controlBuffer stuff works and how it interacts with gRPC's loopyWriter.

@petermattis
Collaborator

I believe @a-robinson investigated having a pool of streams for internalClient.Batch calls, but there is some difficulty in properly handling context cancellation or something like that. I think this is documented somewhere, though I can't find it right now.

The loopyWriter stuff is all new since the last time I looked at gRPC. The signaling in controlBuffer.executeAndPut and controlBuffer.get is interesting. They seem to be using controlBuffer.ch as a condition variable. I wonder if there would be a benefit to actually using a condition variable here. You re-did the benchmark recently showing a condition variable to be faster than channels, right?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 1, 2018
In cockroachdb#26178, we saw that throughput hit a cliff while running `kv` at
high concurrency levels. We spent a while debugging the issue, but
nothing stood out in the `cockroach` process. Eventually I installed
pprof http handlers in `workload` (cockroachdb#30810). The CPU and heap profiles
looked fine but the mutex profile revealed that **99.94%** of mutex
contention was in `sql.(*Rows).Next`.

It turns out that this method manipulates a lock that's scoped to
the same degree as its prepared statement. Since `readStmt` was
prepared on the `sql.DB`, all kvOps were contending on the same lock
in `sql.(*Rows).Next`.

The fix is to give each `kvOp` its own `sql.Conn` and prepare the
statement with a connection-level scope. There are probably other areas
in `workload` that could use the same kind of change.

Before this change, `kv100 --concurrency=400` in the configuration
discussed in cockroachdb#26178 topped out at around 80,000 qps. After this change,
it tops out at around 250,000 qps.

Release note: None
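For anyone who wants the essence of that change in plain database/sql terms, here's a minimal sketch (the connection string, table, query, and driver choice are all illustrative):

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any postgres-compatible driver works for this sketch
)

func main() {
	ctx := context.Background()

	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/kv?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Before: one statement prepared on the *sql.DB and shared by every worker.
	// Iterating its rows goes through state scoped to that shared statement, so
	// all workers end up contending in sql.(*Rows).Next.
	sharedStmt, err := db.PrepareContext(ctx, "SELECT v FROM kv WHERE k = $1")
	if err != nil {
		log.Fatal(err)
	}
	defer sharedStmt.Close()

	// After: each worker checks out its own *sql.Conn and prepares the
	// statement at connection scope, so there is nothing shared across
	// workers to contend on.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	perConnStmt, err := conn.PrepareContext(ctx, "SELECT v FROM kv WHERE k = $1")
	if err != nil {
		log.Fatal(err)
	}
	defer perConnStmt.Close()
}
```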
@nvanbenschoten
Member

They seem to be using controlBuffer.ch as a condition variable. I wonder if there would be a benefit to actually using a condition variable here. You re-did the benchmark recently showing a condition variable to be faster than channels, right?

I looked into this and found that they can't easily use a cond variable there because they also want to wait on a context cancellation. Without adjusting other layers to catch the context cancellation and without golang/go#16620, there's not an easy path to making the replacement.

But that doesn't matter now because... here's the fix: #30811.

@nvanbenschoten
Member

nvanbenschoten commented Oct 1, 2018

To wrap this all up, it turns out that CPU utilization was low because the client was bottlenecking itself. There wasn't anything going wrong in cockroach itself; it just wasn't getting a lot of traffic.

I began suspecting the client after running Cockroach with GODEBUG=schedtrace=1000 and seeing:

  • fewer threads with the higher concurrency
  • more idle threads with higher concurrency
  • a near-empty global run queue at both concurrency levels
  • empty local run queues at both concurrency levels

For instance:

concurrency=400

SCHED 58482ms: gomaxprocs=16 idleprocs=0 threads=47 spinningthreads=3 idlethreads=9 runqueue=1 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

concurrency=4000

SCHED 21150ms: gomaxprocs=16 idleprocs=13 threads=43 spinningthreads=0 idlethreads=17 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
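(These SCHED lines are the Go runtime's scheduler trace, which is printed to stderr once per interval; they can be captured with something like the following, with the actual start flags elided.)

$ GODEBUG=schedtrace=1000 ./cockroach start <start flags> 2> schedtrace.log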

Combined with the execution traces above, it became apparent that goroutines weren't taking a particularly long time to be scheduled; there just weren't many goroutines to schedule.

@tbg
Member

tbg commented Oct 1, 2018

Good job. Lots of prime lunch and learn material, too.

craig bot pushed a commit that referenced this issue Oct 1, 2018
30811: workload: give each kvOp a separate sql.Conn r=nvanbenschoten a=nvanbenschoten

In #26178, we saw that throughput hit a cliff while running `kv` at
high concurrency levels. We spent a while debugging the issue, but
nothing stood out in the `cockroach` process. Eventually, I installed
pprof http handlers in `workload` (#30810). The CPU and heap profiles
looked fine but the mutex profile revealed that **99.94%** of mutex
contention was in `sql.(*Rows).Next`.

It turns out that this method manipulates a lock that's scoped to
the same degree as its prepared statement. Since `readStmt` was
prepared on the `sql.DB`, all kvOps were contending on the same lock
in `sql.(*Rows).Next`.

The fix is to give each `kvOp` its own `sql.Conn` and prepare the
statement with a connection-level scope. There are probably other places
in `workload` that could use the same kind of change.

Before this change, `kv100 --concurrency=400` in the configuration
discussed in #26178 topped out at around 80,000 qps. After this change,
it tops out at around 250,000 qps.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>