perf: kv falls off a cliff at ~12 nodes #26178
@m-schneider and I have our first lead. All of the graphs on the admin UI and the CPU and heap profiles looked fine. Everything was either equal or actually larger on the cluster running at the pre-cliff concurrency (scaled exactly to the delta in throughput). The blocking profiles also looked similar until we ignored blocking in Go select blocks. Then we saw this:

[blocking flame graph: 600 concurrency]
[blocking flame graph: 700 concurrency]

On the very left of the 700 concurrency flame graph we can see that 4.22% of the blocking is in `StmtBuf.curCmd`. @m-schneider is also going to run the same experiment on AWS to see if she's able to reproduce the results.
After taking another look, this looks very reproducible on a 16 node n1-highcpu-16 cluster. After seeing the same behavior as @benesch, I ran a version of the test that incremented concurrency a bit more slowly and found that the drop-off is generally around a concurrency of 650. I then tried the same test with a 24 node cluster and saw a very similar drop-off, though marginally later.
@a-robinson, you have a lot of knowledge about networking. I'm interested in how you'd recommend investigating something like this.
Is the worker machine overloaded or at capacity?

Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in the statement buffer?

I've used
No, at the higher concurrency the CPU utilization actually drops because of the reduced throughput.
That's possible, although it would surprise me if such an issue reliably only showed up above a very specific concurrency threshold.
Thanks for the tip!
I spent a bit more time looking at CPU profiles between concurrency levels beneath this cliff and concurrency levels above the cliff. One thing that jumped out was the increase in time spent in `runtime.schedule` -> `runtime.findrunnable`. Below this cliff this accounts for 5-6% of the profile; above it, that number doubles to around 11%. This is pretty significant, although I don't know exactly what to make of it right now. We expect the number of goroutines to grow proportionally with the number of SQL connections, but `runtime.findrunnable` doubling due to a ~20% increase in SQL connections seems awfully suspicious.

On a whim, I began tweaking `StmtBuf` to see if we were on to anything with the blocking in `StmtBuf.curCmd`. `StmtBuf` currently uses a slice that's shared between the producer (pgwire conn goroutine) and consumer (executor goroutine). It uses a condition variable to signal updates to the buffer and coordinate between the two goroutines. One theory I had was that this cond var may be sub-optimal in terms of quickly preempting the producer and scheduling the consumer whenever it is signalled. I ran an experiment where I switched the `StmtBuf` to use a buffered channel instead of the condition variable: nvanbenschoten@7408539. I like the change as it both simplifies the code and comes off as more idiomatic, but unfortunately it didn't actually have any impact on performance. The blocking contribution due to `StmtBuf.curCmd`'s `Cond.Wait` call was simply replaced by roughly the same contribution from `StmtBuf.curCmd`'s new `runtime.selectgo` call. It's possible that making this an unbuffered channel would have an effect, but that wasn't as easy of a change.

I also tested with calling `runtime.Gosched` immediately after signalling the condition variable, but that again had no effect.

I'm interested to see the results of @m-schneider's investigation into whether this is new since the `connExecutor` refactor. This will give us some indication of whether this is a scheduling issue at the pgwire/connExecutor boundary or whether it might be a scheduling issue at the network/pgwire boundary.
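For illustration, here is a minimal sketch of the shape of that experiment, not the actual `StmtBuf` code: a producer/consumer command buffer where a buffered channel takes the place of a slice guarded by a `sync.Cond`. The type and method names are simplified stand-ins.

```go
package main

import "fmt"

// cmdBuf is a hypothetical stand-in for a statement buffer: the pgwire
// connection goroutine produces commands and the executor goroutine
// consumes them. A buffered channel replaces a shared slice guarded by
// a sync.Cond; a send implicitly wakes the waiting consumer.
type cmdBuf struct {
	cmds chan string
}

func newCmdBuf(size int) *cmdBuf {
	return &cmdBuf{cmds: make(chan string, size)}
}

// push is called by the producer (the pgwire conn goroutine).
func (b *cmdBuf) push(cmd string) {
	b.cmds <- cmd
}

// curCmd is called by the consumer (the executor goroutine); it blocks
// in a channel receive (runtime.chanrecv/selectgo) rather than in
// sync.Cond.Wait.
func (b *cmdBuf) curCmd() (string, bool) {
	cmd, ok := <-b.cmds
	return cmd, ok
}

func (b *cmdBuf) close() { close(b.cmds) }

func main() {
	buf := newCmdBuf(16)
	go func() {
		for i := 0; i < 3; i++ {
			buf.push(fmt.Sprintf("stmt-%d", i))
		}
		buf.close()
	}()
	for {
		cmd, ok := buf.curCmd()
		if !ok {
			break
		}
		fmt.Println("executing", cmd)
	}
}
```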
One other interesting experiment would be short-circuiting everything. If you hacked up your cockroach binary to return “no results” for every external query instantly, you’d be able to zero in on whether this is a network problem, load generator problem, or actual Cockroach problem.
This seems somewhat related to golang/go#18237 and https://groups.google.com/forum/#!topic/golang-nuts/6zKXeCoT2LM. My main takeaway from this so far is that having goroutine coordination on the hot path of SQL statement execution is unfortunate. I wonder how hard it would be to rip out the second goroutine for the sake of experimentation. cc @andreimatei.
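To make the cost of that coordination concrete, here is a rough, hypothetical Go benchmark (not from the codebase) contrasting doing work inline with handing each unit of work to a second goroutine over channels; the latter pays a channel send/receive and a goroutine wakeup per operation, which is the kind of overhead that ends up on the hot path described above.

```go
package handoff

import "testing"

// BenchmarkInline performs the work directly on the calling goroutine.
func BenchmarkInline(b *testing.B) {
	sum := 0
	for i := 0; i < b.N; i++ {
		sum += i
	}
	_ = sum
}

// BenchmarkHandoff sends each unit of work to a second goroutine and
// waits for the result, paying two channel operations and a goroutine
// wakeup per iteration.
func BenchmarkHandoff(b *testing.B) {
	work := make(chan int)
	done := make(chan int)
	go func() {
		for i := range work {
			done <- i // "execute" the work and hand the result back
		}
		close(done)
	}()
	sum := 0
	for i := 0; i < b.N; i++ {
		work <- i
		sum += <-done
	}
	close(work)
	_ = sum
}
```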
That's an interesting idea which is worth exploring.
cc @andy-kimball as he was also looking at the impact of that goroutine bounce in the executor a while ago.
Short-circuiting everything can prevent the load generator from running (i.e. you need
After git bisecting, the improvement in throughput can be attributed to bcbde02. We're currently adding metrics to see if the new data structure could also be causing the cliff.
I spent some time looking into this after observing poor throughput scaling in sysbench as concurrency grew. I was able to reproduce the behavior observed here by spinning up a 16 node cluster with n1-highcpu-16 machines. Instead of using

I started by looking at performance profiles. The CPU profile showed very little difference between the two runs, and neither did the heap profile. Blocking profiles showed extra blocking in

I then turned to the Go execution tracer. This is where things got interesting. The first execution trace is with concurrency=400; the second is with concurrency=4000.

[execution trace: concurrency=400]
[execution trace: concurrency=4000]

There are a number of things that jump out from these traces:
Out of these two traces, it's pretty clear that the qualities of the first are more conducive to high throughput. So what gives? Why the degraded processor utilization in the second trace? I turned to the execution trace's built-in profiles to try to answer this.

My first stop was the network blocking profile. This didn't provide too much insight. Both profiles showed 100% of network blocking in

My next stop was the execution trace's scheduler latency profile. This was more interesting. The good profile attributed 66% of scheduler latency to Mutex unlocking. The two main callers of this were

The differences here are stark, and I think it's safe to conclude that this is contributing to the reduced throughput with higher client concurrency. The question now is what to do about it. I still need to look more at this to understand why we only see the issue at higher concurrency, but the first thing that comes to mind is that we create a separate gRPC stream for each

Another interesting thing to note is that we can see in

I've included the two scheduler profiles here: scheduler_profiles.zip.
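For reference, a minimal sketch of capturing an execution trace with `runtime/trace`; the thread doesn't say exactly how these traces were collected, so treat this as one possible approach. The resulting file can be opened with `go tool trace trace.out` to see per-processor utilization, goroutine analysis, and the scheduler latency profile.

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"sync"
)

func main() {
	// Write an execution trace to trace.out, then inspect it with
	// `go tool trace trace.out`.
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// Stand-in for the real work: many goroutines handing results back
	// over a channel, which is the kind of activity that shows up as
	// chanrecv/selectgo blocking and scheduler latency in the trace.
	var wg sync.WaitGroup
	results := make(chan int, 100)
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results <- i * i
		}(i)
	}
	wg.Wait()
	close(results)
	for range results {
	}
}
```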
My next step is to learn how all this
I believe @a-robinson investigated having a pool of streams for The
In cockroachdb#26178, we saw that throughput hit a cliff while running `kv` at high concurrency levels. We spent a while debugging the issue, but nothing stood out in the `cockroach` process. Eventually I installed pprof http handlers in `workload` (cockroachdb#30810). The CPU and heap profiles looked fine but the mutex profile revealed that **99.94%** of mutex contention was in `sql.(*Rows).Next`. It turns out that this method manipulates a lock that's scoped to the same degree as its prepared statement. Since `readStmt` was prepared on the `sql.DB`, all kvOps were contending on the same lock in `sql.(*Rows).Next`. The fix is to give each `kvOp` its own `sql.Conn` and prepare the statement with a connection-level scope. There are probably other areas in `workload` that could use the same kind of change. Before this change, `kv100 --concurrency=400` in the configuration discussed in cockroachdb#26178 topped out at around 80,000 qps. After this change, it tops out at around 250,000 qps. Release note: None
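A minimal sketch of the approach described in that change, with hypothetical names (the `worker` type, the `kv` table schema, and the connection string are assumptions, not the actual `workload` code): each worker takes its own `*sql.Conn` from the pool and prepares its read statement on that connection, so the prepared statement's internal lock is no longer shared across all workers.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-compatible driver
)

// worker mirrors the idea behind giving each kvOp its own connection:
// the prepared statement is scoped to a single *sql.Conn, so concurrent
// workers no longer contend on one statement-level lock in database/sql.
type worker struct {
	conn     *sql.Conn
	readStmt *sql.Stmt
}

func newWorker(ctx context.Context, db *sql.DB) (*worker, error) {
	conn, err := db.Conn(ctx) // dedicated connection for this worker
	if err != nil {
		return nil, err
	}
	// Prepare on the connection, not on the *sql.DB.
	stmt, err := conn.PrepareContext(ctx, `SELECT v FROM kv WHERE k = $1`)
	if err != nil {
		return nil, err
	}
	return &worker{conn: conn, readStmt: stmt}, nil
}

func (w *worker) read(ctx context.Context, k int64) (string, error) {
	var v string
	err := w.readStmt.QueryRowContext(ctx, k).Scan(&v)
	return v, err
}

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/kv?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	w, err := newWorker(ctx, db)
	if err != nil {
		log.Fatal(err)
	}
	if v, err := w.read(ctx, 1); err == nil {
		log.Println("read", v)
	}
}
```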
I looked into this and found that they can't easily use a cond variable there because they also want to wait on a context cancellation. Without adjusting other layers to catch the context cancellation and without golang/go#16620, there's not an easy path to making the replacement. But that doesn't matter now because... here's the fix: #30811.
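As an aside on why a cond variable doesn't fit in that situation: `sync.Cond.Wait` can't be interrupted by a `Context`, whereas a channel receive can be combined with `ctx.Done()` in a `select`. A generic sketch (not the `database/sql` internals):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForItem blocks until an item arrives or the context is canceled.
// A sync.Cond can't express this, because Cond.Wait has no way to wake
// up on ctx.Done(); a channel plus select handles both cases.
func waitForItem(ctx context.Context, items <-chan int) (int, error) {
	select {
	case item := <-items:
		return item, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

func main() {
	items := make(chan int, 1)
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if _, err := waitForItem(ctx, items); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("gave up waiting:", err)
	}
}
```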
To wrap this all up, it turns out that CPU utilization was low because the client was bottlenecking itself. There wasn't anything going wrong in `cockroach` itself. I began suspecting the client after running Cockroach with

For instance:

[output at concurrency=400]
[output at concurrency=4000]

Combined with the execution traces above, it became apparent that goroutines weren't taking a particularly long time to be scheduled; there just weren't many goroutines to schedule.
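For completeness, a sketch of the kind of instrumentation that surfaced the `sql.(*Rows).Next` contention: Go's mutex profile is disabled by default and has to be turned on before the pprof endpoint reports anything. The port and sampling rate here are arbitrary, and this is not the actual `workload` change.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/mutex
	"runtime"
)

func main() {
	// Sample roughly 1 out of every 5 mutex contention events. Without
	// this call the mutex profile is always empty.
	runtime.SetMutexProfileFraction(5)

	// Serve the profiles on a side port; while the load generator runs:
	//   go tool pprof 'http://localhost:33333/debug/pprof/mutex'
	go func() {
		log.Println(http.ListenAndServe("localhost:33333", nil))
	}()

	select {} // stand-in for the load generator's main loop
}
```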
Good job. Lots of prime lunch and learn material, too. |
30811: workload: give each kvOp a separate sql.Conn r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@petermattis, from Slack:
(We expect the throughput to smoothly level off while the latency increases.)