perf: investigate gRPC packeting optimizations #17370
Could you add more explicit details here? I'm not yet familiar with the code paths where this would have a measurable effect, or even how I'd go about measuring it (exec latencies? RTT latencies?). I also recall some prior experiments you ran investigating this that effectively reduced {99,95}-th percentile latencies; if there's anything relevant from those, it'd be nice to add it here. |
See https://github.com/grpc/grpc-go/blob/master/transport/http2_server.go#L816 and https://github.com/grpc/grpc-go/blob/master/transport/http2_client.go#L681. Reducing the number of packets (or perhaps the real benefit is from reducing the number of system calls) yields measurably lower latencies, which directly impact performance. Also see grpc/grpc-go#1373, which demonstrated some of the benefits but had a deadlock scenario. What I'd like to explore here is even more aggressively reducing flushes when there are multiple goroutines sending concurrently. The tricky part is how to do this without introducing deadlock scenarios caused by the application-level flow control. But perhaps experimental code doesn't need to be perfect in order to show whether further investment is worthwhile. |
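To sketch what "reducing flushes when multiple goroutines send concurrently" could look like, here is a minimal Go pattern — hypothetical names, not grpc-go internals, and it deliberately ignores the flow-control deadlock concern above. Each sender announces itself before taking the write lock, and only the last writer in a convoy pays for the flush syscall:

```go
package main

import (
	"bufio"
	"net"
	"sync"
	"sync/atomic"
)

// coalescingWriter lets many goroutines write frames to one connection while
// coalescing their flushes: a sender skips the flush if another sender has
// already announced it is about to write, so back-to-back responses share a
// single syscall.
type coalescingWriter struct {
	pending int32 // senders that have entered Write but not yet written
	mu      sync.Mutex
	bw      *bufio.Writer
}

func newCoalescingWriter(c net.Conn) *coalescingWriter {
	return &coalescingWriter{bw: bufio.NewWriter(c)}
}

func (w *coalescingWriter) Write(frame []byte) error {
	atomic.AddInt32(&w.pending, 1)
	w.mu.Lock()
	defer w.mu.Unlock()
	others := atomic.AddInt32(&w.pending, -1) // senders still queued behind us
	if _, err := w.bw.Write(frame); err != nil {
		return err
	}
	if others == 0 {
		// Nobody else is about to write; we pay for the single flush.
		return w.bw.Flush()
	}
	// Some later sender is guaranteed to reach this point and flush.
	return nil
}

func main() {
	c1, c2 := net.Pipe()
	go func() { // drain the other end of the pipe
		buf := make([]byte, 1024)
		for {
			if _, err := c2.Read(buf); err != nil {
				return
			}
		}
	}()
	w := newCoalescingWriter(c1)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.Write([]byte("response"))
		}()
	}
	wg.Wait()
}
```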
In addition to the network latencies Cockroach itself measures, I also look at the packets/sec numbers that the Prometheus node exporter provides. For a given queries/sec, lower packets/sec is usually better. |
This matches what I saw previously. Will be interesting to see what happens if we perform less than 1 syscall per RPC. |
There's a bunch of other low-hanging fruit I'm picking through. Here … |
It seems that this is already the case-ish. It's done so by the use of … |
When I looked, the … PS: I see a lot of similarities between flushing the network connection and syncing the WAL. |
heh, grpc/grpc-go#1498 goes some way towards having a dedicated flushing goroutine. |
@irfansharif Yeah, I noticed that too. |
TL;DR? gRPC has very bad scalability with concurrent RPCs. We need to figure out what is going on and either fix it or replace it. Cc @spencerkimball, @bdarnell, @tschottdorf, @a-robinson, @nvanbenschoten. I extended the …
Here is the equivalent workload with the "x" (eXperimental) protocol:
I don't have an adequate explanation for the performance difference. "x" is much simpler than grpc and I took some pains to avoid unnecessary flushes of network data. But even accounting for that and trying to mimic the grpc behavior, "x" is still much faster. Note that 1000 concurrent client workers is extreme and the performance discrepancy doesn't scale down linearly. The following shows grpc scaling from 1 to 512 clients (in powers of 2):
And here is the same scaling with "x":
NB: I used 200-byte requests and responses because that is approximately the size of a … |
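A minimal sketch of the shape of such a load driver (hypothetical names, not the actual pinger code; doRPC stands in for either protocol under test):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// run drives doRPC from `workers` goroutines for duration d and prints the
// aggregate throughput, mirroring the 1..512-worker sweep above.
func run(workers int, d time.Duration, doRPC func(req []byte) error) {
	var ops int64
	req := make([]byte, 200) // ~200 bytes, the approximate request size used above
	deadline := time.Now().Add(d)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				if err := doRPC(req); err != nil {
					return
				}
				atomic.AddInt64(&ops, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%4d workers: %9.1f ops/sec\n", workers, float64(ops)/d.Seconds())
}

func main() {
	noopRPC := func(req []byte) error { return nil } // substitute a real grpc or "x" call
	for w := 1; w <= 512; w *= 2 { // powers of 2, as in the numbers above
		run(w, 5*time.Second, noopRPC)
	}
}
```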
Are you testing that against tip? I haven't checked our vendored grpc version, but I thought they had done some performance work (also doubt it'd catch up to X, though). Are you using single connections for both protocols? Is your pinger repo up to date? |
Yes, I'm testing against gRPC tip. The recent performance work provides a very small performance improvement.
Single connections for both protocols. There is a flag to allow using multiple connections, but gRPC doesn't show any benefit from doing that. Before doing this work on …
It is now (just pushed). |
Hmm, a small tweak to how "x" performs synchronization to more closely mimic grpc causes its performance to drop to almost exactly the same level as grpc...but only on my laptop. Testing between 2 linux boxes over a real network still shows "x" to significantly outperform grpc. The synchronization difference is in how grpc and "x" notify the "write loop". In grpc (on tip), this looks like:
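A minimal sketch of that channel-based pattern (hypothetical names, not grpc-go's actual code): senders append to a shared list and nudge the write loop over a 1-slot channel.

```go
package main

import "sync"

type item struct{ payload []byte }

// channelWriter mimics the grpc-style write loop: senders enqueue work and
// wake the write loop via a buffered channel.
type channelWriter struct {
	mu      sync.Mutex
	pending []item
	wakeup  chan struct{} // capacity 1
}

func newChannelWriter() *channelWriter {
	return &channelWriter{wakeup: make(chan struct{}, 1)}
}

func (w *channelWriter) send(it item) {
	w.mu.Lock()
	w.pending = append(w.pending, it)
	w.mu.Unlock()
	select {
	case w.wakeup <- struct{}{}: // wake the write loop
	default: // a wakeup is already queued
	}
}

// writeLoop drains whole batches and hands them to flush; one flush can cover
// many sends, which is where the packet savings come from.
func (w *channelWriter) writeLoop(flush func([]item)) {
	for range w.wakeup {
		w.mu.Lock()
		batch := w.pending
		w.pending = nil
		w.mu.Unlock()
		flush(batch)
	}
}

func main() {
	w := newChannelWriter()
	done := make(chan struct{})
	var once sync.Once
	go w.writeLoop(func(batch []item) {
		if len(batch) > 0 {
			once.Do(func() { close(done) })
		}
	})
	w.send(item{payload: []byte("ping")})
	<-done
}
```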
Essentially, grpc is using channels for synchronization.
Switching "x" to use |
Update: I've been able to slow down "x" between 2 Linux machines through the combination of the switch to using a grpc-style …
This is still faster than grpc, but I'm likely just missing some additional quota stuff (grpc has transport- and stream-level quota, as well as quota on the number of streams). For reference, this is the performance I was seeing earlier with the cond-var based synchronization and no quota pool:
|
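The quota pool being referred to is flow control: a counting semaphore over bytes that sends draw down and window updates replenish. A minimal sketch of the channel-based shape (hypothetical names, loosely modeled on grpc-go's design):

```go
package main

import "fmt"

// quotaPool is a counting semaphore over bytes. A 1-slot channel holds the
// currently available quota; acquirers block when it hits zero.
type quotaPool struct {
	avail chan int
}

func newQuotaPool(n int) *quotaPool {
	qp := &quotaPool{avail: make(chan int, 1)}
	qp.avail <- n
	return qp
}

// acquire blocks until some quota is available and takes up to want bytes,
// returning the amount actually granted.
func (qp *quotaPool) acquire(want int) int {
	q := <-qp.avail
	got := want
	if q < got {
		got = q
	}
	if rest := q - got; rest > 0 {
		qp.avail <- rest
	}
	return got
}

// release returns quota to the pool, e.g. when a WINDOW_UPDATE arrives.
func (qp *quotaPool) release(n int) {
	select {
	case q := <-qp.avail:
		qp.avail <- q + n
	default:
		qp.avail <- n
	}
}

func main() {
	qp := newQuotaPool(65536) // 65k initial window, as discussed below
	got := qp.acquire(200)    // a 200-byte request draws down the window
	fmt.Println("granted:", got)
	qp.release(got) // the window update returns it
}
```

Note that every acquire/release is a channel operation, so the pool is a synchronization point even when quota is never exhausted; per the measurements below, though, removing it entirely bought little, pointing at other synchronization as the bottleneck.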
The quota pool corresponds to (stream and/or connection-level) window sizes, right? I saw that you have both at 65k initially. Does anything change if you up that to a lot more (especially the connection based one) or is the culprit just the internal overhead of having the quota pools in the first place? |
I'm not sure what the culprit is. We're never getting close to the stream quota as we're sending 200-byte requests. We do seem to be hitting the connection quota at high concurrency, but alleviating the limit doesn't provide a significant performance boost:
I also tried disabling all of the quota code in grpc by commenting it out:
So clearly something else is still going on inside grpc that is limiting its performance. |
CPU profile isn't useful? |
Heh, I'm looping through all my tools. CPU profile didn't show anything interesting earlier, I'll look again. |
If that doesn't help, you could consider (and you probably have) strategic commenting out of parts of the quota handling code 🥇 |
You might have missed above, but I already did exactly that. |
Oh, (completely) misread that as "I also set the window size to basically infinity". |
Yeah, that'll be a good start. If we get a good enough picture of the workload/scenario we can update … |
The cRPC protocol is built on top of http2 using only http2 data frames. The core read and write loops are similar to gRPC, but trimmed down to their essence. A ton of stuff is missing before this could actually be used, such as context cancellation propagation, trace propagation, chunking of large requests/responses into frames, and proper error handling. This PR was done as a proof of concept to see what performance impact replacing gRPC with something else could achieve.

On a single-node cluster with the local server optimization disabled, gRPC on a read-mostly workload shows:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0         111904        11189.4      1.4      1.2      3.0      5.8     18.9

cRPC:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0         183923        18389.3      0.9      0.7      1.8      3.7     62.9

Unfortunately, the effects on a multi-node cluster are less dramatic. I think it would be possible to make cRPC wire compatible with gRPC. This would require sending a headers frame with every RPC, but the headers are small (7 bytes) and constant for this use case. We'd also have to send a settings frame as the first frame on the connection, but that is trivial. See cockroachdb#17370.
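For context on framing costs: every HTTP/2 frame, DATA frames included, starts with a fixed 9-byte header (RFC 7540 §4.1). A sketch of encoding a DATA frame header by hand (a real implementation would more likely use the Framer from golang.org/x/net/http2):

```go
package main

import "fmt"

// appendDataFrameHeader appends the 9-byte HTTP/2 frame header for a DATA
// frame: 24-bit payload length, 8-bit type, 8-bit flags, 31-bit stream ID.
func appendDataFrameHeader(buf []byte, payloadLen int, streamID uint32, endStream bool) []byte {
	var flags byte
	if endStream {
		flags |= 0x1 // END_STREAM
	}
	return append(buf,
		byte(payloadLen>>16), byte(payloadLen>>8), byte(payloadLen), // length
		0x0,   // type: DATA
		flags, // flags
		byte(streamID>>24)&0x7f, byte(streamID>>16), byte(streamID>>8), byte(streamID), // stream ID, high bit reserved
	)
}

func main() {
	hdr := appendDataFrameHeader(nil, 200, 1, false)
	fmt.Printf("% x\n", hdr) // 00 00 c8 00 00 00 00 00 01
}
```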
24883: dep: Bump grpc-go version r=a-robinson a=a-robinson

And pull in new packages -- in particular, the encoding/proto package isn't needed for compilation but is needed at runtime.

Release note: None

---------------------

We should wait to merge this until they've cut the 1.12 release tag, so that we aren't just at an arbitrary commit in their git history, but I'm sending out the PR now so that I (or whoever would have done this) don't have to deal with debugging the missing encoding/proto package when it comes time to merge. As tested in #17370 (comment), this gives a 5-10% boost in whole-cluster throughput and improved tail latencies when run with a highly concurrent workload. It appears to have little performance effect for lower-concurrency workloads.

25410: sql: run schema changes after CREATE TABLE in txn r=vivekmenezes a=vivekmenezes

Includes a commit from #25362 and should be reviewed after that change.

25612: util: fix `retry.WithMaxAttempts` context cancelled before run. r=windchan7 a=windchan7

If the context gets cancelled right after `retry.WithMaxAttempts` starts, the function passed to it would never get run. Now `retry.WithMaxAttempts` runs the function at least once; otherwise an error is returned. Making this change because places such as `show_cluster_setting.go` require the passed-in function to be run at least once; without that, there is a seg fault.

Fixes: #25600. Fixes: #25603. Fixes: #25570. Fixes: #25567. Fixes: #25566. Fixes: #25511. Fixes: #25485.

Release note: None

25625: storage: Adding testing knob to disable automatic lease renewals r=a-robinson a=a-robinson

In order to fix the test flakes caused by automatic lease renewals.

Fixes #25537 Fixes #25540 Fixes #25568 Fixes #25573 Fixes #25576 Fixes #25589 Fixes #25594 Fixes #25599 Fixes #25605 Fixes #25620

Release note: None

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
Co-authored-by: Victor Chen <victor@cockroachlabs.com>
gRPC creates unnecessary packets during unary response processing. An upstream PR (grpc/grpc-go#1373) to fix this was buggy and rejected and we're waiting on a real fix from the gRPC folks. But above and beyond reducing the number of packets per response to 1, we can look at reducing it below 1 by combining multiple responses into a single packet. Additionally, flushing the write buffer to the connection currently blocks other goroutines. We should investigate a pipelined approach where the write buffer is filled and when it needs to be flushed a new buffer is swapped into place while the buffer is written to the connection.
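A minimal sketch of that pipelined approach (hypothetical names; error handling and the flow-control interactions behind grpc/grpc-go#1373's deadlock are deliberately ignored):

```go
package main

import (
	"bytes"
	"net"
	"sync"
	"time"
)

// pipelinedWriter double-buffers writes: senders append to the active buffer
// while a single flusher swaps in the spare buffer and performs the Write
// syscall outside the lock, so filling and flushing overlap.
type pipelinedWriter struct {
	mu       sync.Mutex
	active   *bytes.Buffer
	spare    *bytes.Buffer
	flushing bool
	conn     net.Conn
}

func newPipelinedWriter(c net.Conn) *pipelinedWriter {
	return &pipelinedWriter{active: &bytes.Buffer{}, spare: &bytes.Buffer{}, conn: c}
}

func (w *pipelinedWriter) write(p []byte) {
	w.mu.Lock()
	w.active.Write(p)
	start := !w.flushing
	w.flushing = true
	w.mu.Unlock()
	if start {
		go w.flushLoop() // at most one flusher runs at a time
	}
}

func (w *pipelinedWriter) flushLoop() {
	for {
		w.mu.Lock()
		if w.active.Len() == 0 {
			w.flushing = false
			w.mu.Unlock()
			return
		}
		full := w.active
		w.active = w.spare // swap: senders keep filling the fresh buffer
		w.mu.Unlock()

		w.conn.Write(full.Bytes()) // the slow part happens with the lock released
		full.Reset()

		w.mu.Lock()
		w.spare = full // recycle the drained buffer
		w.mu.Unlock()
	}
}

func main() {
	c1, c2 := net.Pipe()
	go func() { // drain the other end of the pipe
		buf := make([]byte, 1024)
		for {
			if _, err := c2.Read(buf); err != nil {
				return
			}
		}
	}()
	w := newPipelinedWriter(c1)
	for i := 0; i < 8; i++ {
		w.write([]byte("response"))
	}
	time.Sleep(10 * time.Millisecond) // let the background flusher finish in this toy demo
}
```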