storage/engine: batch concurrent commits in Go #14138
Conversation
Running the benchmark before and after this change (charts omitted). Notice that the y-scales differ. The after is significantly faster at higher concurrency while having lower latencies.
Wow, that's pretty substantial. Do you want an extra pair of eyes trying to figure out what's going on with the before case?
Yes. Or figuring out why the PR has such a dramatic effect.
My current suspicion is that we're tickling badness in the Go scheduler. The Go runtime is composed of Gs, Ms and Ps: Gs are goroutines, Ms are OS threads, and Ps are "processors", the execution unit for goroutines (i.e. a goroutine only runs while it is attached to a P). The Go runtime performs work stealing: if a P has no goroutines to run it "steals" half of the goroutines from a random other P. When a goroutine makes a cgo call, it appears that the P is not detached from the M. That means that any goroutines that are local to that P are stuck until another P steals them (or the cgo call returns). The code I'm looking at is in
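A minimal sketch of a program for observing this behavior (not code from this PR; the 5ms block, the goroutine count, and the C helper are arbitrary stand-ins), run under `GODEBUG=schedtrace`:

```go
// blockers.go: many goroutines funnel through a blocking cgo call.
// Run with GODEBUG=schedtrace=1000 and watch the thread count climb as the
// runtime hands Ps off to fresh threads while cgo calls sit blocked.
package main

/*
#include <unistd.h>
static void blockInC(void) { usleep(5000); } // ~5ms, stands in for a RocksDB commit
*/
import "C"

import "sync"

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 500; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				C.blockInC()
			}
		}()
	}
	wg.Wait()
}
```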
Force-pushed from bf1b017 to f76b50c.
The Go scheduler has a background process named sysmon which, among other things, retakes a P from a goroutine that has been blocked in a system call or cgo call for too long.
As of Feb 2015, cgo calls were not supposed to block their P for more than 20us, although it's possible that's changed: http://stackoverflow.com/questions/28354141/c-code-and-goroutine-scheduling (similarly: golang/go#8636 (comment)). It might be worth running with schedtrace enabled.
I've tried that, though it doesn't reveal much. I might try turning on
The creation of additional threads might itself be a cause of performance issues. I put together a small benchmark that attempts to mimic the blocking in our cgo code before and after this change, and the results are drastic enough that I kind of assume I messed something up. I'll try to check into what's going on with it, but if you're curious in the meantime - https://gist.github.com/a-robinson/27a75e4a2cc6f32e955dad5d7d513958. Perhaps I'm just using bad assumptions. I'm assuming 100us gaps between incoming writes and 5ms of blocking in cgo per (batched) commit.
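For context, a benchmark of roughly this shape is being described here. This is only a rough sketch under the stated assumptions (100us arrival gaps, 5ms per commit), with made-up names and `b.SetParallelism` standing in for the high writer concurrency; it is not the gist itself:

```go
package cgobench

import (
	"sync"
	"testing"
	"time"
)

const (
	arrivalGap = 100 * time.Microsecond // assumed gap between incoming writes
	commitCost = 5 * time.Millisecond   // assumed blocking time per (batched) commit
)

// BenchmarkSerialCommits models the "before" case: commits are serialized and
// every writer pays the full commit cost, with no coalescing.
func BenchmarkSerialCommits(b *testing.B) {
	var mu sync.Mutex
	b.SetParallelism(100) // many concurrent writers per P (tune as needed)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			time.Sleep(arrivalGap)
			mu.Lock()
			time.Sleep(commitCost)
			mu.Unlock()
		}
	})
}

// BenchmarkBatchedCommits models the "after" case: a single committer drains
// whatever has queued up and pays the commit cost once for the whole group.
func BenchmarkBatchedCommits(b *testing.B) {
	reqs := make(chan chan struct{}, 8192)
	// The committer goroutine is leaked when the benchmark ends; fine for a sketch.
	go func() {
		for first := range reqs {
			group := []chan struct{}{first}
		drain:
			for {
				select {
				case c := <-reqs:
					group = append(group, c)
				default:
					break drain
				}
			}
			time.Sleep(commitCost) // one "cgo" commit for the whole group
			for _, c := range group {
				close(c)
			}
		}
	}()
	b.SetParallelism(100)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			time.Sleep(arrivalGap)
			done := make(chan struct{})
			reqs <- done
			<-done
		}
	})
}
```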
Interesting. Let me take a look at that. 2000 is a lot higher concurrency than we see.
Ok, well I found the first issue with the benchmark. It needed to be
And that makes the benchmark usable as an actual go benchmark, where they perform identically (benchmark output omitted).
There could be a difference on Linux.
Indeed. On Linux the two versions no longer perform identically (benchmark output omitted).
Any thoughts on what we want to do here? Seems clear there is some weird interaction with the Go scheduler/runtime, but I'm not terribly eager to debug it. Perhaps we should file an issue upstream and merge this PR, which provides a decent workaround.
👍 on the debugging here. Wow.
Force-pushed from f76b50c to 329af54.
I'm fine with opening an issue/question upstream and going forward with this PR for now. I assume there's someone out there to whom this isn't a very hard question.
If we tweak the
Interesting, it sounds like the mini-benchmark might not be an accurate enough simulation. What really gets me about your initial description is that "Instrumentation shows that internally RocksDB almost never batches commits together. While the batching below often can batch 20 or 30 concurrent commits." That doesn't make much sense to me if RocksDB can only write one batch to disk at a time. How could writes possibly not be batched up while waiting for the previous write(s) to finish?
I don't know. I measured the number of writes batched by RocksDB a while ago. Perhaps my load is different now. Or something else changed. I should take another look at that.
Final update before I take a break from this. Here are two longer profiles (omitted here), the first from a 1M-iteration (12 minute) run of the cgo version, the second from a 1M-iteration (2 minute) run of the batched logic, both on macOS. The batched version had about 2x higher CPU utilization while it was running (18-20% vs 7-11%), but finished more than 5 times faster.
Reviewed 2 of 2 files at r1.
Something is definitely funky with RocksDB batching. I ran the workload with and without this PR. Before this PR, with RocksDB performing the batching, I see the following breakdown of batch sizes (histogram omitted). With this PR (measuring the batching we perform), the batches are much larger (histogram omitted). Why is RocksDB batching performing so many more commits? In both runs I can see that
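For reference, one hypothetical way to collect such a breakdown on the Go side; these names are made up and are not the instrumentation actually used:

```go
package engine

import "sync"

// batchSizeHist counts, for each engine commit, how many Go-level batches
// were folded into it (illustrative instrumentation only).
var batchSizeHist = struct {
	sync.Mutex
	counts map[int]int
}{counts: make(map[int]int)}

// recordBatchSize is called once per engine commit with the group size.
func recordBatchSize(n int) {
	batchSizeHist.Lock()
	batchSizeHist.counts[n]++
	batchSizeHist.Unlock()
}
```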
Force-pushed from 329af54 to fe966b8.
pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):
I think you could get rid of the extra type by making this
pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
True, and that's actually what I initially did, but is that better?
pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
I think it's a little nicer to not have the type, but it doesn't matter.
There are also a lot of conditions that disable batching (https://github.com/cockroachdb/c-rocksdb/blob/master/internal/db/write_thread.cc#L266-L299). Could we be hitting one of those? Maybe we're alternating between two types of batches that break up rocksdb's batching but work with ours.
Force-pushed from fe966b8 to a7ea131.
pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.
Yeah, I noticed that and instrumented all of those code paths. None are being taken. I'll double-check this later.
I double-checked and none of the conditions that can prohibit batching ever fire. I added a bunch more instrumentation to the RocksDB code and thoroughly spelunked it and can't see anything wrong with it. So I reimplemented, in our C++ code, the batching this PR implements in Go, and I see the exact same sort of batch sizes that the RocksDB code is producing. Out of ~420k calls to commit, the C++ batching produced 220k batches. The implication is that the difference in batching comes from it being performed in C++ vs Go, which points the finger at some bad interaction with the Go runtime/scheduler.
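For readers following along, here is a stripped-down sketch of the Go-side leader/follower batching being discussed; the type and field names are illustrative and not the actual rocksdb.go implementation:

```go
package engine

import "sync"

// pendingBatch is one writer waiting to be committed.
type pendingBatch struct {
	repr []byte         // serialized batch contents
	wg   sync.WaitGroup // signaled when the commit that includes this batch finishes
	err  error
}

// committer coalesces concurrently arriving commits into one engine write.
type committer struct {
	mu      sync.Mutex      // protects pending
	pending []*pendingBatch // batches waiting for the next commit
	commit  sync.Mutex      // serializes the actual (cgo) writes
	// apply performs a single engine write for a group of batches, e.g. by
	// concatenating the reprs and making one cgo call.
	apply func(reprs [][]byte) error
}

// commitBatch queues b and returns once some leader has committed it.
func (c *committer) commitBatch(b *pendingBatch) error {
	b.wg.Add(1)

	c.mu.Lock()
	leader := len(c.pending) == 0
	c.pending = append(c.pending, b)
	c.mu.Unlock()

	if leader {
		// Wait for any in-flight commit to finish. While we wait, more
		// batches can queue up behind us in c.pending.
		c.commit.Lock()

		c.mu.Lock()
		group := c.pending
		c.pending = nil
		c.mu.Unlock()

		reprs := make([][]byte, len(group))
		for i, p := range group {
			reprs[i] = p.repr
		}
		err := c.apply(reprs) // one blocking cgo call for the whole group
		c.commit.Unlock()

		for _, p := range group {
			p.err = err
			p.wg.Done()
		}
	}

	b.wg.Wait()
	return b.err
}
```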
Batch concurrent commits of write-only batches (i.e. most batches) in Go. This gives a 10% performance boost on a write-only workload on my laptop and a 50% performance boost on a write-only workload on a single-node cluster running on Azure. See cockroachdb#13974
Force-pushed from a7ea131 to f99285a.
Added support for DeleteRange to DBBatchInserter. This missing support was causing test failures and various data corruption when enabling COCKROACH_ENABLE_FAST_CLEAR_RANGE after cockroachdb#14138 started using ApplyBatchRepr on batches containing DeleteRange operations. See cockroachdb#14391
Batch concurrent commits of write-only batches (i.e. most batches) in
Go. This gives a 10% performance boost on a write-only workload on my
laptop and a 50% performance boost on a write-only workload on a
single-node cluster running on Azure.
See #13974