
storage/engine: batch concurrent commits in Go #14138

Merged: 1 commit into cockroachdb:master on Mar 16, 2017

Conversation

@petermattis (Collaborator) commented Mar 14, 2017

Batch concurrent commits of write-only batches (i.e. most batches) in
Go. This gives a 10% performance boost on a write-only workload on my
laptop and a 50% performance boost on a write-only workload on a
single-node cluster running on Azure.

See #13974
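
To make the approach concrete, here is a minimal, hypothetical sketch of the group-commit idea (not the PR's actual code: the real implementation in pkg/storage/engine/rocksdb.go tracks commitSeq/pendingSeq sequence numbers and a pending slice, as the review excerpt later in this thread shows; the sketch below uses channels instead for brevity). The first committer to find no commit in flight becomes the leader, drains everything queued behind it, makes one expensive (cgo) write for the whole group, and hands each waiter the result:

```go
// Hypothetical sketch of leader-based group commit; illustrative only.
package main

import (
	"fmt"
	"sync"
)

// pendingCommit is one queued batch waiting to be written.
type pendingCommit struct {
	repr []byte     // serialized batch contents
	err  chan error // receives the result of the group commit
}

// groupCommitter coalesces concurrent commits into single write calls.
type groupCommitter struct {
	mu         sync.Mutex
	committing bool // true while a leader is inside write
	pending    []pendingCommit
	write      func([][]byte) error // stand-in for the expensive cgo commit
}

// commit queues repr; the first caller to find no commit in flight
// becomes the leader and writes everything queued so far in one call.
func (g *groupCommitter) commit(repr []byte) error {
	done := make(chan error, 1)
	g.mu.Lock()
	g.pending = append(g.pending, pendingCommit{repr: repr, err: done})
	if g.committing {
		// Someone else is leading; they (or a later leader) will pick
		// up our entry. Just wait for the result.
		g.mu.Unlock()
		return <-done
	}
	g.committing = true
	g.mu.Unlock()

	for {
		g.mu.Lock()
		if len(g.pending) == 0 {
			g.committing = false
			g.mu.Unlock()
			return <-done
		}
		group := g.pending
		g.pending = nil
		g.mu.Unlock()

		reprs := make([][]byte, len(group))
		for i, p := range group {
			reprs[i] = p.repr
		}
		err := g.write(reprs) // one expensive call for the whole group
		for _, p := range group {
			p.err <- err
		}
	}
}

func main() {
	g := &groupCommitter{
		write: func(reprs [][]byte) error {
			fmt.Println("committed group of", len(reprs))
			return nil
		},
	}
	var wg sync.WaitGroup
	for i := 0; i < 32; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = g.commit([]byte(fmt.Sprintf("batch-%d", i)))
		}(i)
	}
	wg.Wait()
}
```

The effect is that when many goroutines commit at once, the number of expensive cgo commit calls drops to roughly one per group rather than one per committing goroutine.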



@petermattis (Collaborator, author)

Running `for i in 6 12 24 48 96 192 384; do ./kv --duration 10m --read-percent 0 --tolerate-errors --concurrency $i --splits 0 'postgresql://root@localhost:27183?sslrootcert=certs/ca.crt&sslcert=certs/root.client.crt&sslkey=certs/root.client.key'; done` against a 6-node cluster where the test.kv table was pre-split into 2000 ranges.

Before:

[three screenshots: performance graphs for the run before the change]

After:

[three screenshots: performance graphs for the run after the change]

Note that the y-axis scales differ: the after run is significantly faster at higher concurrency while also showing lower latencies.

@a-robinson (Contributor)

Wow, that's pretty substantial. Do you want an extra pair of eyes trying to figure out what's going on with the before case?



@petermattis (Collaborator, author)

> Do you want an extra pair of eyes trying to figure out what's going on with the before case?

Yes. Or figuring out why the PR has such a dramatic effect.

@petermattis (Collaborator, author)

My current suspicion is that we're tickling badness in the Go scheduler. The Go runtime is composed of Gs, Ms and Ps. Gs are goroutines, Ps are "processors" which are the execution unit for goroutines (i.e. GOMAXPROCS), and Ms are OS threads. Note that there are usually many more Gs than Ms and more Ms than Ps. A running G has both an associated P and an M. Internally, each P has a local set of runnable goroutines, and there is also a global set that is used if a local set grows too large.

The Go runtime performs work stealing: if a P has no goroutines to run, it "steals" half of the goroutines from a random other P.

When a goroutine makes a cgo call, it appears that the P is not detached from the M. That means that any goroutines local to that P are stuck until another P steals them (or the cgo call returns). The code I'm looking at is in runtime/proc.go; see cgocall and entersyscall. Interestingly, there is an entersyscallblock which is used for known-blocking system calls. That routine in turn calls entersyscallblock_handoff, which hands off the current P to another M. I wonder if the time-consuming cgo calls (such as batch commits) are causing goroutine starvation.

@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from bf1b017 to f76b50c Compare March 14, 2017 18:28
@petermattis (Collaborator, author)

The Go scheduler has a background thread named sysmon that periodically loops over the Ps and hands them off to an idle or new M if they have been blocked in a system call or cgo call for too long. The minimum delay for this check is 20us and the maximum is 10ms, though the actual delay could be longer.

@a-robinson (Contributor)

As of Feb 2015, cgo calls were not supposed to block their P for more than 20us, although it's possible that's changed: http://stackoverflow.com/questions/28354141/c-code-and-goroutine-scheduling

Similarly: golang/go#8636 (comment)

It might be worth running with schedtrace enabled

@petermattis (Collaborator, author)

> It might be worth running with schedtrace enabled

I've tried that, though it doesn't reveal much. I might try turning on scheddetail too, but that can be overwhelming (one line per goroutine).
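
For reference, scheduler tracing is controlled through the GODEBUG environment variable: `GODEBUG=schedtrace=1000` prints a one-line scheduler summary every 1000ms (gomaxprocs, idle Ps, thread counts, run queue lengths), and adding `,scheddetail=1` expands it to the per-G/M/P dump mentioned above.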

@a-robinson (Contributor)

The creation of additional threads might itself be a cause of performance issues. I put together a small benchmark that attempts to mimic the blocking in our cgo code before and after this change, and the results are drastic enough that I kind of assume I messed something up. I'll try to check into what's going on with it, but if you're curious in the meantime - https://gist.github.com/a-robinson/27a75e4a2cc6f32e955dad5d7d513958

Perhaps I'm just using bad assumptions. I'm assuming 100us gaps between incoming writes and 5ms of blocking in cgo per (batched) commit.
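
The gist isn't reproduced here, but the setup it describes can be approximated roughly as follows (a hypothetical sketch, not the gist's actual code: a cgo sleepMicros helper stands in for the blocking RocksDB commit, writers arrive every ~100us, and each commit blocks in cgo for 5ms):

```go
// Rough, hypothetical approximation of the benchmark described above;
// not the actual gist code.
package main

/*
#include <unistd.h>

// sleepMicros blocks the calling OS thread, standing in for a slow
// RocksDB batch commit made through cgo.
static void sleepMicros(unsigned int micros) { usleep(micros); }
*/
import "C"

import (
	"fmt"
	"sync"
	"time"
)

// unbatched: every writer makes its own 5ms blocking cgo call, so many
// calls are in flight at once and each needs its own OS thread.
func unbatched(writers int) {
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		time.Sleep(100 * time.Microsecond) // gap between incoming writes
		wg.Add(1)
		go func() {
			defer wg.Done()
			C.sleepMicros(C.uint(5 * time.Millisecond / time.Microsecond))
		}()
	}
	wg.Wait()
}

// batched: writers queue up and a single committer goroutine makes one
// 5ms cgo call per accumulated group, mimicking this PR's batching.
func batched(writers int) {
	var (
		mu      sync.Mutex
		pending []chan struct{}
	)
	stop := make(chan struct{})
	go func() {
		for {
			mu.Lock()
			group := pending
			pending = nil
			mu.Unlock()
			if len(group) > 0 {
				// One cgo call covers the whole group.
				C.sleepMicros(C.uint(5 * time.Millisecond / time.Microsecond))
				for _, ch := range group {
					close(ch)
				}
			}
			select {
			case <-stop:
				return
			default:
				time.Sleep(50 * time.Microsecond)
			}
		}
	}()
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		time.Sleep(100 * time.Microsecond) // gap between incoming writes
		wg.Add(1)
		go func() {
			defer wg.Done()
			ch := make(chan struct{})
			mu.Lock()
			pending = append(pending, ch)
			mu.Unlock()
			<-ch // wait for the group commit that includes this write
		}()
	}
	wg.Wait()
	close(stop)
}

func main() {
	for _, c := range []struct {
		name string
		fn   func(int)
	}{{"unbatched", unbatched}, {"batched", batched}} {
		start := time.Now()
		c.fn(2000)
		fmt.Println(c.name, time.Since(start))
	}
}
```

In the unbatched variant dozens of 5ms cgo calls overlap, so the runtime has to stand up extra OS threads for them; the runtime/pprof "threadcreate" profile is one way to watch that happen.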

@petermattis (Collaborator, author)

Interesting. Let me take a look at that. 2000 is a lot higher concurrency than we see.

@a-robinson (Contributor)

Ok, well I found the first issue with the benchmark. It needed to be `C.sleepMicros(C.uint(5 * time.Millisecond / time.Microsecond))`, not `C.sleepMicros(C.uint(5 * time.Millisecond))`. That gets us to:

$ go test .
--- FAIL: TestCgoSleep (1.00s)
	cgo_sleep_test.go:23: 1.001945863s
--- FAIL: TestBatchedSleep (0.28s)
	cgo_sleep_test.go:29: 284.268565ms
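
(A Go `time.Duration` is an integer count of nanoseconds, so `5 * time.Millisecond` is the number 5,000,000; handed directly to a microseconds-based C sleep that means 5 seconds per call, while dividing by `time.Microsecond` yields the intended 5,000.)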

@a-robinson (Contributor)

And that makes the benchmark usable as an actual Go benchmark, where they perform identically:

$ go test -bench=.
BenchmarkCgoSleep-8       	   10000	    136946 ns/op
BenchmarkBatchedSleep-8   	   10000	    136282 ns/op
PASS
ok  	_/Users/alex/play/run	2.798s

@petermattis (Collaborator, author)

There could be a difference on Linux.

@a-robinson (Contributor)

Indeed. On Linux, we get:

--- FAIL: TestCgoSleep (12.83s)
	cgo_sleep_test.go:19: 12.831447677s
--- FAIL: TestBatchedSleep (6.91s)
	cgo_sleep_test.go:25: 6.909217603s

and

BenchmarkCgoSleep-16        	   10000	    306480 ns/op
BenchmarkBatchedSleep-16    	   10000	    186300 ns/op

@petermattis (Collaborator, author)

Any thoughts on what we want to do here? It seems clear there is some weird interaction with the Go scheduler/runtime, but I'm not terribly eager to debug it. Perhaps we should file an issue upstream and merge this PR, which provides a decent workaround.

@spencerkimball (Member)

👍 on the debugging here. Wow.

@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from f76b50c to 329af54 Compare March 15, 2017 13:41
@a-robinson (Contributor)

I'm fine with opening an issue/question upstream and going forward with this PR for now. I assume there's someone out there to whom this isn't a very hard question.

@a-robinson (Contributor)

Here's a CPU profile from a slow run of TestCgoSleep with 10k iterations. It shows a lot of time spent on scheduler operations:

[CPU profile screenshot: cgo-calls-cpu]

@petermattis (Collaborator, author)

If we tweak the sysmon delay to be a maximum of 1ms (right now it is 10ms), then the discrepancy disappears. Unfortunately, that doesn't seem to help cockroach performance, though I only tested this on Mac OS X, not Linux.

@a-robinson (Contributor)

Interesting, it sounds like the mini-benchmark might not be an accurate enough simulation. What really gets me about your initial description is that "Instrumentation shows that internally RocksDB almost never batches commits together. While the batching below often can batch 20 or 30 concurrent commits." That doesn't make much sense to me if RocksDB can only write one batch to disk at a time. How could writes possibly not be batched up while waiting for the previous write(s) to finish?

@petermattis (Collaborator, author)

> How could writes possibly not be batched up while waiting for the previous write(s) to finish?

I don't know. I measured the number of writes batched by RocksDB a while ago. Perhaps my load is different now. Or something else changed. I should take another look at that.

@a-robinson (Contributor)

Final update before I take a break from this. Here are two longer profiles, the first from a 1M-iteration (12 minute) run of the cgo logic, the second from a 1M-iteration (2 minute) run of the batched logic, both on macOS. The batched version had roughly twice the CPU utilization while it was running (18-20% vs 7-11%), but finished more than 5 times faster.

[CPU profile screenshots: cgo-calls-cpu-1m and batched-calls-cpu-1m]

@a-robinson (Contributor)

The code :lgtm: by the way



@petermattis (Collaborator, author)

Something is definitely funky with RocksDB batching. I ran `kv --read-percent 0 --max-ops 200000 --splits 1000 --concurrency 64` against a single-node cluster.

Before this PR, with RocksDB performing the batching, I see the following breakdown of batch sizes (count, batch size, percentage of all batches):

 141016   1  63.9%
  40054   2  18.1%
  21780   3   9.9%
   9428   4   4.3%
   3501   5   1.6%
...

With this PR (measuring the batching we perform; same columns):

  21354   1  58.3%
    338   2   0.9%
    312   3   0.9%
    279   4   0.8%
    257   5   0.7%
...

Why is RocksDB batching performing so many more commits? In both runs I can see that rocksDBBatch.Commit was called ~420k times. With RocksDB batching this turned into ~220k log writes. With our own batching: ~37k log writes.

The WriterThread code has all sorts of fanciness with lock-free lists. I wonder if it has a bug preventing batching when possible.

@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from 329af54 to fe966b8 Compare March 15, 2017 19:06
@bdarnell (Contributor)

:lgtm:




pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):

		commitSeq  uint64
		pendingSeq uint64
		pending    []pendingBatch

I think you could get rid of the extra type by making this pending []*rocksDBBatch and adding pendingSync bool (so each batch would do c.pendingSync = c.pendingSync || syncCommit when it appends itself to pending).
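
For illustration, the two shapes under discussion look roughly like this (a hypothetical sketch: only the commitSeq/pendingSeq/pending fields come from the excerpt above, the rest is guessed):

```go
// Sketch only; rocksDBBatch stands in for the engine's batch type.
type rocksDBBatch struct{ /* ... */ }

// As written in r1: a small helper type pairs each queued batch with
// its sync flag (field names guessed).
type pendingBatch struct {
	batch *rocksDBBatch
	sync  bool
}

type commitQueue struct {
	commitSeq  uint64
	pendingSeq uint64
	pending    []pendingBatch
}

// The suggested alternative: drop the helper type and OR the per-batch
// flag into a single pendingSync on the queue as each batch appends
// itself (c.pendingSync = c.pendingSync || syncCommit).
type commitQueueAlt struct {
	commitSeq   uint64
	pendingSeq  uint64
	pending     []*rocksDBBatch
	pendingSync bool
}
```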



@petermattis (Collaborator, author)



pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think you could get rid of the extra type by making this pending []*rocksDBBatch and adding pendingSync bool (so each batch would do c.pendingSync = c.pendingSync || syncCommit when it appends itself to pending).

True, and that's actually what I initially did, but is that better?



@bdarnell (Contributor)



pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

True and that's actually what I initially did, but is that better?

I think it's a little nicer to not have the type, but it doesn't matter.



@bdarnell (Contributor)

> The WriterThread code has all sorts of fanciness with lock-free lists. I wonder if it has a bug preventing batching when possible.

There are also a lot of conditions that disable batching (https://github.com/cockroachdb/c-rocksdb/blob/master/internal/db/write_thread.cc#L266-L299). Could we be hitting one of those? Maybe we're alternating between two types of batches that break up RocksDB's batching but work with ours.



@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from fe966b8 to a7ea131 Compare March 15, 2017 19:50
@petermattis (Collaborator, author)



pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think it's a little nicer to not have the type, but it doesn't matter.

Done.



@petermattis (Collaborator, author)

> There are also a lot of conditions that disable batching

Yeah, I noticed that and instrumented all of those code paths. None are being taken. I'll double-check this later.



@petermattis (Collaborator, author)

> Yeah, I noticed that and instrumented all of those code paths. None are being taken. I'll double-check this later.

I double-checked and none of the conditions that can prohibit batching ever fire. I added a bunch more instrumentation to the RocksDB code, thoroughly spelunked it, and can't see anything wrong with it. So I reimplemented in our C++ code the batching this PR implements in Go, and I see the exact same sort of batch sizes that the RocksDB code is producing: out of ~420k calls to commit, the C++ batching produced ~220k batches. The implication is that the difference in batching is due to it being performed in C++ vs. Go, which points the finger at some bad interaction with the Go runtime/scheduler.

@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from a7ea131 to f99285a Compare March 16, 2017 01:05
@petermattis petermattis merged commit 8fa4363 into cockroachdb:master Mar 16, 2017
@petermattis petermattis deleted the pmattis/rocksdb-batch branch March 16, 2017 17:12
petermattis added a commit to petermattis/cockroach that referenced this pull request Mar 28, 2017
Added support for DeleteRange to DBBatchInserter. This missing support
was causing test failures and various data corruption when enabling
COCKROACH_ENABLE_FAST_CLEAR_RANGE after cockroachdb#14138 started using
ApplyBatchRepr on batches containing DeleteRange operations.

See cockroachdb#14391