
storage/engine: batch concurrent commits in Go #14138

Merged: 1 commit into cockroachdb:master on Mar 16, 2017

Conversation

@petermattis (Collaborator) commented Mar 14, 2017

Batch concurrent commits of write-only batches (i.e. most batches) in
Go. This gives a 10% performance boost on a write-only workload on my
laptop and a 50% performance boost on a write-only workload on a
single-node cluster running on Azure.

See #13974
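
To make the approach concrete, here is a minimal, hypothetical sketch of the group-commit idea (not the PR's actual code: the real implementation in pkg/storage/engine/rocksdb.go tracks commitSeq/pendingSeq sequence numbers and a pending slice, as the review excerpt later in this thread shows; the sketch below uses channels instead for brevity). The first committer to find no commit in flight becomes the leader, drains everything queued behind it, makes one expensive (cgo) write for the whole group, and hands each waiter the result:

```go
// Hypothetical sketch of leader-based group commit; illustrative only.
package main

import (
	"fmt"
	"sync"
)

// pendingCommit is one queued batch waiting to be written.
type pendingCommit struct {
	repr []byte     // serialized batch contents
	err  chan error // receives the result of the group commit
}

// groupCommitter coalesces concurrent commits into single write calls.
type groupCommitter struct {
	mu         sync.Mutex
	committing bool // true while a leader is inside write
	pending    []pendingCommit
	write      func([][]byte) error // stand-in for the expensive cgo commit
}

// commit queues repr; the first caller to find no commit in flight
// becomes the leader and writes everything queued so far in one call.
func (g *groupCommitter) commit(repr []byte) error {
	done := make(chan error, 1)
	g.mu.Lock()
	g.pending = append(g.pending, pendingCommit{repr: repr, err: done})
	if g.committing {
		// Someone else is leading; they (or a later leader) will pick
		// up our entry. Just wait for the result.
		g.mu.Unlock()
		return <-done
	}
	g.committing = true
	g.mu.Unlock()

	for {
		g.mu.Lock()
		if len(g.pending) == 0 {
			g.committing = false
			g.mu.Unlock()
			return <-done
		}
		group := g.pending
		g.pending = nil
		g.mu.Unlock()

		reprs := make([][]byte, len(group))
		for i, p := range group {
			reprs[i] = p.repr
		}
		err := g.write(reprs) // one expensive call for the whole group
		for _, p := range group {
			p.err <- err
		}
	}
}

func main() {
	g := &groupCommitter{
		write: func(reprs [][]byte) error {
			fmt.Println("committed group of", len(reprs))
			return nil
		},
	}
	var wg sync.WaitGroup
	for i := 0; i < 32; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = g.commit([]byte(fmt.Sprintf("batch-%d", i)))
		}(i)
	}
	wg.Wait()
}
```

The effect is that when many goroutines commit at once, the number of expensive cgo commit calls drops to roughly one per group rather than one per committing goroutine.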



@petermattis (Collaborator, author)

Running `for i in 6 12 24 48 96 192 384; do ./kv --duration 10m --read-percent 0 --tolerate-errors --concurrency $i --splits 0 'postgresql://root@localhost:27183?sslrootcert=certs/ca.crt&sslcert=certs/root.client.crt&sslkey=certs/root.client.key'; done` against a 6-node cluster where the test.kv table was pre-split into 2000 ranges.

Before:

[three screenshots: performance graphs for the run before the change]

After:

[three screenshots: performance graphs for the run after the change]

Note that the y-axis scales differ: the after run is significantly faster at higher concurrency while also showing lower latencies.

@a-robinson (Contributor)

Wow, that's pretty substantial. Do you want an extra pair of eyes trying to figure out what's going on with the before case?



@petermattis (Collaborator, author)

> Do you want an extra pair of eyes trying to figure out what's going on with the before case?

Yes. Or figuring out why the PR has such a dramatic effect.

@petermattis (Collaborator, author)

My current suspicion is that we're tickling badness in the Go scheduler. The Go runtime is composed of Gs, Ms and Ps. Gs are goroutines, Ps are "processors" which are the execution unit for goroutines (i.e. GOMAXPROCS), and Ms are OS threads. Note that there are usually many more Gs than Ms and more Ms than Ps. A running G has both an associated P and an M. Internally, each P has a local set of runnable goroutines, and there is also a global set that is used if a local set grows too large.

The Go runtime performs work stealing: if a P has no goroutines to run, it "steals" half of the goroutines from a random other P.

When a goroutine makes a cgo call, it appears that the P is not detached from the M. That means that any goroutines local to that P are stuck until another P steals them (or the cgo call returns). The code I'm looking at is in runtime/proc.go; see cgocall and entersyscall. Interestingly, there is an entersyscallblock which is used for known-blocking system calls. That routine in turn calls entersyscallblock_handoff, which hands off the current P to another M. I wonder if the time-consuming cgo calls (such as batch commits) are causing goroutine starvation.

@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from bf1b017 to f76b50c Compare March 14, 2017 18:28
@petermattis (Collaborator, author)

The Go scheduler has a background thread named sysmon that periodically loops over the Ps and hands them off to an idle or new M if they have been blocked in a system call or cgo call for too long. The minimum delay for this check is 20us and the maximum is 10ms, though the actual delay could be longer.

@a-robinson (Contributor)

As of Feb 2015, cgo calls were not supposed to block their P for more than 20us, although it's possible that's changed: http://stackoverflow.com/questions/28354141/c-code-and-goroutine-scheduling

Similarly: golang/go#8636 (comment)

It might be worth running with schedtrace enabled

@petermattis (Collaborator, author)

> It might be worth running with schedtrace enabled

I've tried that, though it doesn't reveal much. I might try turning on scheddetail too, but that can be overwhelming (one line per goroutine).
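
For reference, scheduler tracing is controlled through the GODEBUG environment variable: `GODEBUG=schedtrace=1000` prints a one-line scheduler summary every 1000ms (gomaxprocs, idle Ps, thread counts, run queue lengths), and adding `,scheddetail=1` expands it to the per-G/M/P dump mentioned above.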

@a-robinson (Contributor)

The creation of additional threads might itself be a cause of performance issues. I put together a small benchmark that attempts to mimic the blocking in our cgo code before and after this change, and the results are drastic enough that I kind of assume I messed something up. I'll try to check into what's going on with it, but if you're curious in the meantime - https://gist.github.com/a-robinson/27a75e4a2cc6f32e955dad5d7d513958

Perhaps I'm just using bad assumptions. I'm assuming 100us gaps between incoming writes and 5ms of blocking in cgo per (batched) commit.
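
The gist isn't reproduced here, but the setup it describes can be approximated roughly as follows (a hypothetical sketch, not the gist's actual code: a cgo sleepMicros helper stands in for the blocking RocksDB commit, writers arrive every ~100us, and each commit blocks in cgo for 5ms):

```go
// Rough, hypothetical approximation of the benchmark described above;
// not the actual gist code.
package main

/*
#include <unistd.h>

// sleepMicros blocks the calling OS thread, standing in for a slow
// RocksDB batch commit made through cgo.
static void sleepMicros(unsigned int micros) { usleep(micros); }
*/
import "C"

import (
	"fmt"
	"sync"
	"time"
)

// unbatched: every writer makes its own 5ms blocking cgo call, so many
// calls are in flight at once and each needs its own OS thread.
func unbatched(writers int) {
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		time.Sleep(100 * time.Microsecond) // gap between incoming writes
		wg.Add(1)
		go func() {
			defer wg.Done()
			C.sleepMicros(C.uint(5 * time.Millisecond / time.Microsecond))
		}()
	}
	wg.Wait()
}

// batched: writers queue up and a single committer goroutine makes one
// 5ms cgo call per accumulated group, mimicking this PR's batching.
func batched(writers int) {
	var (
		mu      sync.Mutex
		pending []chan struct{}
	)
	stop := make(chan struct{})
	go func() {
		for {
			mu.Lock()
			group := pending
			pending = nil
			mu.Unlock()
			if len(group) > 0 {
				// One cgo call covers the whole group.
				C.sleepMicros(C.uint(5 * time.Millisecond / time.Microsecond))
				for _, ch := range group {
					close(ch)
				}
			}
			select {
			case <-stop:
				return
			default:
				time.Sleep(50 * time.Microsecond)
			}
		}
	}()
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		time.Sleep(100 * time.Microsecond) // gap between incoming writes
		wg.Add(1)
		go func() {
			defer wg.Done()
			ch := make(chan struct{})
			mu.Lock()
			pending = append(pending, ch)
			mu.Unlock()
			<-ch // wait for the group commit that includes this write
		}()
	}
	wg.Wait()
	close(stop)
}

func main() {
	for _, c := range []struct {
		name string
		fn   func(int)
	}{{"unbatched", unbatched}, {"batched", batched}} {
		start := time.Now()
		c.fn(2000)
		fmt.Println(c.name, time.Since(start))
	}
}
```

In the unbatched variant dozens of 5ms cgo calls overlap, so the runtime has to stand up extra OS threads for them; the runtime/pprof "threadcreate" profile is one way to watch that happen.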

@petermattis (Collaborator, author)

Interesting. Let me take a look at that. 2000 is a lot higher concurrency than we see.

@a-robinson (Contributor)

Ok, well I found the first issue with the benchmark. It needed to be `C.sleepMicros(C.uint(5 * time.Millisecond / time.Microsecond))`, not `C.sleepMicros(C.uint(5 * time.Millisecond))`. That gets us to:

$ go test .
--- FAIL: TestCgoSleep (1.00s)
	cgo_sleep_test.go:23: 1.001945863s
--- FAIL: TestBatchedSleep (0.28s)
	cgo_sleep_test.go:29: 284.268565ms
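
(A Go `time.Duration` is an integer count of nanoseconds, so `5 * time.Millisecond` is the number 5,000,000; handed directly to a microseconds-based C sleep that means 5 seconds per call, while dividing by `time.Microsecond` yields the intended 5,000.)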

@a-robinson (Contributor)

And that makes the benchmark usable as an actual Go benchmark, where they perform identically:

$ go test -bench=.
BenchmarkCgoSleep-8       	   10000	    136946 ns/op
BenchmarkBatchedSleep-8   	   10000	    136282 ns/op
PASS
ok  	_/Users/alex/play/run	2.798s

@petermattis (Collaborator, author)

There could be a difference on Linux.

@a-robinson (Contributor)

Indeed. On Linux, we get:

--- FAIL: TestCgoSleep (12.83s)
	cgo_sleep_test.go:19: 12.831447677s
--- FAIL: TestBatchedSleep (6.91s)
	cgo_sleep_test.go:25: 6.909217603s

and

BenchmarkCgoSleep-16        	   10000	    306480 ns/op
BenchmarkBatchedSleep-16    	   10000	    186300 ns/op

@petermattis (Collaborator, author)

Any thoughts on what we want to do here? It seems clear there is some weird interaction with the Go scheduler/runtime, but I'm not terribly eager to debug it. Perhaps we should file an issue upstream and merge this PR, which provides a decent workaround.

@spencerkimball (Member)

👍 on the debugging here. Wow.

@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from f76b50c to 329af54 Compare March 15, 2017 13:41
@a-robinson (Contributor)

I'm fine with opening an issue/question upstream and going forward with this PR for now. I assume there's someone out there to whom this isn't a very hard question.

@a-robinson (Contributor)

Here's a CPU profile from a slow run of TestCgoSleep with 10k iterations. It shows a lot of time spent on scheduler operations:

[CPU profile screenshot: cgo-calls-cpu]

@petermattis (Collaborator, author)

If we tweak the sysmon delay to be a maximum of 1ms (right now it is 10ms), then the discrepancy disappears. Unfortunately, that doesn't seem to help cockroach performance, though I only tested this on Mac OS X, not Linux.

@a-robinson (Contributor)

Interesting, it sounds like the mini-benchmark might not be an accurate enough simulation. What really gets me about your initial description is that "Instrumentation shows that internally RocksDB almost never batches commits together. While the batching below often can batch 20 or 30 concurrent commits." That doesn't make much sense to me if RocksDB can only write one batch to disk at a time. How could writes possibly not be batched up while waiting for the previous write(s) to finish?

@petermattis (Collaborator, author)

> How could writes possibly not be batched up while waiting for the previous write(s) to finish?

I don't know. I measured the number of writes batched by RocksDB a while ago. Perhaps my load is different now. Or something else changed. I should take another look at that.

@a-robinson (Contributor)

Final update before I take a break from this. Here are two longer profiles, the first from a 1M-iteration (12 minute) run of the cgo logic, the second from a 1M-iteration (2 minute) run of the batched logic, both on macOS. The batched version had roughly twice the CPU utilization while it was running (18-20% vs 7-11%), but finished more than 5 times faster.

[CPU profile screenshots: cgo-calls-cpu-1m and batched-calls-cpu-1m]

@a-robinson (Contributor)

The code :lgtm: by the way



@petermattis (Collaborator, author)

Something is definitely funky with RocksDB batching. I ran `kv --read-percent 0 --max-ops 200000 --splits 1000 --concurrency 64` against a single-node cluster.

Before this PR, with RocksDB performing the batching, I see the following breakdown of batch sizes (count, batch size, percentage of all batches):

 141016   1  63.9%
  40054   2  18.1%
  21780   3   9.9%
   9428   4   4.3%
   3501   5   1.6%
...

With this PR (measuring the batching we perform; same columns):

  21354   1  58.3%
    338   2   0.9%
    312   3   0.9%
    279   4   0.8%
    257   5   0.7%
...

Why is RocksDB batching performing so many more commits? In both runs I can see that rocksDBBatch.Commit was called ~420k times. With RocksDB batching this turned into ~220k log writes. With our own batching: ~37k log writes.

The WriterThread code has all sorts of fanciness with lock-free lists. I wonder if it has a bug preventing batching when possible.

@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from 329af54 to fe966b8 Compare March 15, 2017 19:06
@bdarnell (Contributor)

:lgtm:




pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):

		commitSeq  uint64
		pendingSeq uint64
		pending    []pendingBatch

I think you could get rid of the extra type by making this pending []*rocksDBBatch and adding pendingSync bool (so each batch would do c.pendingSync = c.pendingSync || syncCommit when it appends itself to pending).
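
For illustration, the two shapes under discussion look roughly like this (a hypothetical sketch: only the commitSeq/pendingSeq/pending fields come from the excerpt above, the rest is guessed):

```go
// Sketch only; rocksDBBatch stands in for the engine's batch type.
type rocksDBBatch struct{ /* ... */ }

// As written in r1: a small helper type pairs each queued batch with
// its sync flag (field names guessed).
type pendingBatch struct {
	batch *rocksDBBatch
	sync  bool
}

type commitQueue struct {
	commitSeq  uint64
	pendingSeq uint64
	pending    []pendingBatch
}

// The suggested alternative: drop the helper type and OR the per-batch
// flag into a single pendingSync on the queue as each batch appends
// itself (c.pendingSync = c.pendingSync || syncCommit).
type commitQueueAlt struct {
	commitSeq   uint64
	pendingSeq  uint64
	pending     []*rocksDBBatch
	pendingSync bool
}
```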



@petermattis (Collaborator, author)



pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think you could get rid of the extra type by making this pending []*rocksDBBatch and adding pendingSync bool (so each batch would do c.pendingSync = c.pendingSync || syncCommit when it appends itself to pending).

True, and that's actually what I initially did, but is that better?



@bdarnell (Contributor)



pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

True and that's actually what I initially did, but is that better?

I think it's a little nicer to not have the type, but it doesn't matter.



@bdarnell (Contributor)

> The WriterThread code has all sorts of fanciness with lock-free lists. I wonder if it has a bug preventing batching when possible.

There are also a lot of conditions that disable batching (https://github.com/cockroachdb/c-rocksdb/blob/master/internal/db/write_thread.cc#L266-L299). Could we be hitting one of those? Maybe we're alternating between two types of batches that break up RocksDB's batching but work with ours.



@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from fe966b8 to a7ea131 Compare March 15, 2017 19:50
@petermattis (Collaborator, author)



pkg/storage/engine/rocksdb.go, line 306 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think it's a little nicer to not have the type, but it doesn't matter.

Done.



@petermattis (Collaborator, author)

> There are also a lot of conditions that disable batching

Yeah, I noticed that and instrumented all of those code paths. None are being taken. I'll double-check this later.



@petermattis (Collaborator, author)

> Yeah, I noticed that and instrumented all of those code paths. None are being taken. I'll double-check this later.

I double-checked and none of the conditions that can prohibit batching ever fire. I added a bunch more instrumentation to the RocksDB code, thoroughly spelunked it, and can't see anything wrong with it. So I reimplemented in our C++ code the batching this PR implements in Go, and I see the exact same sort of batch sizes that the RocksDB code is producing: out of ~420k calls to commit, the C++ batching produced ~220k batches. The implication is that the difference in batching is due to it being performed in C++ vs. Go, which points the finger at some bad interaction with the Go runtime/scheduler.

@petermattis petermattis force-pushed the pmattis/rocksdb-batch branch from a7ea131 to f99285a Compare March 16, 2017 01:05
@petermattis petermattis merged commit 8fa4363 into cockroachdb:master Mar 16, 2017
@petermattis petermattis deleted the pmattis/rocksdb-batch branch March 16, 2017 17:12
petermattis added a commit to petermattis/cockroach that referenced this pull request Mar 28, 2017
Added support for DeleteRange to DBBatchInserter. This missing support
was causing test failures and various data corruption when enabling
COCKROACH_ENABLE_FAST_CLEAR_RANGE after cockroachdb#14138 started using
ApplyBatchRepr on batches containing DeleteRange operations.

See cockroachdb#14391