storage: batch Raft entry application across ready struct #37426

nvanbenschoten · 2019-05-09T16:56:46Z

Old prototype: #15648.

The Raft proposal pipeline currently applies Raft entries one-by-one to our storage engine in handleRaftReadyRaftMuLocked. We've seen time and time again that batching writes to RocksDB provides a sizable speedup, so this is a perfect candidate for batching. This will come with three concrete wins:

batching the writes reduces the total amount of work incident on the system
batching the writes reduces the number of cgo calls performed by Raft processing
batching the writes should speed up a single iteration of handleRaftReadyRaftMuLocked, allowing the Range to schedule and run the next iteration sooner

Put together, this should be a nice win to the amount of concurrency supported by the Raft proposal pipeline.

Implementation notes

To make this change we will need to transpose the order of operations in the committed entries loop and in processRaftCommand. Instead of the control flow looking like:

for each entry:
    check if failed
    if not:
        apply to engine
    ack client, release latches

it will look something like

for each entry:
    check if failed
    if not:
        apply to batch
commit batch
for each entry:
    ack client, release latches

Care will be needed around handling in-memory side effects of Raft entries correctly.

Predicted win

The improvement on single-range throughput as predicted by rafttoy is:

name                        old time/op    new time/op    delta
Raft/conc=256/bytes=256-16    15.6µs ± 5%    14.6µs ±15%  -6.87%  (p=0.046 n=16+20)

name                        old speed      new speed      delta
Raft/conc=256/bytes=256-16  16.4MB/s ± 5%  17.7MB/s ±17%  +8.30%  (p=0.043 n=16+20)

It's worth noting that rafttoy is using pebble instead of RocksDB. Writing to pebble doesn't incur a cgo call, which means that batching writes isn't quite as critical. This means that it's reasonable to expect the improvement to throughput in Cockroach will be even greater than this prediction.

The text was updated successfully, but these errors were encountered:

nvanbenschoten · 2019-06-14T14:07:53Z

An optimization that falls out cleanly from this is that we can write the RangeAppliedStateKey (the combination of the RaftAppliedIndex, LeaseAppliedIndex, and RangeStats) only once per batch of committed entries instead of once per committed entry.

ajwerner · 2019-06-14T15:14:05Z

If a key is re-written many times in the same batch, do all of those writes end up in the memtable when the batch gets committed?

nvanbenschoten · 2019-06-14T15:18:08Z

do all of those writes end up in the memtable when the batch gets committed?

Yes. According to @ajkr they don't get collapsed until the memtable is compacted into L0.

petermattis · 2019-06-14T15:19:33Z

Yes. According to @ajkr they don't get collapsed until the memtable is compacted into L0.

Yep. I was just about to say the same thing.

nvanbenschoten · 2019-06-15T16:55:52Z

I've also seen lock contention here in Replica.handleReplicatedEvalResult interact poorly with above-Raft processing. If we were to move this work to once-per-Raft ready iteration instead of once-per-committed entry, I'm guessing we'd see some movement on workloads with single-Range hotspots.

ajwerner · 2019-06-25T14:20:08Z

I started typing this yesterday and there are some complications for which I want to propose a minor alteration.

The logic inside of Replica.handleReplicatedEvalResult() expects on-disk state to mirror in-memory state and also expects that the in-memory state reflects the on-disk state exactly upon application of the current command. In order to avoid majorly changing the existing logic I propose that we only batch commands which have a "trivial" ReplicatedEvalResult This is roughly the set of commands for which shouldAssert in handleReplicatedEvalResult false. This would excludes splits, merges, lease transfers, replica changes, any changes to the descriptor, etc.

This way all that needs to be accumulated as far as in-memory state goes should be the stats and and the index values which can be accumulated and then set atomically in the replica state when the batch commits and provides a clear location to write the RangeAppliedState. Commands which have a non-trivial ReplicatedEvalResult can be applied in their own "batch". These commands ought to be rare so this should provide roughly all of the win with less complexity. Sound reasonable?

nvanbenschoten · 2019-06-25T14:41:06Z

Discussed in person. This SGTM.

This commit batches raft command application where possible. The basic approach is to notice that many commands only "trivially" update the replica state machine. Trivial commands can be processed in a single batch by acting on a copy of the replica state. Non-trivial commands share the same logic but always commit alone as they for one reason or another rely on having a view of the replica or storage engine as of a specific log index. This commit also sneaks in another optimization which batching enables. Each command mutates a portion of replica state called the applied state which tracks a combination of the log index which has been applied and the MVCC stats of the range as of that application. Before this change each entry would update this applied state and each of those writes will end up in the WAL, mem-table, and L0 just the be compacted away in L1. Now that commands are being applied to the storage engine in a single batch it is easy to only update the applied state for the last entry in the batch. For sequential writes this patch shows a considerable performance win. The below benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 64. ``` name old ops/s new ops/s delta KV0-throughput 20.2k ± 1% 25.2k ± 1% +24.94% (p=0.029 n=4+4) name old ms/s new ms/s delta KV0-P50 4.45 ± 8% 3.70 ± 5% -16.85% (p=0.029 n=4+4) KV0-Avg 3.10 ± 0% 2.50 ± 0% -19.35% (p=0.029 n=4+4) ``` Due to the re-organization of logic in the change, the Replica.mu does not need to be acquired as many times during the application of a batch. In the common case it is now acquired exactly twice in the process of applying a batch whereas before it was acquired more than twice per entry. This should hopefully improve performance on large machines which experience mutex contention for a single range. This effect is visible on large machines. Below are results from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with concurrency 1024 and 16 initial splits. ``` name old ops/s new ops/s delta KV0-throughput 78.1k ± 1% 116.8k ± 5% +49.42% (p=0.029 n=4+4) name old ms/s new ms/s delta KV0-P50 24.4 ± 3% 19.7 ± 7% -19.28% (p=0.029 n=4+4) KV0-Avg 12.6 ± 0% 7.5 ± 9% -40.87% (p=0.029 n=4+4) ``` Fixes cockroachdb#37426. Release note (performance improvement): Batch raft entry application and coalesce writes to applied state for the batch.

This commit batches raft command application where possible. The basic approach is to notice that many commands only "trivially" update the replica state machine. Trivial commands can be processed in a single batch by acting on a copy of the replica state. Non-trivial commands share the same logic but always commit alone as they for one reason or another rely on having a view of the replica or storage engine as of a specific log index. This commit also sneaks in another optimization which batching enables. Each command mutates a portion of replica state called the applied state which tracks a combination of the log index which has been applied and the MVCC stats of the range as of that application. Before this change each entry would update this applied state and each of those writes will end up in the WAL and mem-table just the be compacted away in L1. Now that commands are being applied to the storage engine in a single batch it is easy to only update the applied state for the last entry in the batch. For sequential writes this patch shows a considerable performance win. The below benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 128. ``` name old ops/s new ops/s delta KV0-throughput 22.1k ± 1% 32.8k ± 1% +48.59% (p=0.029 n=4+4) name old ms/s new ms/s delta KV0-P50 7.15 ± 2% 6.00 ± 0% -16.08% (p=0.029 n=4+4) KV0-Avg 5.80 ± 0% 3.80 ± 0% -34.48% (p=0.029 n=4+4) ``` Due to the re-organization of logic in the change, the Replica.mu does not need to be acquired as many times during the application of a batch. In the common case it is now acquired exactly twice in the process of applying a batch whereas before it was acquired more than twice per entry. This should hopefully improve performance on large machines which experience mutex contention for a single range. This effect is visible on large machines. Below are results from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with concurrency 1024 and 16 initial splits. ``` name old ops/s new ops/s delta KV0-throughput 78.1k ± 1% 116.8k ± 5% +49.42% (p=0.029 n=4+4) name old ms/s new ms/s delta KV0-P50 24.4 ± 3% 19.7 ± 7% -19.28% (p=0.029 n=4+4) KV0-Avg 12.6 ± 0% 7.5 ± 9% -40.87% (p=0.029 n=4+4) ``` Fixes cockroachdb#37426. Release note (performance improvement): Batch raft entry application and coalesce writes to applied state for the batch.

38568: storage: batch command application and coalesce applied state per batch r=nvanbenschoten,tbg a=ajwerner This commit batches raft command application where possible. The basic approach is to notice that many commands only "trivially" update the replica state machine. Trivial commands can be processed in a single batch by acting on a copy of the replica state. Non-trivial commands share the same logic but always commit alone as they for one reason or another rely on having a view of the replica or storage engine as of a specific log index. This commit also sneaks in another optimization which batching enables. Each command mutates a portion of replica state called the applied state which tracks a combination of the log index which has been applied and the MVCC stats of the range as of that application. Before this change each entry would update this applied state and each of those writes will end up in the WAL and mem-table just the be compacted away in L0. Now that commands are being applied to the storage engine in a single batch it is easy to only update the applied state for the last entry in the batch. For sequential writes this patch shows a considerable performance win. The below benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 128. ``` name old ops/s new ops/s delta KV0-throughput 22.1k ± 1% 32.8k ± 1% +48.59% (p=0.029 n=4+4) name old ms/s new ms/s delta KV0-P50 7.15 ± 2% 6.00 ± 0% -16.08% (p=0.029 n=4+4) KV0-Avg 5.80 ± 0% 3.80 ± 0% -34.48% (p=0.029 n=4+4) ``` Due to the re-organization of logic in the change, the Replica.mu does not need to be acquired as many times during the application of a batch. In the common case it is now acquired exactly twice in the process of applying a batch whereas before it was acquired more than twice per entry. This should hopefully improve performance on large machines which experience mutex contention for a single range. These optimizations combine for a big win on large machines. Below are results from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with concurrency 1024 and 16 initial splits. ``` name old ops/s new ops/s delta KV0-throughput 78.1k ± 1% 116.8k ± 5% +49.42% (p=0.029 n=4+4) name old ms/s new ms/s delta KV0-P50 24.4 ± 3% 19.7 ± 7% -19.28% (p=0.029 n=4+4) KV0-Avg 12.6 ± 0% 7.5 ± 9% -40.87% (p=0.029 n=4+4) ``` Fixes #37426. Release note (performance improvement): Batch raft entry application and coalesce writes to applied state for the batch. Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels May 9, 2019

nvanbenschoten assigned ajwerner May 9, 2019

nvanbenschoten mentioned this issue Jun 20, 2019

storage: skip Pebble write-ahead log during Raft entry application #38322

Open

ajwerner mentioned this issue Jun 28, 2019

storage: ensure application of sideloaded entries is durable on followers before truncation #38566

Closed

ajwerner mentioned this issue Jun 28, 2019

storage: batch command application and coalesce applied state per batch #38568

Merged

craig bot closed this as completed in #38568 Jul 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: batch Raft entry application across ready struct #37426

storage: batch Raft entry application across ready struct #37426

nvanbenschoten commented May 9, 2019

nvanbenschoten commented Jun 14, 2019

ajwerner commented Jun 14, 2019

nvanbenschoten commented Jun 14, 2019

petermattis commented Jun 14, 2019

nvanbenschoten commented Jun 15, 2019

ajwerner commented Jun 25, 2019

nvanbenschoten commented Jun 25, 2019

storage: batch Raft entry application across ready struct #37426

storage: batch Raft entry application across ready struct #37426

Comments

nvanbenschoten commented May 9, 2019

Implementation notes

Predicted win

nvanbenschoten commented Jun 14, 2019

ajwerner commented Jun 14, 2019

nvanbenschoten commented Jun 14, 2019

petermattis commented Jun 14, 2019

nvanbenschoten commented Jun 15, 2019

ajwerner commented Jun 25, 2019

nvanbenschoten commented Jun 25, 2019