storage: batch Raft entry application across ready struct #37426

Closed
nvanbenschoten opened this issue May 9, 2019 · 7 comments · Fixed by #38568
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@nvanbenschoten
Member

Old prototype: #15648.

The Raft proposal pipeline currently applies Raft entries one-by-one to our storage engine in handleRaftReadyRaftMuLocked. We've seen time and time again that batching writes to RocksDB provides a sizable speedup, so this is a perfect candidate for batching. This will come with three concrete wins:

  1. batching the writes reduces the total amount of work incident on the system
  2. batching the writes reduces the number of cgo calls performed by Raft processing
  3. batching the writes should speed up a single iteration of handleRaftReadyRaftMuLocked, allowing the Range to schedule and run the next iteration sooner

Put together, this should be a nice win for the amount of concurrency supported by the Raft proposal pipeline.

Implementation notes

To make this change we will need to transpose the order of operations in the committed entries loop and in processRaftCommand. Instead of the control flow looking like:

for each entry:
    check if failed
    if not:
        apply to engine
    ack client, release latches

it will look something like:

for each entry:
    check if failed
    if not:
        apply to batch
commit batch
for each entry:
    ack client, release latches

Care will be needed around handling in-memory side effects of Raft entries correctly.
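The transposed control flow can be sketched in Go. This is a minimal toy model, not CockroachDB's actual code: `Entry`, `Batch`, and `applyCommittedEntries` are hypothetical stand-ins that only illustrate staging all writes into one batch, committing once, and then acking clients.

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for a committed Raft entry.
type Entry struct {
	Index  uint64
	Failed bool
	Key    string
	Value  string
}

// Batch is a toy write batch that stages writes and commits them at once.
type Batch struct{ writes map[string]string }

func NewBatch() *Batch { return &Batch{writes: map[string]string{}} }

func (b *Batch) Put(k, v string) { b.writes[k] = v }

// Commit applies all staged writes to the engine in one step instead of
// one engine write per entry.
func (b *Batch) Commit(engine map[string]string) {
	for k, v := range b.writes {
		engine[k] = v
	}
}

// applyCommittedEntries shows the transposed control flow: stage every
// non-failed entry into a single batch, commit the batch once, then ack
// every client and release latches.
func applyCommittedEntries(entries []Entry, engine map[string]string) {
	b := NewBatch()
	for _, e := range entries {
		if e.Failed { // "check if failed"
			continue
		}
		b.Put(e.Key, e.Value) // "apply to batch"
	}
	b.Commit(engine) // "commit batch"
	for _, e := range entries {
		fmt.Printf("ack entry %d, release latches\n", e.Index)
	}
}

func main() {
	engine := map[string]string{}
	applyCommittedEntries([]Entry{
		{Index: 1, Key: "a", Value: "1"},
		{Index: 2, Failed: true, Key: "b", Value: "2"},
		{Index: 3, Key: "c", Value: "3"},
	}, engine)
	fmt.Println(len(engine)) // the failed entry was skipped
}
```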

Predicted win

The improvement on single-range throughput as predicted by rafttoy is:

name                        old time/op    new time/op    delta
Raft/conc=256/bytes=256-16    15.6µs ± 5%    14.6µs ±15%  -6.87%  (p=0.046 n=16+20)

name                        old speed      new speed      delta
Raft/conc=256/bytes=256-16  16.4MB/s ± 5%  17.7MB/s ±17%  +8.30%  (p=0.043 n=16+20)

It's worth noting that rafttoy uses pebble instead of RocksDB. Writing to pebble doesn't incur a cgo call, so batching writes isn't quite as critical there. It's therefore reasonable to expect the throughput improvement in Cockroach to be even greater than this prediction.

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels May 9, 2019
@nvanbenschoten
Member Author

An optimization that falls out cleanly from this is that we can write the RangeAppliedStateKey (the combination of the RaftAppliedIndex, LeaseAppliedIndex, and RangeStats) only once per batch of committed entries instead of once per committed entry.
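That coalescing can be sketched as follows. The `AppliedState` struct here is an illustrative stand-in for the real RangeAppliedState (applied indexes plus MVCC stats), not its actual encoding: fold each entry's stats delta as you go and emit a single value to write once per batch.

```go
package main

import "fmt"

// AppliedState is an illustrative stand-in for the range applied state:
// the last applied log index plus accumulated stats.
type AppliedState struct {
	AppliedIndex uint64
	Stats        int64 // stand-in for accumulated RangeStats deltas
}

// coalesceAppliedState folds each entry's stats delta into one state and
// returns the single value that would be written once per batch of
// committed entries, rather than once per entry.
func coalesceAppliedState(indexes []uint64, deltas []int64) AppliedState {
	var s AppliedState
	for i, idx := range indexes {
		s.AppliedIndex = idx // the last entry's index wins
		s.Stats += deltas[i] // stats accumulate across the batch
	}
	return s
}

func main() {
	s := coalesceAppliedState([]uint64{10, 11, 12}, []int64{5, -2, 7})
	fmt.Println(s.AppliedIndex, s.Stats) // prints: 12 10
}
```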

@ajwerner
Contributor

If a key is re-written many times in the same batch, do all of those writes end up in the memtable when the batch gets committed?

@nvanbenschoten
Member Author

do all of those writes end up in the memtable when the batch gets committed?

Yes. According to @ajkr they don't get collapsed until the memtable is compacted into L0.

@petermattis
Collaborator

Yes. According to @ajkr they don't get collapsed until the memtable is compacted into L0.

Yep. I was just about to say the same thing.
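The behavior discussed above can be modeled with a toy memtable. This is a sketch of RocksDB's semantics as described in the thread, not RocksDB itself: each write in a committed batch lands as its own entry, and re-writes of a key are only collapsed when the memtable is flushed/compacted.

```go
package main

import "fmt"

// memEntry models one memtable entry; every write in a committed batch
// gets its own sequence number, so re-writes of the same key coexist in
// the memtable until a flush collapses them.
type memEntry struct {
	key, val string
	seq      uint64
}

type memtable struct {
	entries []memEntry
	seq     uint64
}

// applyBatch appends every write in the batch, deduplicating nothing.
func (m *memtable) applyBatch(writes [][2]string) {
	for _, w := range writes {
		m.seq++
		m.entries = append(m.entries, memEntry{w[0], w[1], m.seq})
	}
}

// compact keeps only the newest entry per key, modeling what happens
// when the memtable is flushed down into the LSM.
func (m *memtable) compact() {
	latest := map[string]memEntry{}
	for _, e := range m.entries {
		latest[e.key] = e // later entries overwrite earlier ones
	}
	m.entries = m.entries[:0]
	for _, e := range latest {
		m.entries = append(m.entries, e)
	}
}

func main() {
	var m memtable
	m.applyBatch([][2]string{{"k", "v1"}, {"k", "v2"}})
	fmt.Println(len(m.entries)) // both writes present pre-compaction: 2
	m.compact()
	fmt.Println(len(m.entries)) // collapsed to 1
}
```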

@nvanbenschoten
Member Author

I've also seen lock contention here in Replica.handleReplicatedEvalResult interact poorly with above-Raft processing. If we were to move this work to once-per-Raft ready iteration instead of once-per-committed entry, I'm guessing we'd see some movement on workloads with single-Range hotspots.

@ajwerner
Contributor

I started typing this yesterday and there are some complications for which I want to propose a minor alteration.

The logic inside of Replica.handleReplicatedEvalResult() expects the on-disk state to mirror the in-memory state, and also expects that the in-memory state reflects the on-disk state exactly upon application of the current command. In order to avoid majorly changing the existing logic, I propose that we only batch commands which have a "trivial" ReplicatedEvalResult. This is roughly the set of commands for which shouldAssert in handleReplicatedEvalResult is false. This would exclude splits, merges, lease transfers, replica changes, any changes to the descriptor, etc.

This way all that needs to be accumulated as far as in-memory state goes should be the stats and the index values, which can be accumulated and then set atomically in the replica state when the batch commits; this also provides a clear location to write the RangeAppliedState. Commands which have a non-trivial ReplicatedEvalResult can be applied in their own "batch". These commands ought to be rare, so this should provide roughly all of the win with less complexity. Sound reasonable?
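The proposed batching decision can be sketched as a grouping function. `Cmd` and `planBatches` are hypothetical names for illustration only: consecutive trivial commands share a batch, and each non-trivial command commits alone.

```go
package main

import "fmt"

// Cmd is a hypothetical command with a flag for whether its
// ReplicatedEvalResult is "trivial" (i.e. no splits, merges, lease
// transfers, replica changes, descriptor updates, ...).
type Cmd struct {
	Index   uint64
	Trivial bool
}

// planBatches groups consecutive trivial commands into shared batches
// and gives each non-trivial command a batch of its own.
func planBatches(cmds []Cmd) [][]Cmd {
	var batches [][]Cmd
	var cur []Cmd
	flush := func() {
		if len(cur) > 0 {
			batches = append(batches, cur)
			cur = nil
		}
	}
	for _, c := range cmds {
		if c.Trivial {
			cur = append(cur, c)
			continue
		}
		flush()                             // commit any pending trivial batch first
		batches = append(batches, []Cmd{c}) // non-trivial commands commit alone
	}
	flush()
	return batches
}

func main() {
	got := planBatches([]Cmd{
		{1, true}, {2, true}, {3, false}, {4, true},
	})
	fmt.Println(len(got)) // three batches: [1 2], [3], [4]
}
```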

@nvanbenschoten
Member Author

Discussed in person. This SGTM.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 1, 2019
This commit batches raft command application where possible. The basic
approach is to notice that many commands only "trivially" update the replica
state machine. Trivial commands can be processed in a single batch by acting
on a copy of the replica state. Non-trivial commands share the same logic but
always commit alone as they for one reason or another rely on having a view of
the replica or storage engine as of a specific log index.

This commit also sneaks in another optimization which batching enables. Each
command mutates a portion of replica state called the applied state which
tracks a combination of the log index which has been applied and the MVCC
stats of the range as of that application. Before this change each entry would
update this applied state and each of those writes will end up in the WAL,
mem-table, and L0 just to be compacted away in L1. Now that commands are
being applied to the storage engine in a single batch it is easy to only
update the applied state for the last entry in the batch.

For sequential writes this patch shows a considerable performance win. The
below benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with
concurrency 64.

```
name            old ops/s   new ops/s   delta
KV0-throughput  20.2k ± 1%  25.2k ± 1%  +24.94%  (p=0.029 n=4+4)

name            old ms/s    new ms/s    delta
KV0-P50          4.45 ± 8%   3.70 ± 5%  -16.85%  (p=0.029 n=4+4)
KV0-Avg          3.10 ± 0%   2.50 ± 0%  -19.35%  (p=0.029 n=4+4)
```

Due to the re-organization of logic in the change, the Replica.mu does not need
to be acquired as many times during the application of a batch. In the common
case it is now acquired exactly twice in the process of applying a batch whereas
before it was acquired more than twice per entry. This should hopefully improve
performance on large machines which experience mutex contention for a single
range.

This effect is visible on large machines. Below are results from running
a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with
concurrency 1024 and 16 initial splits.

```
name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name            old ms/s    new ms/s     delta
KV0-P50          24.4 ± 3%    19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg          12.6 ± 0%     7.5 ± 9%  -40.87%  (p=0.029 n=4+4)
```

Fixes cockroachdb#37426.

Release note (performance improvement): Batch raft entry application and
coalesce writes to applied state for the batch.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 6, 2019
This commit batches raft command application where possible. The basic approach
is to notice that many commands only "trivially" update the replica state
machine. Trivial commands can be processed in a single batch by acting on a
copy of the replica state. Non-trivial commands share the same logic but always
commit alone as they for one reason or another rely on having a view of the
replica or storage engine as of a specific log index.

This commit also sneaks in another optimization which batching enables. Each
command mutates a portion of replica state called the applied state which
tracks a combination of the log index which has been applied and the MVCC stats
of the range as of that application. Before this change each entry would update
this applied state and each of those writes will end up in the WAL and
mem-table just to be compacted away in L1. Now that commands are being applied
to the storage engine in a single batch it is easy to only update the applied
state for the last entry in the batch.

For sequential writes this patch shows a considerable performance win. The
below benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with
concurrency 128.

```
name            old ops/s   new ops/s   delta
KV0-throughput  22.1k ± 1%  32.8k ± 1%  +48.59%  (p=0.029 n=4+4)

name            old ms/s    new ms/s    delta
KV0-P50          7.15 ± 2%   6.00 ± 0%  -16.08%  (p=0.029 n=4+4)
KV0-Avg          5.80 ± 0%   3.80 ± 0%  -34.48%  (p=0.029 n=4+4)
```

Due to the re-organization of logic in the change, the Replica.mu does not need
to be acquired as many times during the application of a batch. In the common
case it is now acquired exactly twice in the process of applying a batch
whereas before it was acquired more than twice per entry. This should hopefully
improve performance on large machines which experience mutex contention for a
single range.

This effect is visible on large machines. Below are results from running
a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with
concurrency 1024 and 16 initial splits.

```
name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name            old ms/s    new ms/s     delta
KV0-P50          24.4 ± 3%    19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg          12.6 ± 0%     7.5 ± 9%  -40.87%  (p=0.029 n=4+4)
```

Fixes cockroachdb#37426.

Release note (performance improvement): Batch raft entry application and
coalesce writes to applied state for the batch.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 8, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 8, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 9, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 10, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 11, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 11, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 12, 2019
craig bot pushed a commit that referenced this issue Jul 12, 2019
38568: storage: batch command application and coalesce applied state per batch  r=nvanbenschoten,tbg a=ajwerner

This commit batches raft command application where possible. The basic
approach is to notice that many commands only "trivially" update the
replica state machine. Trivial commands can be processed in a single
batch by acting on a copy of the replica state. Non-trivial commands
share the same logic but always commit alone as they for one reason or
another rely on having a view of the replica or storage engine as of a
specific log index.

This commit also sneaks in another optimization which batching enables.
Each command mutates a portion of replica state called the applied state
which tracks a combination of the log index which has been applied and the
MVCC stats of the range as of that application. Before this change each
entry would update this applied state and each of those writes will end up
in the WAL and mem-table just to be compacted away in L0. Now that
commands are being applied to the storage engine in a single batch it is easy
to only update the applied state for the last entry in the batch.

For sequential writes this patch shows a considerable performance win. The below
benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 128.

```
name            old ops/s   new ops/s   delta
KV0-throughput  22.1k ± 1%  32.8k ± 1%  +48.59%  (p=0.029 n=4+4)

name            old ms/s    new ms/s    delta
KV0-P50          7.15 ± 2%   6.00 ± 0%  -16.08%  (p=0.029 n=4+4)
KV0-Avg          5.80 ± 0%   3.80 ± 0%  -34.48%  (p=0.029 n=4+4)
```

Due to the re-organization of logic in the change, the Replica.mu does not need
to be acquired as many times during the application of a batch. In the common
case it is now acquired exactly twice in the process of applying a batch whereas
before it was acquired more than twice per entry. This should hopefully improve
performance on large machines which experience mutex contention for a single
range.

These optimizations combine for a big win on large machines. Below are results 
from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with
concurrency 1024 and 16 initial splits.

```
name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name            old ms/s    new ms/s     delta
KV0-P50          24.4 ± 3%    19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg          12.6 ± 0%     7.5 ± 9%  -40.87%  (p=0.029 n=4+4)
```

Fixes #37426.

Release note (performance improvement): Batch raft entry application and
coalesce writes to applied state for the batch.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
craig bot closed this as completed in #38568 Jul 12, 2019