
perf: command commit latency is highly correlated with range count #30213

Closed
nvanbenschoten opened this issue Sep 13, 2018 · 34 comments
Labels: A-storage, C-performance, X-stale

Comments

@nvanbenschoten
Member

On a cluster running TPC-C for a few days, I've noticed that the p99 command commit latency and the p99 log commit latency are both slowly growing. This growth seems to be highly correlated with the range count in the cluster.

[screenshots: custom charts from the debug console]

Interestingly, TPC-C has a fixed amount of load, so it would appear that the range count itself is the only moving variable here. More ranges but a fixed amount of load would result in less batching of RocksDB writes because fewer writes would take place in the same Raft groups. However, our RocksDB commit pipeline attempts to transparently batch independent writes together, so this should help avoid this kind of issue:

// makeBatchGroup adds the incoming batch r to the pending group and reports
// (via leader) whether this batch should perform the commit for the group.
var leader bool
c.pending, c.groupSize, leader = makeBatchGroup(c.pending, r, c.groupSize, maxBatchGroupSize)

I'd like to instrument this pipeline and see if there are any inefficiencies in it. Specifically, I'd like to check whether the pipeline remains full as the number of batches that it attempts to batch together grows. For instance, it may be the case that the write batch merging begins to take longer than the RocksDB writes themselves. This would allow for gaps in the pipeline where the RocksDB syncLoop remains idle.
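To make the shape of that pipeline concrete, here's a minimal, self-contained sketch of the leader/follower group-commit pattern it uses (hypothetical names and structure, not the actual implementation): the first batch to arrive becomes the leader, merges everything queued behind it, and performs a single engine write on behalf of the whole group.

package main

import (
	"fmt"
	"sync"
)

type batch struct {
	data []byte
	err  error
	done chan struct{} // closed once a leader has committed this batch
}

type committer struct {
	mu      sync.Mutex
	pending []*batch
	busy    bool // true while some goroutine is acting as the commit leader
}

// commit enqueues b and either leads the group commit itself or waits for the
// current leader to commit it as part of a group.
func (c *committer) commit(b *batch, write func([][]byte) error) error {
	b.done = make(chan struct{})

	c.mu.Lock()
	c.pending = append(c.pending, b)
	if c.busy {
		c.mu.Unlock()
		<-b.done // a leader will commit this batch in one of its rounds
		return b.err
	}
	c.busy = true
	for {
		group := c.pending
		c.pending = nil
		c.mu.Unlock()

		// Merge the whole group into a single engine write (one WAL sync).
		var merged [][]byte
		for _, gb := range group {
			merged = append(merged, gb.data)
		}
		err := write(merged)
		for _, gb := range group {
			gb.err = err
			close(gb.done)
		}

		c.mu.Lock()
		if len(c.pending) == 0 {
			c.busy = false
			c.mu.Unlock()
			return b.err
		}
		// More batches queued up while we were writing; lead another round.
	}
}

func main() {
	c := &committer{}
	write := func(group [][]byte) error {
		fmt.Printf("committed %d batch(es) in one write\n", len(group))
		return nil
	}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = c.commit(&batch{data: []byte{byte(i)}}, write)
		}(i)
	}
	wg.Wait()
}

In a sketch like this, the win comes from turning N queued batches into one engine write (and one WAL sync); the question above is whether the real pipeline keeps doing that efficiently as the group size grows.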

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. labels Sep 13, 2018
@nvanbenschoten nvanbenschoten added this to the 2.1 milestone Sep 13, 2018
@nvanbenschoten nvanbenschoten self-assigned this Sep 13, 2018
@nvanbenschoten
Member Author

cc. @petermattis @a-robinson in case either has any immediate intuition about this.

@petermattis
Collaborator

petermattis commented Sep 13, 2018 via email

@nvanbenschoten
Member Author

The increasing range count also correlates with increasing on disk data.

Does it? Sure, we'll split ranges as the amount of on disk data grows, but ranges themselves add only a negligible amount of on disk data, and we see commit latency grow dramatically at the same time that the range count jumps. If this were just about on disk data size, I'd expect the command commit latency to increase more gradually, without any jumps.

Your suggestion about extra splitting to test this out is a good idea.

@petermattis
Collaborator

Sure, we'll split ranges as the amount of on disk data grows, but ranges themselves add only a negligible amount of on disk data, and we see commit latency grow dramatically at the same time that the range count jumps.

Ah, that is interesting. I didn't see that these correlated so precisely before (I was looking at the graphs on my phone). I have no explanation for that.

However, our RocksDB commit pipeline attempts to transparently batch independent writes together, so this should help avoid this kind of issue.

Btw, this batching really shouldn't be necessary as RocksDB performs similar batching internally. But as we saw recently (when it was accidentally broken) it provides a significant performance improvement. I mentioned that we're doing this batching to Andrew on the RocksDB team and he agreed it is curious that it provides a benefit. When I investigated this in the past, I could see that the RocksDB batching was very rarely grouping more than one batch together, while our batching was doing so very frequently. I never figured out what was going wrong on the RocksDB side. A bug in RocksDB? An interaction with cgo? There might be a performance win from figuring this out.

@nvanbenschoten
Member Author

The trend has continued:

[screenshots: custom charts from the debug console, 2018-09-14]

Btw, this batching really shouldn't be necessary as RocksDB performs similar batching internally. But as we saw recently (when it was accidentally broken) it provides a significant performance improvement. I mentioned that we're doing this batching to Andrew on the RocksDB team and he agreed it is curious that it provides a benefit. When I investigated this in the past, I could see that the RocksDB batching was very rarely grouping more than one batch together, while our batching was doing so very frequently. I never figured out what was going wrong on the RocksDB side. A bug in RocksDB? An interaction with cgo? There might be a performance win from figuring this out.

Mind pointing me at the prior investigation and also at what I should be looking at in RocksDB? I'd like to dig in and make sure we're not leaving any perf on the table.

@petermattis
Collaborator

Mind pointing me at the prior investigation and also at what I should be looking at in RocksDB? I'd like to dig in and make sure we're not leaving any perf on the table.

See #14138.

@nvanbenschoten
Member Author

I added some instrumentation into our RocksDB commit pipeline (in Go) and measured the different phases of the pipeline to get a feel for where time was being spent:

  • "wait latency" = the time that batches spend waiting to start committing. Between here and here
  • "bundle size" = the number of bundled write batches
  • "bundle latency" = the time a leader batch spends bundling all batches together. Between here and here
  • "commit latency" = the time a leader spends in commitInternal
  • "broadcast latency" = the time a leader spends broadcasting to previous waiters that they can proceed and to others in its group that they are done (if they aren't waiting for a sync). Between here and here
  • "sync latency" = the duration of a single syncLoop iteration

Nothing really stood out. I then took all of the nodes offline and performed compactions on them. When I brought them back online, I saw that log commit latencies were a lot lower. However, this speedup went away over the course of a day.

[screenshot: p99 wait latency, bundle size, bundle latency, commit latency, broadcast latency, sync latency (some data is missing from before I added in the metrics)]

[screenshot: p99 log commit latency and command commit latency over the same period]

Current Values

                p50      p99
wait lat        3.2us    2.5ms
bundle size     3        86
bundle lat      3.1us    67us
commit lat      56.3us   917us
broadcast lat   6.4us    54us
sync lat        3.2ms    61ms

Put together, this doesn't really tell us much about where the issue is, but it does disprove a number of theories I had. For instance, it's pretty clear that the cost of bundling batches isn't an issue. I think I need to go deeper and figure out why we're seeing such long tail latencies in the RocksDB sync loop.

@nvanbenschoten
Member Author

Divide everything in my previous comment by a factor of 3. I was using the custom chart page with a sum aggregator and without the graphs in "per node" mode, so the values were summed across the three nodes.

@nvanbenschoten
Member Author

Here's an interesting chart: GC pause time vs. replica count.

[screenshot: GC pause time vs. replica count]

I saw a few gcAssist calls in a CPU profile and found that GC-related activity accounts for 9.65% of CPU utilization on these nodes. It may just be because #27815 is fresh in my mind, but I'm getting suspicious that this is GC related. Specifically, I think the replica growth is resulting in more objects sitting in memory and slowing down the GC, which in turn is slowing down everything else. This is corroborated by inuse object heap profiles:

[screenshot: inuse objects heap profile]

We can see that Replica-related memory is responsible for over a quarter of inuse objects. @spencerkimball has observed issues like this in his extreme replica count testing. It would be interesting to test out his "replica dehydration" wip branch on this cluster and see if it has any effect.

To start with, I'm going to pick at the low-hanging fruit here. For instance, it looks like all of the pointers in storagebase.ReplicaState can be replaced with values. These are the cause of all the persistent allocations in StateLoader.Load, so this should be a quick 5% reduction in the work the GC needs to do. @tschottdorf do you know why that struct was so pointer-happy to begin with? Was it just to make the assertions in handleReplicatedEvalResult easier to work with?
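To illustrate why the pointer count matters, here's a toy benchmark (not CockroachDB code; the struct shapes and counts are made up) showing that a forced GC cycle takes much longer when the live set is full of pointers the collector has to chase:

package main

import (
	"fmt"
	"runtime"
	"time"
)

type inner struct{ a, b, c int64 }

// pointerHeavy mimics a ReplicaState-like struct where every field is a pointer.
type pointerHeavy struct{ x, y, z, w *inner }

// flat holds the same data by value, so it contributes no pointers for the GC to trace.
type flat struct{ x, y, z, w inner }

// forcedGC returns the wall time of a forced GC cycle (a rough proxy for mark work).
func forcedGC() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	const n = 200000 // roughly "tens of thousands of replicas" scale

	heavy := make([]*pointerHeavy, n)
	for i := range heavy {
		heavy[i] = &pointerHeavy{x: &inner{}, y: &inner{}, z: &inner{}, w: &inner{}}
	}
	fmt.Println("GC with pointer-heavy live set:", forcedGC())

	heavy = nil // drop the pointer-heavy objects
	flatSet := make([]flat, n)
	fmt.Println("GC with flat live set:         ", forcedGC())
	runtime.KeepAlive(flatSet)
}

The absolute numbers will vary by machine, but the first GC has roughly a million heap pointers to follow while the second has essentially none, which is the same shape of difference as replacing the ReplicaState pointer fields with values.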

@tbg
Member

tbg commented Sep 20, 2018

@tschottdorf do you know why that struct was so pointer-happy to begin with? Was it just to make the assertions in handleReplicatedEvalResult easier to work with?

Yes, I think that's basically the reason.

@nvanbenschoten
Member Author

It turns out that most of these pointers were introduced in #18689, which made proto fields nullable so that they would be omitted from the encoded proto. It's really unfortunate that these two concepts are tied together so tightly.

In storagebase.ReplicaState alone there are 8 pointers that we would be able to remove if it weren't for this. These pointers are almost certainly behaving poorly with the GC, which has to do work proportional to the number of live pointers every time it runs. We can easily have tens of thousands of replicas, which means that storagebase.ReplicaState could be responsible for about a quarter million pointers that the GC needs to follow. To make matters worse, there are plenty of other objects that hang off of Replica. I'd estimate that on this 20k-range cluster, the Replica struct alone is responsible for about 1 million pointers. Even if each pointer only takes a nanosecond to process, we can see how GC pauses grow to multi-millisecond chunks of time.

I'm going to do some digging with gctrace, but what I'd really like is a profile of the GC that maps mark/scan time back to the live object hierarchy.
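(For reference, gctrace output goes to stderr, so something like the following captures it; the start flags here are just an example, not the cluster's actual invocation:)

GODEBUG=gctrace=1 ./cockroach start --insecure --store=/mnt/data1 2> gctrace.log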

@petermattis
Collaborator

Is the lifetime of storagebase.ReplicaState known? Pool allocation removes the allocation overhead and also reduces the number of objects that need to be scanned.
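For reference, the pooling being suggested would look roughly like this sketch (the type here is a stand-in, not the real storagebase.ReplicaState); it only pays off if the object has a bounded lifetime and is actually returned to the pool:

package main

import "sync"

// ReplicaState is a stand-in for storagebase.ReplicaState.
type ReplicaState struct {
	RaftAppliedIndex  uint64
	LeaseAppliedIndex uint64
}

var statePool = sync.Pool{
	New: func() interface{} { return new(ReplicaState) },
}

// loadState grabs a recycled object instead of allocating a fresh one.
func loadState() *ReplicaState {
	s := statePool.Get().(*ReplicaState)
	// ... populate s ...
	return s
}

// releaseState clears the object and hands it back for reuse.
func releaseState(s *ReplicaState) {
	*s = ReplicaState{}
	statePool.Put(s)
}

func main() {
	s := loadState()
	releaseState(s)
}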

@nvanbenschoten
Member Author

Well, it's never deallocated, it just lives on the Replica for its entire lifetime. You're saying that if we allocate it from a pool it won't be scanned at all?

@petermattis
Collaborator

Oh, I thought the issue with ReplicaState was its use in commands. You're thinking it is its use in Replica.mu.state?

@petermattis
Collaborator

I wonder if the pointers in ReplicatedEvalResult are a problem. Looks like we allocate a RaftCommand for every command that is applied. Isn't that a lot of garbage containing a lot of pointers?

@a-robinson
Contributor

It's not one of the bigger offenders, but newReplicaStats could reasonably return an object on the stack rather than the heap.

Also, one of the two replicaStats objects on each replica could be trivially removed -- we aren't using keys-written-per-second for anything in v2.1.
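As an aside, the stack-vs-heap difference being suggested is easy to see with the compiler's escape analysis (go build -gcflags='-m'); this is a sketch with made-up fields, not the real replicaStats:

package main

import (
	"fmt"
	"time"
)

// replicaStats is a stand-in with made-up fields.
type replicaStats struct {
	lastRotate time.Time
	records    [6]float64
}

// Returning by value lets the result live on the caller's stack
// (or inline in an enclosing struct).
func newReplicaStatsValue(now time.Time) replicaStats {
	return replicaStats{lastRotate: now}
}

// Returning a pointer to a local typically forces a heap allocation
// (it escapes), unless inlining lets the compiler prove otherwise.
func newReplicaStatsPtr(now time.Time) *replicaStats {
	return &replicaStats{lastRotate: now}
}

func main() {
	now := time.Now()
	v := newReplicaStatsValue(now)
	p := newReplicaStatsPtr(now)
	fmt.Println(v.lastRotate, p.lastRotate)
}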

@a-robinson
Contributor

It's not one of the bigger offenders, but newReplicaStats could reasonably return an object on the stack rather than the heap.

For what it's worth, I just tried this out and performance got 1.3% worse on kv95 both with and without pre-splitting.

@nvanbenschoten
Member Author

I spent some time with this cluster in the Go execution tracer and I've convinced myself that we are seeing a GC-related slowdown.

Here's a representative trace that includes a GC event:

[screenshot: execution trace including a GC event]

We can see a number of interesting things from this screenshot:

  1. the GC ran for about 220ms
  2. while running concurrently, the GC used 4 dedicated processors out of the 16 total
  3. there were dramatically fewer network events processed while the GC was running, even concurrently
  4. the number of runnable goroutines increased while the GC was running concurrently

Let's zoom in a little bit to get more info:

[screenshot: execution trace, zoomed in]

More notes:

  1. the initial stop-the-world sweep termination phase of the GC lasted 209us
  2. the concluding stop-the-world mark termination phase of the GC lasted 334us
  3. even with 4 dedicated processors, the GC relied heavily on Mark Assistance by all other processors while running concurrently. In fact, during this "concurrent" GC period, goroutines are spending more time assisting the GC than doing actual work.
  4. a few of these MARK ASSIST periods were so long (> 10ms) that they stopped without finishing. I believe this means that the corresponding goroutines were heavily "in debt".
  5. (not pictured) even after the GC concluded, it still depended on other goroutines' assistance in SWEEPing for 10-20us at a time.

I've included the trace here in case anyone else wants to take a look.

This all looks pretty bad, especially compared to clusters where I was seeing normal latencies. In those traces, concurrent GC runs usually lasted around 5-10ms, had somewhat less assistance (MARK ASSIST) from other goroutines, and had stop-the-world phases that lasted about half as long.
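(For anyone who wants to look at this themselves: a trace like the ones above can be grabbed from a node's Go pprof endpoint and opened with go tool trace. The port and path below are assumptions about how the debug endpoints are exposed on a given deployment.)

curl -o trace.out 'http://localhost:8080/debug/pprof/trace?seconds=5'
go tool trace trace.out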

I ran a few experiments where I played around with the GOGC environment variable (see here and here). I tripled its value to 300% on the same cluster and saw an interesting trend. I tripled it again to 900% and, sure enough, latencies began fluctuating. For a few seconds at a time, they would hold in the hundred-millisecond range, which is where we see them on a healthy cluster. Then a GC would kick in and they would jump up to the multi-second latencies I've been seeing here.

[screenshot: GOGC=100, GOGC=300, GOGC=900]

[screenshot: GOGC=300, GOGC=900 (zoomed in)]

[screenshot: GOGC=100, GOGC=300, GOGC=900]

From this experiment and from the profiles I posted above, I'm reasonably confident that the perf degradation is due to an increased garbage collection cost as ranges split and create more pointer-happy objects which stick around in memory. I think it's time to begin an allocation and pointer hunt.

The one thing that doesn't add up from the trace is that there seem to be a few processors (14 & 15) that remain idle for a few milliseconds during the concurrent mark phase. That doesn't make a lot of sense to me, as I'd expect processors to remain busy even if they are spending most of their time assisting the GC. Perhaps this is a side effect of the drop in network events during the GC period. I'd like to find out more about this.

@spencerkimball
Member

spencerkimball commented Sep 21, 2018 via email

@tbg
Member

tbg commented Sep 21, 2018

but I’m going to need fast review support because when it sits for a couple of weeks, it rots very quickly. There’s a lot going on in core.

I'd hold off on that kind of change until we're reasonably sure we're not cherry-picking anything urgent onto release-2.1, i.e. at least until Oct 10th.

@nvanbenschoten
Member Author

What’s the workload here though? Is this a uniform distribution over all
replicas, so the dehydration change will not necessarily help much?

This is running TPC-C 1k on a long-running three-node cluster that can handle up to TPC-C 1500. TPC-C's load distribution isn't perfectly uniform. For instance, ~72% of replicas are quiesced at any given time. However, it's hard to tell how much churn there is in this quiesced replica set.

@ridwanmsharif

I think some behavior I've run into is related to this. Here's what I see on a kv workload without splits:

[screenshots: QPS and disk write ops, no splits]

and here's what I see with 30 pre splits on the same workload with the same configuration:
[screenshots: QPS and disk write ops, 30 pre-splits]

You'll see that the QPS is significantly lower with splits and the Disk Write Ops is much higher.
I don't know what to make of this yet, but the problem can be seen much more clearly with load-based splitting, as seen in #31819.

@nvanbenschoten
Member Author

I took another look at this to try to see if past explorations missed anything obvious. In doing so, I found some interesting behavior.

The effect can be seen by comparing the following two workloads:

workload run kv --init --read-percent=0 --splits=0

and

workload run kv --init --read-percent=0 --splits=3

The latter workload results in about 20% lower throughput. This lines up with what @ridwanmsharif observed above. Interestingly, this perf difference disappears when I set kv.raft_log.synchronize = false. In fact, it actually inverts, which is what we would expect with a write-heavy workload as the number of Ranges increases. This indicates that the issue has to do with synchronizing the RocksDB WAL when concurrently appending to multiple Raft logs.

We can see from a custom graph of raft.process.logcommit.latency-p50 that the splits have a direct impact on the median latency of these WAL syncs:

[screenshot: raft.process.logcommit.latency-p50 around the splits]

Before the splits, the median log sync latency is around 230us. After 3 splits, it jumps to around 430us. This is very reproducible. My theory for what's happening is similar to what is discussed above. Log syncs are implemented using a single goroutine that loops around and continuously calls FlushWAL. Before the splits, the single range never needs to wait on an ongoing sync, so its sync is always kicked off immediately. After the splits, each range has to wait for previous syncs to complete before its own sync can start. We would expect the average wait time to be 230/2 = 115us. This alone comes close to explaining the increase in sync latency.
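Here's a minimal, runnable sketch of that single-goroutine sync loop shape (hypothetical code, with a sleep standing in for FlushWAL) that shows why a sync request arriving while another sync is in flight has to wait, on average, roughly half a sync duration before its own sync even starts:

package main

import (
	"fmt"
	"sync"
	"time"
)

type syncer struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending []chan struct{}
}

func newSyncer() *syncer {
	s := &syncer{}
	s.cond = sync.NewCond(&s.mu)
	go s.syncLoop()
	return s
}

// requestSync registers interest in the next WAL sync and blocks until that
// sync has completed.
func (s *syncer) requestSync() {
	done := make(chan struct{})
	s.mu.Lock()
	s.pending = append(s.pending, done)
	s.cond.Signal()
	s.mu.Unlock()
	<-done
}

// syncLoop is the single goroutine that performs syncs. Requests that arrive
// while a sync is in flight wait for the next iteration.
func (s *syncer) syncLoop() {
	s.mu.Lock()
	for {
		for len(s.pending) == 0 {
			s.cond.Wait()
		}
		waiters := s.pending
		s.pending = nil
		s.mu.Unlock()

		time.Sleep(230 * time.Microsecond) // stand-in for FlushWAL(true /* sync */)

		for _, done := range waiters {
			close(done)
		}
		s.mu.Lock()
	}
}

func main() {
	s := newSyncer()
	start := time.Now()
	var wg sync.WaitGroup
	for r := 0; r < 4; r++ { // four "ranges" issuing uncoordinated syncs
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				s.requestSync()
			}
		}()
	}
	wg.Wait()
	fmt.Println("elapsed for 4x100 syncs:", time.Since(start))
}

With a single "range" the loop is always idle when a request arrives; with several, most requests land mid-sync and queue for the next iteration, which matches the 230us to 430us jump in median latency.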

To gain more confidence in this theory, I added a 115us sleep into the sync loop. This resulted in almost exactly a 20% throughput reduction in the no-split scenario, which lines up with the range-split scenario.

I'm thinking through how to address this. One idea I had was to run multiple syncLoops, but that didn't appear to have any effect. We're going to need to gain a better understanding of the parallelism permitted by FlushWAL to determine how we can improve this situation. It's also possible that this isn't something that we can easily fix and that the slowdown from a single Range to a few Ranges is simply something we need to accept.

However, this effect doesn't explain the continued perf degradation we saw on long-running clusters. Even if each log sync has to wait for one full in-flight FlushWAL iteration before it can be issued, this latency shouldn't continue to scale indefinitely. That issue is more worrying to me and requires further exploration.

@petermattis
Collaborator

Interesting. There is no concurrency in FlushWAL, as the WAL is a single file, rotated every time the memtable is flushed. I'm failing to imagine what concurrency would look like there. What we do achieve is that we don't call FlushWAL for every operation that needs to be synced; instead, we sync operations in groups.

FYI, I'm seeing throughput increase for --splits=3 vs --splits=0 on a local 1 node cluster:

~ workload run kv --init --read-percent=0 --splits=0 --duration=10s
...
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0         116940        11693.7      1.4      1.4      1.8      2.2     14.2
~ workload run kv --init --read-percent=0 --splits=3 --duration=10s
...
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0         135885        13588.2      1.2      1.2      1.6      2.2     25.2

Are you running on GCE machines? Are you running a single-node cluster or a multi-node cluster?

@petermattis
Collaborator

Interestingly, the results are different on a single-node GCE cluster:

roachprod ssh peter-test:2 -- ./workload run kv --init --read-percent=0 --splits=0 --duration=10s {pgurl:1}
...
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0          68686         6868.2      2.3      2.2      3.7      5.0     17.8
roachprod ssh peter-test:2 -- ./workload run kv --init --read-percent=0 --splits=3 --duration=10s {pgurl:1}
...
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0          55802         5580.0      2.9      2.5      6.3      7.9     13.1

This is probably due to the slower syncs on GCE machines.

@nvanbenschoten
Member Author

I was running this on my MacBook. It's interesting that you didn't see a degradation on yours but did on a GCE VM. You're probably right that it's due to slower syncs.

There is no concurrency in FlushWAL as the WAL is a single file, rotated every time the memtable is flushed.

The only degree of concurrency I see is that we could pipeline FlushWAL(false /* sync */) and SyncWAL() operations. I'm not sure that would get us very much though. I don't have a great feel for the relative costs of flushing the memtable vs. syncing the WAL file, but I would guess that the latter dominates the cost of syncLoop iterations.

@petermattis
Collaborator

FlushWAL does not touch the memtable. It takes data that is buffered in user space and writes it to the WAL file. If sync is true, it then calls fsync (or fdatasync) on the file.

Nothing obvious is coming to mind about what can be done about this. Note that cockroach debug syncbench also shows the problem:

pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 1 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s     1575.1      0.6      0.5      1.6      2.6      7.1
pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 2 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s     1764.0      0.7      0.9      3.7      5.2      5.8
pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 4 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s     3498.1      1.3      0.9      3.3      5.2      5.8
pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 8 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s     6853.8      2.6      0.9      3.1      4.5      6.6
pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 16 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s    11214.2      4.3      1.0      3.9      6.3     11.0

Notice how performance doesn't increase much when going from 1 to 2 concurrent writers, but performance almost doubles when going from 2 to 4 and from 4 to 8.

@petermattis
Collaborator

Thinking about this more, it is curious that performance decreases when moving from 1 to 4 ranges. Yes, each sync commit will have a greater likelihood of having to wait for a previous sync to finish, but we should be seeing as much or greater parallelism in the 4-range case. One difference is that there is only a single Raft processing goroutine in the 1-range case. It is surprising that this makes a difference, but perhaps something can be learned there.

@nvanbenschoten
Member Author

FlushWAL does not touch the memtable. It takes data that is buffered in user space and writes it to the WAL file. If sync is true, it then calls fsync (or fdatasync) on the file.

I was misreading that. The idea still holds though - we could pipeline the buffer flush and the file sync.
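A toy sketch of that pipelining idea (hypothetical timings and function names, nothing RocksDB-specific): overlap writing the next group's buffered data to the WAL file with the previous group's fsync.

package main

import (
	"fmt"
	"sync"
	"time"
)

// flushToFile stands in for FlushWAL(false /* sync */): write buffered data to the WAL file.
func flushToFile(group int) { time.Sleep(50 * time.Microsecond) }

// syncFile stands in for SyncWAL(): fsync the WAL file.
func syncFile(group int) { time.Sleep(200 * time.Microsecond) }

func main() {
	const groups = 20
	start := time.Now()

	var prevSync sync.WaitGroup
	for g := 0; g < groups; g++ {
		flushToFile(g)  // stage 1 for group g
		prevSync.Wait() // group g's fsync must wait for group g-1's fsync
		prevSync.Add(1)
		go func(g int) { // stage 2 overlaps with the next group's stage 1
			defer prevSync.Done()
			syncFile(g)
		}(g)
	}
	prevSync.Wait()
	fmt.Println("pipelined elapsed:", time.Since(start),
		"(sequential would be ~", groups*250*time.Microsecond, ")")
}

When the flush is much cheaper than the fsync, the overlap only hides the flush time per iteration, which is consistent with the suspicion that this wouldn't buy very much.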

it is curious that performance decreases when moving from 1 to 4 ranges.

Very curious. We do eventually achieve linear scaling though, which indicates that the degradation as ranges continue to split is due to some other effect.

@petermattis
Collaborator

As mentioned in person, I'll pull on this thread some more to see if anything unravels.

@petermattis
Collaborator

It is unsurprising that the number of WAL syncs increases as we move from 1 range to 2 ranges. We now have 2 Raft groups and we need to sync the log commits. But look how the number of syncs increases as we further increase the number of ranges:

ranges    syncs
     1    26970
     2    38950
     4    43330
     8    41820
    16    39290
    32    40210

This is from a 10s kv0 run (workload run kv --init --read-percent=0 --duration=10s) using some simple instrumentation.

@nvanbenschoten
Member Author

So that lines up with the perf dropoff we see, right? The first few ranges don't result in any WAL syncs being batched together, so they hurt performance. Above 4 ranges, we start seeing additional syncs coalesce with existing ones, so they no longer hurt performance, and the additional concurrency they provide begins to dominate and improve overall performance.

@petermattis
Collaborator

So that lines up with perf dropoff we see, right?

Yep. I've been doing a bit of experimentation here, but I'm not seeing anything that can help. If we delay the WAL syncs slightly to allow them to batch, I can decrease the number of syncs for the 2 and 4 range scenarios, but doing so only hurts performance.

@awoods187
Contributor

I don't know if this is just noise, but check this out on EBS vs. SSD.

EBS:

warehouses   tpmC     efficiency  avg latency  p50 latency  p90 latency  p95 latency  p99 latency  pMax latency
2150         21171    77%         4806.2       3758.1       7784.6       12884.9      30064.8      103079.2
2150         21250.3  77%         4908.6       3623.9       8321.5       12348        30064.8      103079.2
2150         19611.8  71%         6650.2       5637.1       10737.4      16106.1      36507.2      103079.2
2150         18841.5  68%         7760.5       7247.8       11811.2      16643        40802.2      103079.2
2150         18571.2  67%         8035.4       7516.2       12348        18253.6      40802.2      103079.2

vs. SSD:

warehouses   tpmC     efficiency  avg latency  p50 latency  p90 latency  p95 latency  p99 latency  pMax latency
             24081.2  87%         3262         2684.4       6442.5       7516.2       10200.5      49392.1
             24900.3  90%         2528.1       2013.3       5368.7       6174         8053.1       40802.2
             22502.8  81%         4837.7       4026.5       8321.5       10200.5      19327.4      103079.2
             22658.3  82%         4745.9       3892.3       8321.5       10200.5      20401.1      103079.2
             22602.3  82%         4621.2       3758.1       8053.1       10200.5      20401.1      103079.2

Both have the drop-off, but it seems to be worse on EBS (and it is continually growing worse).
