
perf: command commit latency is highly correlated with range count #30213

Closed
nvanbenschoten opened this issue Sep 13, 2018 · 34 comments
Labels: A-storage, C-performance, X-stale

Comments

@nvanbenschoten
Member

On a cluster running TPC-C for a few days, I've noticed that the p99 command commit latency and the p99 log commit latency are both slowly growing. This growth seems to be highly correlated with the range count in the cluster.

[screenshots: custom charts from the debug console]

Interestingly, TPC-C has a fixed amount of load, so it would appear that the range count itself is the only moving variable here. More ranges but a fixed amount of load would result in less batching of RocksDB writes because fewer writes would take place in the same Raft groups. However, our RocksDB commit pipeline attempts to transparently batch independent writes together, so this should help avoid this kind of issue:

// makeBatchGroup adds the incoming batch r to the pending group and reports
// (via leader) whether this batch should perform the commit for the group.
var leader bool
c.pending, c.groupSize, leader = makeBatchGroup(c.pending, r, c.groupSize, maxBatchGroupSize)

I'd like to instrument this pipeline and see if there are any inefficiencies in it. Specifically, I'd like to check whether the pipeline remains full as the number of batches that it attempts to batch together grows. For instance, it may be the case that the write batch merging begins to take longer than the RocksDB writes themselves. This would allow for gaps in the pipeline where the RocksDB syncLoop remains idle.
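To make the shape of that pipeline concrete, here's a minimal, self-contained sketch of the leader/follower group-commit pattern it uses (hypothetical names and structure, not the actual implementation): the first batch to arrive becomes the leader, merges everything queued behind it, and performs a single engine write on behalf of the whole group.

package main

import (
	"fmt"
	"sync"
)

type batch struct {
	data []byte
	err  error
	done chan struct{} // closed once a leader has committed this batch
}

type committer struct {
	mu      sync.Mutex
	pending []*batch
	busy    bool // true while some goroutine is acting as the commit leader
}

// commit enqueues b and either leads the group commit itself or waits for the
// current leader to commit it as part of a group.
func (c *committer) commit(b *batch, write func([][]byte) error) error {
	b.done = make(chan struct{})

	c.mu.Lock()
	c.pending = append(c.pending, b)
	if c.busy {
		c.mu.Unlock()
		<-b.done // a leader will commit this batch in one of its rounds
		return b.err
	}
	c.busy = true
	for {
		group := c.pending
		c.pending = nil
		c.mu.Unlock()

		// Merge the whole group into a single engine write (one WAL sync).
		var merged [][]byte
		for _, gb := range group {
			merged = append(merged, gb.data)
		}
		err := write(merged)
		for _, gb := range group {
			gb.err = err
			close(gb.done)
		}

		c.mu.Lock()
		if len(c.pending) == 0 {
			c.busy = false
			c.mu.Unlock()
			return b.err
		}
		// More batches queued up while we were writing; lead another round.
	}
}

func main() {
	c := &committer{}
	write := func(group [][]byte) error {
		fmt.Printf("committed %d batch(es) in one write\n", len(group))
		return nil
	}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = c.commit(&batch{data: []byte{byte(i)}}, write)
		}(i)
	}
	wg.Wait()
}

In a sketch like this, the win comes from turning N queued batches into one engine write (and one WAL sync); the question above is whether the real pipeline keeps doing that efficiently as the group size grows.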

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. labels Sep 13, 2018
@nvanbenschoten nvanbenschoten added this to the 2.1 milestone Sep 13, 2018
@nvanbenschoten nvanbenschoten self-assigned this Sep 13, 2018
@nvanbenschoten
Member Author

cc. @petermattis @a-robinson in case either has any immediate intuition about this.

@petermattis
Collaborator

petermattis commented Sep 13, 2018 via email

@nvanbenschoten
Member Author

The increasing range count also correlates with increasing on disk data.

Does it? Sure, we'll split ranges as the amount of on disk data grows, but ranges themselves add only a negligible amount of on disk data, and we see commit latency grow dramatically at the same time that the range count jumps. If this were just about on disk data size, I'd expect the command commit latency to increase more gradually, without any jumps.

Your suggestion about extra splitting to test this out is a good idea.

@petermattis
Collaborator

Sure, we'll split ranges as the amount of on disk data grows, but ranges themselves add only a negligible amount of on disk data, and we see commit latency grow dramatically at the same time that the range count jumps.

Ah, that is interesting. I didn't see that these correlated so precisely before (I was looking at the graphs on my phone). I have no explanation for that.

However, our RocksDB commit pipeline attempts to transparently batch independent writes together, so this should help avoid this kind of issue.

Btw, this batching really shouldn't be necessary as RocksDB performs similar batching internally. But as we saw recently (when it was accidentally broken) it provides a significant performance improvement. I mentioned that we're doing this batching to Andrew on the RocksDB team and he agreed it is curious that it provides a benefit. When I investigated this in the past, I could see that the RocksDB batching was very rarely grouping more than one batch together, while our batching was doing so very frequently. I never figured out what was going wrong on the RocksDB side. A bug in RocksDB? An interaction with cgo? There might be a performance win from figuring this out.

@nvanbenschoten
Member Author

The trend has continued:

[screenshots: custom charts from the debug console, 2018-09-14]

Btw, this batching really shouldn't be necessary as RocksDB performs similar batching internally. But as we saw recently (when it was accidentally broken) it provides a significant performance improvement. I mentioned that we're doing this batching to Andrew on the RocksDB team and he agreed it is curious that it provides a benefit. When I investigated this in the past, I could see that the RocksDB batching was very rarely grouping more than one batch together, while our batching was doing so very frequently. I never figured out what was going wrong on the RocksDB side. A bug in RocksDB? An interaction with cgo? There might be a performance win from figuring this out.

Mind pointing me at the prior investigation and also at what I should be looking at in RocksDB? I'd like to dig in and make sure we're not leaving any perf on the table.

@petermattis
Collaborator

Mind pointing me at the prior investigation and also at what I should be looking at in RocksDB? I'd like to dig in and make sure we're not leaving any perf on the table.

See #14138.

@nvanbenschoten
Member Author

I added some instrumentation into our RocksDB commit pipeline (in Go) and measured the different phases of the pipeline to get a feel for where time was being spent:

  • "wait latency" = the time that batches spend waiting to start committing. Between here and here
  • "bundle size" = the number of bundled write batches
  • "bundle latency" = the time a leader batch spends bundling all batches together. Between here and here
  • "commit latency" = the time a leader spends in commitInternal
  • "broadcast latency" = the time a leader spends broadcasting to previous waiters that they can proceed and to others in its group that they are done (if they aren't waiting for a sync). Between here and here
  • "sync latency" = the duration of a single syncLoop iteration

Nothing really stood out. I then took all of the nodes offline and performed compactions on them. When I brought them back online, I saw that log commit latencies were a lot lower. However, this speedup went away over the course of a day.

[screenshot: p99 wait latency, bundle size, bundle latency, commit latency, broadcast latency, sync latency (some data is missing from before I added in the metrics)]

[screenshot: p99 log commit latency and command commit latency over the same period]

Current Values

                p50      p99
wait lat        3.2us    2.5ms
bundle size     3        86
bundle lat      3.1us    67us
commit lat      56.3us   917us
broadcast lat   6.4us    54us
sync lat        3.2ms    61ms

Put together, this doesn't really tell us much about where the issue is, but it does disprove a number of theories I had. For instance, it's pretty clear that the cost of bundling batches isn't an issue. I think I need to go deeper and figure out why we're seeing such long tail latencies in the RocksDB sync loop.

@nvanbenschoten
Member Author

Divide everything in my previous comment by a factor of 3. I was using the custom chart page with a sum aggregator and without the graphs in "per node" mode, so the values were summed across the three nodes.

@nvanbenschoten
Member Author

Here's an interesting chart: GC pause time vs. replica count.

[screenshot: GC pause time vs. replica count]

I saw a few gcAssist calls in a CPU profile and found that GC-related activity accounts for 9.65% of CPU utilization on these nodes. It may just be because #27815 is fresh in my mind, but I'm getting suspicious that this is GC related. Specifically, I think the replica growth is resulting in more objects sitting in memory and slowing down the GC, which in turn is slowing down everything else. This is corroborated by inuse object heap profiles:

[screenshot: inuse objects heap profile]

We can see that Replica-related memory is responsible for over a quarter of inuse objects. @spencerkimball has observed issues like this in his extreme replica count testing. It would be interesting to test out his "replica dehydration" wip branch on this cluster and see if it has any effect.

To start with, I'm going to pick at the low-hanging fruit here. For instance, it looks like all of the pointers in storagebase.ReplicaState can be replaced with values. These are the cause of all the persistent allocations in StateLoader.Load, so this should be a quick 5% reduction in the work the GC needs to do. @tschottdorf do you know why that struct was so pointer-happy to begin with? Was it just to make the assertions in handleReplicatedEvalResult easier to work with?
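To illustrate why the pointer count matters, here's a toy benchmark (not CockroachDB code; the struct shapes and counts are made up) showing that a forced GC cycle takes much longer when the live set is full of pointers the collector has to chase:

package main

import (
	"fmt"
	"runtime"
	"time"
)

type inner struct{ a, b, c int64 }

// pointerHeavy mimics a ReplicaState-like struct where every field is a pointer.
type pointerHeavy struct{ x, y, z, w *inner }

// flat holds the same data by value, so it contributes no pointers for the GC to trace.
type flat struct{ x, y, z, w inner }

// forcedGC returns the wall time of a forced GC cycle (a rough proxy for mark work).
func forcedGC() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	const n = 200000 // roughly "tens of thousands of replicas" scale

	heavy := make([]*pointerHeavy, n)
	for i := range heavy {
		heavy[i] = &pointerHeavy{x: &inner{}, y: &inner{}, z: &inner{}, w: &inner{}}
	}
	fmt.Println("GC with pointer-heavy live set:", forcedGC())

	heavy = nil // drop the pointer-heavy objects
	flatSet := make([]flat, n)
	fmt.Println("GC with flat live set:         ", forcedGC())
	runtime.KeepAlive(flatSet)
}

The absolute numbers will vary by machine, but the first GC has roughly a million heap pointers to follow while the second has essentially none, which is the same shape of difference as replacing the ReplicaState pointer fields with values.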

@tbg
Member

tbg commented Sep 20, 2018

@tschottdorf do you know why that struct was so pointer-happy to begin with? Was it just to make the assertions in handleReplicatedEvalResult easier to work with?

Yes, I think that's basically the reason.

@nvanbenschoten
Member Author

It turns out that most of these pointers were introduced in #18689, which made proto fields nullable so that they would be omitted from the encoded proto. It's really unfortunate that these two concepts are tied together so tightly.

In storagebase.ReplicaState alone there are 8 pointers that we would be able to remove if it weren't for this. These pointers are almost certainly behaving poorly with the GC, which has to do work proportional to the number of live pointers every time it runs. We can easily have tens of thousands of replicas, which means that storagebase.ReplicaState could be responsible for about a quarter million pointers that the GC needs to follow. To make matters worse, there are plenty of other objects that hang off of Replica. I'd estimate that on this 20k-range cluster, the Replica struct alone is responsible for about 1 million pointers. Even if each pointer only takes a nanosecond to process, we can see how GC pauses grow to multi-millisecond chunks of time.

I'm going to do some digging with gctrace, but what I'd really like is a profile of the GC that maps mark/scan time back to the live object hierarchy.
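(For reference, gctrace output goes to stderr, so something like the following captures it; the start flags here are just an example, not the cluster's actual invocation:)

GODEBUG=gctrace=1 ./cockroach start --insecure --store=/mnt/data1 2> gctrace.log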

@petermattis
Collaborator

Is the lifetime of storagebase.ReplicaState known? Pool allocation removes the allocation overhead and also reduces the number of objects that need to be scanned.
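For reference, the pooling being suggested would look roughly like this sketch (the type here is a stand-in, not the real storagebase.ReplicaState); it only pays off if the object has a bounded lifetime and is actually returned to the pool:

package main

import "sync"

// ReplicaState is a stand-in for storagebase.ReplicaState.
type ReplicaState struct {
	RaftAppliedIndex  uint64
	LeaseAppliedIndex uint64
}

var statePool = sync.Pool{
	New: func() interface{} { return new(ReplicaState) },
}

// loadState grabs a recycled object instead of allocating a fresh one.
func loadState() *ReplicaState {
	s := statePool.Get().(*ReplicaState)
	// ... populate s ...
	return s
}

// releaseState clears the object and hands it back for reuse.
func releaseState(s *ReplicaState) {
	*s = ReplicaState{}
	statePool.Put(s)
}

func main() {
	s := loadState()
	releaseState(s)
}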

@nvanbenschoten
Member Author

Well, it's never deallocated, it just lives on the Replica for its entire lifetime. You're saying that if we allocate it from a pool it won't be scanned at all?

@petermattis
Collaborator

Oh, I thought the issue with ReplicaState was its use in commands. You're thinking it is its use in Replica.mu.state?

@petermattis
Collaborator

I wonder if the pointers in ReplicatedEvalResult are a problem. Looks like we allocate a RaftCommand for every command that is applied. Isn't that a lot of garbage containing a lot of pointers?

@a-robinson
Contributor

It's not one of the bigger offenders, but newReplicaStats could reasonably return an object on the stack rather than the heap.

Also, one of the two replicaStats objects on each replica could be trivially removed -- we aren't using keys-written-per-second for anything in v2.1.
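As an aside, the stack-vs-heap difference being suggested is easy to see with the compiler's escape analysis (go build -gcflags='-m'); this is a sketch with made-up fields, not the real replicaStats:

package main

import (
	"fmt"
	"time"
)

// replicaStats is a stand-in with made-up fields.
type replicaStats struct {
	lastRotate time.Time
	records    [6]float64
}

// Returning by value lets the result live on the caller's stack
// (or inline in an enclosing struct).
func newReplicaStatsValue(now time.Time) replicaStats {
	return replicaStats{lastRotate: now}
}

// Returning a pointer to a local typically forces a heap allocation
// (it escapes), unless inlining lets the compiler prove otherwise.
func newReplicaStatsPtr(now time.Time) *replicaStats {
	return &replicaStats{lastRotate: now}
}

func main() {
	now := time.Now()
	v := newReplicaStatsValue(now)
	p := newReplicaStatsPtr(now)
	fmt.Println(v.lastRotate, p.lastRotate)
}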

@a-robinson
Contributor

It's not one of the bigger offenders, but newReplicaStats could reasonably return an object on the stack rather than the heap.

For what it's worth, I just tried this out and performance got 1.3% worse on kv95 both with and without pre-splitting.

@nvanbenschoten
Member Author

I spent some time with this cluster in the Go execution tracer and I've convinced myself that we are seeing a GC-related slowdown.

Here's a representative trace that includes a GC event:

[screenshot: execution trace including a GC event]

We can see a number of interesting things from this screenshot:

  1. the GC ran for about 220ms
  2. while running concurrently, the GC used 4 dedicated processors out of the 16 total
  3. there were dramatically fewer network events processed while the GC was running, even concurrently
  4. the number of runnable goroutines increased while the GC was running concurrently

Let's zoom in a little bit to get more info:

[screenshot: execution trace, zoomed in]

More notes:

  1. the initial stop-the-world sweep termination phase of the GC lasted 209us
  2. the concluding stop-the-world mark termination phase of the GC lasted 334us
  3. even with 4 dedicated processors, the GC relied heavily on Mark Assistance by all other processors while running concurrently. In fact, during this "concurrent" GC period, goroutines are spending more time assisting the GC than doing actual work.
  4. a few of these MARK ASSIST periods were so long (> 10ms) that they stopped without finishing. I believe this means that the corresponding goroutines were heavily "in debt".
  5. (not pictured) even after the GC concluded, it still depended on other goroutines' assistance in SWEEPing for 10-20us at a time.

I've included the trace here in case anyone else wants to take a look.

This all looks pretty bad, especially compared to clusters where I was seeing normal latencies. In those traces, concurrent GC runs usually lasted around 5-10ms, had somewhat less assistance (MARK ASSIST) from other goroutines, and had stop-the-world phases that lasted about half as long.
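(For anyone who wants to look at this themselves: a trace like the ones above can be grabbed from a node's Go pprof endpoint and opened with go tool trace. The port and path below are assumptions about how the debug endpoints are exposed on a given deployment.)

curl -o trace.out 'http://localhost:8080/debug/pprof/trace?seconds=5'
go tool trace trace.out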

I ran a few experiments where I played around with the GOGC environment variable (see here and here). I tripled its value to 300% on the same cluster and saw an interesting trend. I tripled it again to 900% and, sure enough, latencies began fluctuating. For a few seconds at a time, they would hold in the hundred-millisecond range, which is where we see them on a healthy cluster. Then a GC would kick in and they would jump up to the multi-second latencies I've been seeing here.

[screenshot: GOGC=100, GOGC=300, GOGC=900]

[screenshot: GOGC=300, GOGC=900 (zoomed in)]

[screenshot: GOGC=100, GOGC=300, GOGC=900]

From this experiment and from the profiles I posted above, I'm reasonably confident that the perf degradation is due to an increased garbage collection cost as ranges split and create more pointer-happy objects which stick around in memory. I think it's time to begin an allocation and pointer hunt.

The one thing that doesn't add up from the trace is that there seem to be a few processors (14 & 15) that remain idle for a few milliseconds during the concurrent mark phase. That doesn't make a lot of sense to me, as I'd expect processors to remain busy even if they are spending most of their time assisting the GC. Perhaps this is a side effect of the drop in network events during the GC period. I'd like to find out more about this.

@spencerkimball
Member

spencerkimball commented Sep 21, 2018 via email

@tbg
Member

tbg commented Sep 21, 2018

but I’m going to need fast review support because when it sits for a couple of weeks, it rots very quickly. There’s a lot going on in core.

I'd hold off on that kind of change until we're reasonably sure we're not cherry-picking anything urgent onto release-2.1, i.e. at least until Oct 10th.

@nvanbenschoten
Member Author

What’s the workload here though? Is this a uniform distribution over all
replicas, so the dehydration change will not necessarily help much?

This is running TPC-C 1k on a long-running three-node cluster that can handle up to TPC-C 1500. TPC-C's load distribution isn't perfectly uniform. For instance, ~72% of replicas are quiesced at any given time. However, it's hard to tell how much churn there is in this quiesced replica set.

@ridwanmsharif

I think some behavior I've run into is related to this. Here's what I see on a kv workload without splits:

[screenshots: QPS and disk write ops, no splits]

and here's what I see with 30 pre splits on the same workload with the same configuration:
[screenshots: QPS and disk write ops, 30 pre-splits]

You'll see that the QPS is significantly lower with splits and the Disk Write Ops is much higher.
I don't know what to make of this yet, but the problem can be seen much more clearly with load-based splitting, as seen in #31819.

@nvanbenschoten
Member Author

I took another look at this to try to see if past explorations missed anything obvious. In doing so, I found some interesting behavior.

The effect can be seen by comparing the following two workloads:

workload run kv --init --read-percent=0 --splits=0

and

workload run kv --init --read-percent=0 --splits=3

The latter workload results in about 20% lower throughput. This lines up with what @ridwanmsharif observed above. Interestingly, this perf difference disappears when I set kv.raft_log.synchronize = false. In fact, it actually inverts, which is what we would expect with a write-heavy workload as the number of Ranges increases. This indicates that the issue has to do with synchronizing the RocksDB WAL when concurrently appending to multiple Raft logs.

We can see from a custom graph of raft.process.logcommit.latency-p50 that the splits have a direct impact on the median latency of these WAL syncs:

[screenshot: raft.process.logcommit.latency-p50 around the splits]

Before the splits, the median log sync latency is around 230us. After 3 splits, it jumps to around 430us. This is very reproducible. My theory for what's happening is similar to what is discussed above. Log syncs are implemented using a single goroutine that loops around and continuously calls FlushWAL. Before the splits, the single range never needs to wait on an ongoing sync, so its sync is always kicked off immediately. After the splits, each range has to wait for previous syncs to complete before its own sync can start. We would expect the average wait time to be 230/2 = 115us. This alone comes close to explaining the increase in sync latency.
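Here's a minimal, runnable sketch of that single-goroutine sync loop shape (hypothetical code, with a sleep standing in for FlushWAL) that shows why a sync request arriving while another sync is in flight has to wait, on average, roughly half a sync duration before its own sync even starts:

package main

import (
	"fmt"
	"sync"
	"time"
)

type syncer struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending []chan struct{}
}

func newSyncer() *syncer {
	s := &syncer{}
	s.cond = sync.NewCond(&s.mu)
	go s.syncLoop()
	return s
}

// requestSync registers interest in the next WAL sync and blocks until that
// sync has completed.
func (s *syncer) requestSync() {
	done := make(chan struct{})
	s.mu.Lock()
	s.pending = append(s.pending, done)
	s.cond.Signal()
	s.mu.Unlock()
	<-done
}

// syncLoop is the single goroutine that performs syncs. Requests that arrive
// while a sync is in flight wait for the next iteration.
func (s *syncer) syncLoop() {
	s.mu.Lock()
	for {
		for len(s.pending) == 0 {
			s.cond.Wait()
		}
		waiters := s.pending
		s.pending = nil
		s.mu.Unlock()

		time.Sleep(230 * time.Microsecond) // stand-in for FlushWAL(true /* sync */)

		for _, done := range waiters {
			close(done)
		}
		s.mu.Lock()
	}
}

func main() {
	s := newSyncer()
	start := time.Now()
	var wg sync.WaitGroup
	for r := 0; r < 4; r++ { // four "ranges" issuing uncoordinated syncs
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				s.requestSync()
			}
		}()
	}
	wg.Wait()
	fmt.Println("elapsed for 4x100 syncs:", time.Since(start))
}

With a single "range" the loop is always idle when a request arrives; with several, most requests land mid-sync and queue for the next iteration, which matches the 230us to 430us jump in median latency.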

To gain more confidence in this theory, I added a 115us sleep into the sync loop. This resulted in almost exactly a 20% throughput reduction in the no-split scenario, which lines up with the range-split scenario.

I'm thinking through how to address this. One idea I had was to run multiple syncLoops, but that didn't appear to have any effect. We're going to need to gain a better understanding of the parallelism permitted by FlushWAL to determine how we can improve this situation. It's also possible that this isn't something that we can easily fix and that the slowdown from a single Range to a few Ranges is simply something we need to accept.

However, this effect doesn't explain the continued perf degradation we saw on long-running clusters. Even if each log sync has to wait for one full in-flight FlushWAL iteration before it can be issued, this latency shouldn't continue to scale indefinitely. That issue is more worrying to me and requires further exploration.

@petermattis
Collaborator

Interesting. There is no concurrency in FlushWAL, as the WAL is a single file, rotated every time the memtable is flushed. I'm failing to imagine what concurrency would look like there. What we do achieve is that we don't call FlushWAL for every operation that needs to be synced; instead, we sync operations in groups.

FYI, I'm seeing throughput increase for --splits=3 vs --splits=0 on a local 1 node cluster:

~ workload run kv --init --read-percent=0 --splits=0 --duration=10s
...
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0         116940        11693.7      1.4      1.4      1.8      2.2     14.2
~ workload run kv --init --read-percent=0 --splits=3 --duration=10s
...
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0         135885        13588.2      1.2      1.2      1.6      2.2     25.2

Are you running on GCE machines? Are you running a single-node cluster or a multi-node cluster?

@petermattis
Collaborator

Interestingly, the results are different on a single-node GCE cluster:

roachprod ssh peter-test:2 -- ./workload run kv --init --read-percent=0 --splits=0 --duration=10s {pgurl:1}
...
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0          68686         6868.2      2.3      2.2      3.7      5.0     17.8
roachprod ssh peter-test:2 -- ./workload run kv --init --read-percent=0 --splits=3 --duration=10s {pgurl:1}
...
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0          55802         5580.0      2.9      2.5      6.3      7.9     13.1

This is probably due to the slower syncs on GCE machines.

@nvanbenschoten
Member Author

I was running this on my MacBook. It's interesting that you didn't see a degradation on yours but did on a GCE VM. You're probably right that it's due to slower syncs.

There is no concurrency in FlushWAL as the WAL is a single file, rotated every time the memtable is flushed.

The only degree of concurrency I see is that we could pipeline FlushWAL(false /* sync */) and SyncWAL() operations. I'm not sure that would get us very much though. I don't have a great feel for the relative costs of flushing the memtable vs. syncing the WAL file, but I would guess that the latter dominates the cost of syncLoop iterations.

@petermattis
Collaborator

FlushWAL does not touch the memtable. It takes data that is buffered in user space and writes it to the WAL file. If sync is true, it then calls fsync (or fdatasync) on the file.

Nothing obvious is coming to mind about what can be done about this. Note that cockroach debug syncbench also shows the problem:

pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 1 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s     1575.1      0.6      0.5      1.6      2.6      7.1
pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 2 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s     1764.0      0.7      0.9      3.7      5.2      5.8
pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 4 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s     3498.1      1.3      0.9      3.3      5.2      5.8
pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 8 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s     6853.8      2.6      0.9      3.1      4.5      6.6
pmattis@peter-test-0001:~$ ./cockroach debug syncbench /mnt/data1/testdb -c 16 -d 10s --log-only=false
writing to /mnt/data1/testdb
_elapsed____ops/sec___mb/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
...
     10s    11214.2      4.3      1.0      3.9      6.3     11.0

Notice how performance doesn't increase much when going from 1 to 2 concurrent writers, but performance almost doubles when going from 2 to 4 and from 4 to 8.

@petermattis
Collaborator

Thinking about this more, it is curious that performance decreases when moving from 1 to 4 ranges. Yes, each sync commit will have a greater likelihood of having to wait for a previous sync to finish, but we should be seeing as much or greater parallelism in the 4-range case. One difference is that there is only a single Raft processing goroutine in the 1-range case. It is surprising that this makes a difference, but perhaps something can be learned there.

@nvanbenschoten
Member Author

FlushWAL does not touch the memtable. It takes data that is buffered in user space and writes it to the WAL file. If sync is true, it then calls fsync (or fdatasync) on the file.

I was misreading that. The idea still holds though - we could pipeline the buffer flush and the file sync.
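A toy sketch of that pipelining idea (hypothetical timings and function names, nothing RocksDB-specific): overlap writing the next group's buffered data to the WAL file with the previous group's fsync.

package main

import (
	"fmt"
	"sync"
	"time"
)

// flushToFile stands in for FlushWAL(false /* sync */): write buffered data to the WAL file.
func flushToFile(group int) { time.Sleep(50 * time.Microsecond) }

// syncFile stands in for SyncWAL(): fsync the WAL file.
func syncFile(group int) { time.Sleep(200 * time.Microsecond) }

func main() {
	const groups = 20
	start := time.Now()

	var prevSync sync.WaitGroup
	for g := 0; g < groups; g++ {
		flushToFile(g)  // stage 1 for group g
		prevSync.Wait() // group g's fsync must wait for group g-1's fsync
		prevSync.Add(1)
		go func(g int) { // stage 2 overlaps with the next group's stage 1
			defer prevSync.Done()
			syncFile(g)
		}(g)
	}
	prevSync.Wait()
	fmt.Println("pipelined elapsed:", time.Since(start),
		"(sequential would be ~", groups*250*time.Microsecond, ")")
}

When the flush is much cheaper than the fsync, the overlap only hides the flush time per iteration, which is consistent with the suspicion that this wouldn't buy very much.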

it is curious that performance decreases when moving from 1 to 4 ranges.

Very curious. We do eventually achieve linear scaling though, which indicates that the degradation as ranges continue to split is due to some other effect.

@petermattis
Collaborator

As mentioned in person, I'll pull on this thread some more to see if anything unravels.

@petermattis
Collaborator

It is unsurprising that the number of WAL syncs increases as we move from 1 range to 2 ranges. We now have 2 Raft groups and we need to sync the log commits. But look how the number of syncs increases as we further increase the number of ranges:

ranges    syncs
     1    26970
     2    38950
     4    43330
     8    41820
    16    39290
    32    40210

This is from a 10s kv0 run (workload run kv --init --read-percent=0 --duration=10s) using some simple instrumentation.

@nvanbenschoten
Member Author

So that lines up with the perf dropoff we see, right? The first few ranges don't result in any WAL syncs being batched together, so they hurt performance. Above 4 ranges, we start seeing additional syncs coalesce with existing ones, so they no longer hurt performance, and the additional concurrency they provide begins to dominate and improve overall performance.

@petermattis
Collaborator

So that lines up with perf dropoff we see, right?

Yep. I've been doing a bit of experimentation here, but I'm not seeing anything that can help. If we delay the WAL syncs slightly to allow them to batch, I can decrease the number of syncs for the 2 and 4 range scenarios, but doing so only hurts performance.

@awoods187
Contributor

I don't know if this is just noise, but check this out on EBS vs. SSD.

EBS:

warehouses   tpmC     efficiency  avg latency  p50 latency  p90 latency  p95 latency  p99 latency  pMax latency
2150         21171    77%         4806.2       3758.1       7784.6       12884.9      30064.8      103079.2
2150         21250.3  77%         4908.6       3623.9       8321.5       12348        30064.8      103079.2
2150         19611.8  71%         6650.2       5637.1       10737.4      16106.1      36507.2      103079.2
2150         18841.5  68%         7760.5       7247.8       11811.2      16643        40802.2      103079.2
2150         18571.2  67%         8035.4       7516.2       12348        18253.6      40802.2      103079.2

vs. SSD:

warehouses   tpmC     efficiency  avg latency  p50 latency  p90 latency  p95 latency  p99 latency  pMax latency
             24081.2  87%         3262         2684.4       6442.5       7516.2       10200.5      49392.1
             24900.3  90%         2528.1       2013.3       5368.7       6174         8053.1       40802.2
             22502.8  81%         4837.7       4026.5       8321.5       10200.5      19327.4      103079.2
             22658.3  82%         4745.9       3892.3       8321.5       10200.5      20401.1      103079.2
             22602.3  82%         4621.2       3758.1       8053.1       10200.5      20401.1      103079.2

Both have the drop-off, but it seems to be worse on EBS (and it is continually growing worse).
