perf: command commit latency is highly correlated with range count #30213
cc @petermattis @a-robinson in case either has any immediate intuition about this.
The increasing range count also correlates with increasing on-disk data. You could eliminate range count as a factor by running a short run, then splitting the tables into twice as many ranges and running again.
Does it? Sure, we'll split ranges as the amount of on-disk data grows, but ranges themselves add only a negligible amount of on-disk data, and we see commit latency grow dramatically at the same time that the range count jumps. If this were just about on-disk data size, I'd expect the command commit latency to increase more gradually, without any jumps. Your suggestion of extra splitting to test this out is a good idea.
Ah, that is interesting. I didn't see that these correlated so precisely before (I was looking at the graphs on my phone). I have no explanation for that.
By the way, this batching really shouldn't be necessary, as RocksDB performs similar batching internally. But as we saw recently (when it was accidentally broken), it provides a significant performance improvement. I mentioned to Andrew on the RocksDB team that we're doing this batching, and he agreed it is curious that it provides a benefit. When I investigated this in the past, I could see that the RocksDB batching was very rarely grouping more than one batch together, while our batching was doing so very frequently. I never figured out what was going wrong on the RocksDB side. A bug in RocksDB? An interaction with cgo? There might be a performance win from figuring this out.
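For anyone unfamiliar with the pattern under discussion, here's a minimal sketch of group-commit style batching (not the actual CockroachDB or RocksDB code; `pendingBatch` and `writeAndSync` are illustrative stand-ins): the first committer to arrive becomes the leader, merges whatever queued up behind it, and issues a single synced write for the whole group, amortizing the sync cost.

```go
package main

import "sync"

// pendingBatch is an illustrative stand-in for a write batch waiting to commit.
type pendingBatch struct {
	repr []byte        // serialized batch contents
	done chan struct{} // closed once the batch has been durably committed
}

// committer groups batches that arrive while a previous synced write is in
// flight and commits them with a single write+sync.
type committer struct {
	mu      sync.Mutex
	pending []*pendingBatch
	busy    bool // a leader is currently writing
}

func (c *committer) commit(b *pendingBatch) {
	c.mu.Lock()
	c.pending = append(c.pending, b)
	if c.busy {
		// Someone else is writing; they will pick up our batch.
		c.mu.Unlock()
		<-b.done
		return
	}
	// Become the leader for everything queued so far.
	c.busy = true
	for {
		group := c.pending
		c.pending = nil
		c.mu.Unlock()

		writeAndSync(group) // one synced engine write for the whole group
		for _, g := range group {
			close(g.done)
		}

		c.mu.Lock()
		if len(c.pending) == 0 {
			c.busy = false
			c.mu.Unlock()
			return
		}
		// More batches arrived while we were syncing; commit them too.
	}
}

// writeAndSync is a placeholder for merging the batches and issuing a single
// synced write to the storage engine.
func writeAndSync(group []*pendingBatch) {}

func main() {
	var c committer
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.commit(&pendingBatch{done: make(chan struct{})})
		}()
	}
	wg.Wait()
}
```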
The trend has continued:
Mind pointing me at the prior investigation, and also at what I should be looking at in RocksDB? I'd like to dig in and make sure we're not leaving any perf on the table.
See #14138.
I added some instrumentation to our RocksDB commit pipeline (on the Go side) and measured the different phases of the pipeline to get a feel for where time was being spent, but nothing really stood out. I then took all nodes offline and performed compactions on them. When I brought them back online, log commit latencies were a lot faster; however, this speedup went away over the course of a day.

[Charts: p99 wait latency, bundle size, bundle latency, commit latency, broadcast latency, sync latency (some data is missing from before I added the metrics); p99 log commit latency and command commit latency over the same period; current values.]
Put together, this doesn't really tell us much about where the issue is, but it does disprove a number of theories I had. For instance, it's pretty clear that the cost of bundling batches isn't an issue. I think I need to go deeper and figure out why we're seeing such long tail latencies in the RocksDB sync loop.
Divide everything in my previous comment by a factor of 3. I was using the custom chart page with a sum aggregator and without the graphs in "per node" mode.
Here's an interesting chart: GC pause time vs. replica count. I saw a few gcAssist calls in a CPU profile and found that GC-related activity is accounting for 9.65% of CPU utilization on these nodes. It may just be because #27815 is fresh in my mind, but I'm getting suspicious that this is GC related. Specifically, I think the replica growth is resulting in more objects sitting in memory and slowing down the GC, which in turn is slowing down everything else. This is corroborated by in-use object heap profiles: we can see that Replica-related memory is responsible for over a quarter of in-use objects. @spencerkimball has observed issues like this in his extreme replica count testing. It would be interesting to test out his "replica dehydration" WIP branch on this cluster and see if it has any effect. To start with, I'm going to pick at the low-hanging fruit here. For instance, it looks like all pointers in …
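As a toy illustration of the suspicion above (not CockroachDB code; the struct shapes and sizes are made up), long-lived objects full of pointer fields give the GC many more heap objects and pointers to mark than the same data stored inline:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// pointery mimics a descriptor whose fields are nullable pointers: marking
// each element forces the GC to chase three more heap pointers.
type pointery struct {
	start, end *string
	lease      *struct{ expiration int64 }
}

// flat holds the same data inline, so marking the parent object is enough.
type flat struct {
	start, end string
	lease      struct{ expiration int64 }
}

// timeGC runs a full collection and reports how long it took.
func timeGC() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	const n = 1_000_000

	ptrs := make([]*pointery, n)
	for i := range ptrs {
		s, e := "a", "b"
		ptrs[i] = &pointery{start: &s, end: &e, lease: &struct{ expiration int64 }{1}}
	}
	fmt.Println("GC with pointer-heavy objects:", timeGC())

	ptrs = nil
	runtime.GC() // drop the first set before measuring the second

	flats := make([]*flat, n)
	for i := range flats {
		flats[i] = &flat{start: "a", end: "b"}
	}
	fmt.Println("GC with flat objects:        ", timeGC())
	runtime.KeepAlive(flats)
}
```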
Yes, I think that's basically the reason.
It turns out that most of these pointers were introduced in #18689, which made proto fields nullable so that they would be omitted from the encoded proto. It's really unfortunate that these two concepts are tied together so tightly. In … I'm going to do some digging with …
Is the lifetime of …
Well, it's never deallocated, it just lives on the …
Oh, I thought the issue with …
I wonder if the pointers in …
It's not one of the bigger offenders, but … Also, one of the two …
For what it's worth, I just tried this out and performance got 1.3% worse on kv95 both with and without pre-splitting.
I spent some time with this cluster in the Go execution tracer and I've convinced myself that we are seeing a GC-related slowdown.

Here's a representative trace that includes a GC event. We can see a number of interesting things from this screenshot:

1. The GC ran for about 220ms.
2. While running concurrently, the GC used 4 dedicated processors out of the 16 total.
3. There were dramatically fewer network events processed while the GC was running, even concurrently.
4. The number of runnable goroutines increased while the GC was running concurrently.

Let's zoom in a little bit to get more info. More notes:

1. The initial stop-the-world sweep termination phase of the GC lasted 209us.
2. The concluding stop-the-world mark termination phase of the GC lasted 334us.
3. Even with 4 dedicated processors, the GC relied heavily on mark assistance by all other processors while running concurrently. In fact, during this "concurrent" GC period, goroutines are spending more time assisting the GC than doing actual work.
4. A few of these MARK ASSIST periods were so long (> 10ms) that they stopped without finishing. I believe this means that the corresponding goroutines were heavily "in debt".
5. (Not pictured) Even after the GC concluded, it still depended on other goroutines assisting in SWEEPing for 10-20us at a time.

I've included the trace here (https://github.com/cockroachdb/cockroach/files/2403826/trace.out.zip) in case anyone else wants to take a look.

This all looks pretty bad, especially compared to clusters where I was seeing normal latencies. In those traces, concurrent GC runs usually lasted around 5-10ms, had somewhat less assistance (MARK ASSIST) from other goroutines, and had stop-the-world phases that lasted about half as long.

I ran a few experiments where I played around with the GOGC environment variable (see https://golang.org/pkg/runtime/#hdr-Environment_Variables and https://golang.org/pkg/runtime/debug/#SetGCPercent). I tripled its value to 300% on the same cluster and saw an interesting trend. I tripled it again to 900% and, sure enough, latencies began fluctuating. For a few seconds, they would hold in the hundred-millisecond range, which is where we see them on a healthy cluster. Then a GC would kick in and they would jump up to the multi-second latencies I've been seeing here.

[Charts: GOGC=100, GOGC=300, GOGC=900; GOGC=300, GOGC=900 zoomed in; GOGC=100, GOGC=300, GOGC=900.]

From this experiment and from the profiles I posted above, I'm reasonably confident that the perf degradation is due to an increased garbage collection cost as ranges split and create more pointer-happy objects which stick around in memory. I think it's time to begin an allocation and pointer hunt.

The one thing that doesn't add up from the trace is that there seem to be a few processors (14 & 15) that remain idle for a few ms during the concurrent mark phase. That doesn't make a lot of sense to me, as I'd expect processors to remain busy even if they are spending most of their time assisting the GC. Perhaps this is a side-effect of the drop in network events during the GC period. I'd like to find out more about this.
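For reference, the knob being tuned in these experiments can be set either through the environment or at runtime via the standard library; a minimal example:

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

func main() {
	// GOGC=300 lets the heap grow to 3x the live set before the next GC
	// cycle (vs. 2x at the default of 100), trading memory for fewer
	// collections. It can be set in the environment before the process
	// starts...
	fmt.Println("GOGC env:", os.Getenv("GOGC"))

	// ...or adjusted at runtime; SetGCPercent returns the previous value.
	prev := debug.SetGCPercent(300)
	fmt.Printf("GC percent changed from %d to 300\n", prev)
}
```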
Some recent changes to replica metrics computation cause multiple arrays to be allocated for every range. That can't be helping either. I'm sending out a PR today to remove those. Also, I'm rehabilitating the replica dehydration change, but I'm going to need fast review support, because when it sits for a couple of weeks it rots very quickly. There's a lot going on in core.

What's the workload here, though? Is this a uniform distribution over all replicas, so the dehydration change will not necessarily help much?
I'd hold off on that kind of change until we're reasonably sure we're not cherry-picking anything urgent onto release-2.1, i.e. at least until Oct 10th.
This is running TPC-C 1k on a long-running three-node cluster that can handle up to TPC-C 1500. TPC-C's load distribution isn't perfectly uniform. For instance, ~72% of replicas are quiesced at any given time. However, it's hard to tell how much churn there is in this quiesced replica set.
I think some things I've run into have something to do with this. Comparing what I see on a kv workload without splits against the same workload with the same configuration and 30 pre-splits, you'll see that the QPS is significantly lower with splits and the disk write ops are much higher.
I took another look at this to try to see if past explorations missed anything obvious. In doing so, I found some interesting behavior. The effect can be seen by comparing the following two workloads:

…

and

…

The latter workload results in about 20% lower throughput. This lines up with what @ridwanmsharif observed above. Interestingly, this perf difference disappears when I set …

We can see from a custom graph of …

Before the splits, the median log sync latency is around …

To gain more confidence in this theory, I added a …

I'm thinking through how to address this. One idea I had was to run multiple …

However, this effect doesn't explain the continued perf degradation we saw on long-running clusters. Even if each log sync requires waiting for a whole …
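To make the single-sync-loop effect concrete, here's a rough sketch (not the real syncLoop; the names are made up) of the pattern being described: requests that arrive while a sync is in flight are coalesced into the next sync, so with only a handful of active ranges most syncs end up covering a single request and each one pays the full sync latency.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// syncRequest represents a log write waiting to be made durable.
type syncRequest struct {
	done chan struct{}
}

// syncer mimics a single sync loop: all requests that queue up while one
// fsync is in progress are satisfied together by the next fsync.
type syncer struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending []*syncRequest
}

func newSyncer() *syncer {
	s := &syncer{}
	s.cond = sync.NewCond(&s.mu)
	go s.loop()
	return s
}

func (s *syncer) enqueue(r *syncRequest) {
	s.mu.Lock()
	s.pending = append(s.pending, r)
	s.cond.Signal()
	s.mu.Unlock()
	<-r.done
}

func (s *syncer) loop() {
	for {
		s.mu.Lock()
		for len(s.pending) == 0 {
			s.cond.Wait()
		}
		batch := s.pending
		s.pending = nil
		s.mu.Unlock()

		time.Sleep(500 * time.Microsecond) // stand-in for the fsync
		for _, r := range batch {
			close(r.done)
		}
		fmt.Println("synced", len(batch), "requests with one fsync")
	}
}

func main() {
	s := newSyncer()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // 8 "ranges" committing concurrently
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.enqueue(&syncRequest{done: make(chan struct{})})
		}()
	}
	wg.Wait()
}
```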
Interesting. There is no concurrency in …

FYI, I'm seeing throughput increase for …

Are you running on GCE machines? Are you running a single-node cluster or a multi-node cluster?
Interestingly, the results are different on a single-node GCE cluster:

…

This is probably due to the slower syncs on GCE machines.
I was running this on my MacBook. It's interesting that you didn't see a degradation on yours but did on a GCE VM. You're probably right that it's due to slower syncs.

The only degree of concurrency I see is that we could pipeline …
Nothing obvious is coming to mind about what can be done about this. Note that …

Notice how performance doesn't increase much when going from 1 to 2 concurrent writers, but performance almost doubles when going from 2 to 4 and from 4 to 8.
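As a hedged sketch of how one might measure sync throughput at different concurrency levels (an assumption about the methodology, not the benchmark that produced the numbers under discussion): each writer appends to and syncs its own file, and total synced writes per second are compared across writer counts.

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"sync/atomic"
	"time"
)

// measure reports roughly how many synced 1KB appends per second `writers`
// concurrent goroutines achieve, each against its own file.
func measure(writers int, dur time.Duration) float64 {
	var ops int64
	var wg sync.WaitGroup
	deadline := time.Now().Add(dur)
	buf := make([]byte, 1024)

	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f, err := os.CreateTemp("", "synctest")
			if err != nil {
				panic(err)
			}
			defer os.Remove(f.Name())
			defer f.Close()
			for time.Now().Before(deadline) {
				if _, err := f.Write(buf); err != nil {
					panic(err)
				}
				if err := f.Sync(); err != nil {
					panic(err)
				}
				atomic.AddInt64(&ops, 1)
			}
		}()
	}
	wg.Wait()
	return float64(atomic.LoadInt64(&ops)) / dur.Seconds()
}

func main() {
	for _, w := range []int{1, 2, 4, 8} {
		fmt.Printf("%d writers: %.0f synced writes/sec\n", w, measure(w, 2*time.Second))
	}
}
```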
Thinking about this more, it is curious that performance decreases when moving from 1 to 4 ranges. Yes, each sync commit will have a greater likelihood of having to wait for a previous sync to finish, but we should be seeing as much or greater parallelism in the 4-range case. One difference is that there is only a single Raft processing goroutine in the 1-range case. It is surprising that this makes a difference, but perhaps something can be learned there.
I was misreading that. The idea still holds, though: we could pipeline the buffer flush and the file sync (a rough sketch of that follows below).

Very curious. We do eventually achieve linear scaling though, which indicates that the degradation as ranges continue to split is due to some other effect.
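Here is the flush/sync pipelining idea sketched out (hypothetical names, not a RocksDB patch): while one chunk is being fsync'ed, the next chunk's buffer flush to the OS can proceed on a separate stage.

```go
package main

import (
	"fmt"
	"os"
)

// flushReq carries a log chunk that has been staged in memory and now needs
// to be written to the OS and then made durable.
type flushReq struct {
	data []byte
	done chan struct{}
}

// pipeline splits the WAL write into two stages so that flushing chunk N+1
// to the OS can overlap with syncing chunk N to disk.
func pipeline(f *os.File, in <-chan *flushReq) {
	toSync := make(chan *flushReq, 1)

	// Stage 2: fsync, then signal durability.
	go func() {
		for r := range toSync {
			f.Sync()
			close(r.done)
		}
	}()

	// Stage 1: flush the buffer to the OS, then hand off to the syncer.
	go func() {
		for r := range in {
			f.Write(r.data) // write() without waiting for durability
			toSync <- r
		}
		close(toSync)
	}()
}

func main() {
	f, _ := os.CreateTemp("", "wal")
	defer os.Remove(f.Name())
	defer f.Close()

	in := make(chan *flushReq)
	pipeline(f, in)
	for i := 0; i < 4; i++ {
		r := &flushReq{data: []byte("entry"), done: make(chan struct{})}
		in <- r
		<-r.done
	}
	close(in)
	fmt.Println("done")
}
```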
As mentioned in person, I'll pull on this thread some more to see if anything unravels.
It is unsurprising that the number of WAL syncs increases as we move from 1 range to 2 ranges: we now have 2 Raft groups and we need to sync their log commits. But look at how the number of syncs increases as we further increase the number of ranges:

…

This is from a 10s kv0 run (…).
So that lines up with the perf dropoff we see, right? The first few ranges don't result in any WAL syncs being batched together, so they hurt performance. Above 4 ranges, we start seeing additional syncs coalesce with existing ones, so they no longer hurt performance and the additional concurrency they provide begins to dominate and improve overall performance.
Yep. I've been doing a bit of experimentation here, but I'm not seeing anything that can help. If we delay the WAL syncs slightly to allow them to batch, I can decrease the number of syncs for the 2- and 4-range scenarios, but doing so only hurts performance.
I don't know if this is just noise, but check this out on EBS vs. SSD: both have the drop-off, but it seems to be worse on EBS (and continually growing worse).
On a cluster running TPC-C for a few days, I've noticed that the p99 command commit latency and the p99 log commit latency are both slowly growing. This growth seems to be highly correlated with the range count in the cluster.
Interestingly, TPC-C has a fixed amount of load, so it would appear that the range count itself is the only moving variable here. More ranges with a fixed amount of load would result in less batching of RocksDB writes, because fewer writes would take place in the same Raft groups. However, our RocksDB commit pipeline attempts to transparently batch independent writes together, which should help avoid this kind of issue (see cockroach/pkg/storage/engine/rocksdb.go, lines 1752 to 1753 at 33c7d27).

I'd like to instrument this pipeline and see if there are any inefficiencies in it. Specifically, I'd like to check whether the pipeline remains full as the number of batches that it attempts to batch together grows. For instance, it may be the case that the write batch merging begins to take longer than the RocksDB writes themselves. This would allow for gaps in the pipeline where the RocksDB syncLoop remains idle.
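A minimal sketch of the kind of instrumentation described (illustrative only; `phaseTimer` is not an existing CockroachDB type): accumulate per-phase timings so that merge, write, and sync costs can be compared over a run and gaps in the pipeline become visible.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// phaseTimer accumulates per-phase durations so that, e.g., "merge" time can
// be compared against "write" and "sync" time over a run.
type phaseTimer struct {
	mu     sync.Mutex
	totals map[string]time.Duration
	counts map[string]int
}

func newPhaseTimer() *phaseTimer {
	return &phaseTimer{
		totals: make(map[string]time.Duration),
		counts: make(map[string]int),
	}
}

// timed runs fn and charges its duration to the named phase.
func (p *phaseTimer) timed(phase string, fn func()) {
	start := time.Now()
	fn()
	elapsed := time.Since(start)
	p.mu.Lock()
	p.totals[phase] += elapsed
	p.counts[phase]++
	p.mu.Unlock()
}

// report prints total and average time per phase.
func (p *phaseTimer) report() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for phase, total := range p.totals {
		fmt.Printf("%-6s total=%v avg=%v\n", phase, total, total/time.Duration(p.counts[phase]))
	}
}

func main() {
	pt := newPhaseTimer()
	for i := 0; i < 100; i++ {
		// Stand-ins for the real pipeline phases.
		pt.timed("merge", func() { time.Sleep(50 * time.Microsecond) })
		pt.timed("write", func() { time.Sleep(100 * time.Microsecond) })
		pt.timed("sync", func() { time.Sleep(500 * time.Microsecond) })
	}
	pt.report()
}
```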