backupccl: RESTORE slowing at end #23509
Referenced code: cockroach/pkg/ccl/storageccl/import.go, line 40 at fe81631.
I think the problem here is that scattering a small number of ranges (hundreds) in a cluster containing tens of thousands of ranges will often be a no-op. See #23358 (comment) for a suggestion about plumbing a new flag down to the code referenced above.
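As a purely illustrative sketch of what "plumbing a new flag down" could look like, here is a toy example of threading such an option from the restore entry point to the scatter call site. Every identifier below (importConfig, forceScatter, importRange) is hypothetical and not taken from the cockroach tree.

```go
// Hypothetical sketch: thread a force-scatter option from the restore/import
// entry point down to the call site in import.go referenced above. None of
// these names exist in the actual codebase.
package main

import (
	"context"
	"fmt"
)

type importConfig struct {
	// forceScatter would tell the scatter step to relocate the new ranges
	// even when they are only a few hundred out of tens of thousands, a
	// case where the allocator otherwise treats the scatter as a no-op.
	forceScatter bool
}

func importRange(ctx context.Context, rangeKey string, cfg importConfig) {
	if cfg.forceScatter {
		fmt.Println("scattering", rangeKey, "before ingesting")
	}
	fmt.Println("ingesting", rangeKey)
}

func main() {
	cfg := importConfig{forceScatter: true}
	for _, k := range []string{"/Table/53/1/0", "/Table/53/1/1000"} {
		importRange(context.Background(), k, cfg)
	}
}
```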
This should speed up `./workload fixture load`, especially when we're seeing the long-tail behavior described in cockroachdb#23509. Release note: None
I observe this on all of my TPC-C 10k restores. The last 0.2 TB takes ~40 minutes while the first 2 TB takes ~120 minutes.
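For scale, those numbers work out to roughly 2 TB / 120 min ≈ 17 GB/min for the bulk of the restore versus 0.2 TB / 40 min = 5 GB/min for the tail, so the final stretch runs at under a third of the earlier rate.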
That looks like a worse problem. Is that cluster still up?
On Wed, Nov 14, 2018, 03:34 Andy Woods wrote:
I can see similar things on master. For example:
[image] <https://user-images.githubusercontent.com/22278911/48456107-cd8f6f80-e78b-11e8-8160-6994c56534f9.png>
The job has been at 99% for at least the last 2 hours:
[image] <https://user-images.githubusercontent.com/22278911/48456121-dbdd8b80-e78b-11e8-80eb-82337b9688f4.png>
@awoods187 just sent me a cluster of his that had the same symptom. Not sure it's the root cause of the behavior he's seeing, but the cluster had all the signs of #31875 (comment), which is hopefully fixed once that PR lands.
The time-bound iterator workaround (#32909) eliminates at least one class of these tail issues for incremental backup. We're also seeing this behavior in restore, import, and non-incremental backup, so that was clearly not the only cause; just FYI that it was a partial fix.
cc @pbardea, thoughts on how relevant this still is with the work you did in 20.2?
There are a few things going on here:
A lot of this issue is also covered in #63925. One thing that hasn't been benchmarked is the performance of restoring a table into a cluster that already holds a significant amount of data. My hypothesis is that the RandomizeLeases flag on the AdminScatter request should help ensure that the leases are properly distributed, but I can run a benchmark to confirm. If we see pretty even AddSSTable traffic across the nodes in that benchmark, I think we can close out this issue.
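For reference, a rough sketch of issuing an AdminScatter with RandomizeLeases set over a table's span, written against an approximately 20.x-era tree; exact package paths and signatures may differ between versions.

```go
// Sketch: scatter a restored table's span and also randomize leaseholders,
// so leases for the new ranges are spread across nodes. Package paths and
// signatures are approximate for a 20.x-era cockroach tree.
package example

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

func scatterWithRandomizedLeases(ctx context.Context, db *kv.DB, span roachpb.Span) error {
	req := &roachpb.AdminScatterRequest{
		RequestHeader:   roachpb.RequestHeader{Key: span.Key, EndKey: span.EndKey},
		RandomizeLeases: true,
	}
	_, pErr := kv.SendWrapped(ctx, db.NonTransactionalSender(), req)
	return pErr.GoError()
}
```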
We think we’ve made substantial improvements here, so we will close this for now. There is probably more to be done, but not urgently; we can re-open if need be.
During a TPC-C restore, we noticed that the restore of a table slows way down near the end. In a goroutine dump, a single node (out of a 24-node cluster) had hundreds of goroutines in beginLimitedRequest, which throttles import requests to one at a time.
The next question was why that node (node 9) had so many requests after a scatter had run. From a range dump of the new table, we built a histogram of the number of leader leases per node:
The histogram indeed shows node 9 holding a much higher number of leader leases than the other nodes.
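For anyone reproducing this, a small sketch of the kind of tally used here, assuming the leaseholder node ID for each range has already been extracted from the range dump (the parsing itself is omitted, and the input below is toy data):

```go
// Build a per-node leaseholder histogram from (rangeID -> leaseholder node)
// pairs pulled out of a range dump for the restored table.
package main

import (
	"fmt"
	"sort"
)

func leaseHistogram(leaseholderByRange map[int64]int32) map[int32]int {
	counts := make(map[int32]int)
	for _, nodeID := range leaseholderByRange {
		counts[nodeID]++
	}
	return counts
}

func main() {
	// Toy input; in practice this comes from the range dump.
	counts := leaseHistogram(map[int64]int32{101: 9, 102: 9, 103: 3, 104: 9, 105: 7})

	nodes := make([]int32, 0, len(counts))
	for n := range counts {
		nodes = append(nodes, n)
	}
	sort.Slice(nodes, func(i, j int) bool { return nodes[i] < nodes[j] })
	for _, n := range nodes {
		fmt.Printf("n%d: %d leases\n", n, counts[n])
	}
}
```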
Our guess is that scatter is indeed scattering, but because various restores have already happened, it may have no choice but to put the new ranges onto a few underfull nodes, which makes the tail of the restore take a long time on those nodes. We don't think there's anything specific to do about this scatter behavior at the current time.
However, we do think it's worth removing or raising the limit in beginLimitedRequest, since some other limits during AddSSTable may make the beginLimitedRequest throttling unnecessary.
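To make the throttle concrete, here is a minimal sketch of a counting-semaphore limiter of the kind beginLimitedRequest appears to be (illustrative only, not the actual cockroach implementation). With the limit at 1, queued import requests on a node serialize exactly as the goroutine dump suggests.

```go
// Minimal sketch of a concurrency limiter in the spirit of
// beginLimitedRequest: a buffered channel used as a counting semaphore.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type limiter struct{ sem chan struct{} }

func newLimiter(n int) *limiter { return &limiter{sem: make(chan struct{}, n)} }

// begin blocks until a slot is free or the context is canceled.
func (l *limiter) begin(ctx context.Context) error {
	select {
	case l.sem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (l *limiter) finish() { <-l.sem }

func main() {
	// With a limit of 1, the "imports" below run strictly one at a time,
	// so a node that holds many leaseholders becomes the long tail.
	lim := newLimiter(1)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if err := lim.begin(context.Background()); err != nil {
				return
			}
			defer lim.finish()
			fmt.Println("import", i, "running")
			time.Sleep(10 * time.Millisecond)
		}(i)
	}
	wg.Wait()
}
```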
Epic CRDB-6406