Long restore of geo-distributed tpc-c 5k with storage queue errors #31172
Comments
This is reminiscent of a stress failure in TestStoreRangeMergeWatcher (though I don't actually suspect it's related to merges). I can't pull the issue up right now; I'm on mobile.
@petermattis or @nvanbenschoten, do you have bandwidth to take a look today? I suspect this repros quickly, but we might as well look at it while it's there.
I don't have time to look at it today. I should have time to look at it on Friday.
How long do geo-distributed restores usually take? Is this a regression, or expected?
A more complete screenshot: So we have two healthy replicas and one that needs a raft snapshot (the data it needs includes a split, which is causing other ranges to get their snapshots blocked, but doesn't appear to be the root cause). The raft log is large enough to prevent sending the snapshot, but not ridiculous (18MB). The two healthy replicas should be able to commit a raft log truncation to resolve this problem, but they're not. The leader's proposal quota is full. It looks like there's a kind of deadlock in …
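To make that failure mode concrete, here is a toy Go sketch of a proposal quota pool. It is not CockroachDB's actual quotaPool; the names and sizes are made up for illustration. The key property is that quota only comes back once the slowest follower has acknowledged the entries, so a follower stuck waiting for a snapshot can pin the pool, and the log-truncation proposal that would unstick it then blocks on that same pool.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// quotaPool is a toy model of a Raft proposal quota pool: the leader acquires
// quota before proposing and only gets it back once the slowest follower has
// acknowledged the entries.
type quotaPool struct {
	avail chan int64 // holds the currently available quota, if any
}

func newQuotaPool(max int64) *quotaPool {
	qp := &quotaPool{avail: make(chan int64, 1)}
	qp.avail <- max
	return qp
}

// acquire blocks until at least `want` quota is available or ctx expires.
func (qp *quotaPool) acquire(ctx context.Context, want int64) error {
	for {
		select {
		case q := <-qp.avail:
			if q >= want {
				if rem := q - want; rem > 0 {
					qp.avail <- rem // hand back the unused remainder
				}
				return nil
			}
			qp.avail <- q // not enough yet; put it back and retry
			time.Sleep(10 * time.Millisecond)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// release returns quota once followers have caught up. In the scenario above
// this never happens, because one follower is stuck waiting for a snapshot.
func (qp *quotaPool) release(q int64) {
	select {
	case cur := <-qp.avail:
		qp.avail <- cur + q
	default:
		qp.avail <- q
	}
}

func main() {
	qp := newQuotaPool(1 << 20) // 1 MiB of proposal quota

	// Outstanding proposals have drained the pool, and the follower that needs
	// a snapshot never acknowledges them, so release() is never called.
	if err := qp.acquire(context.Background(), 1<<20); err != nil {
		panic(err)
	}

	// The log-truncation proposal that would shrink the log enough to send the
	// snapshot now blocks on the same pool: the "kind of deadlock" above.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := qp.acquire(ctx, 1024); err != nil {
		fmt.Println("truncation proposal stuck waiting for quota:", err)
	}
}
```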
Do you want me to keep this cluster alive? It's set to expire this evening.
I'm OK with letting this cluster go. But try again to see if it's repeatable.
Alright, I'm killing it now. I'll run it tomorrow morning.
I'm re-running this same setup now and hit another bug: #31260. But more importantly, I also ran into a bunch of errors:
I've also been running into the same issue with a similar setup (multi-region 12 nodes). |
@asubiotto, which of the issues? The stuck restore?
Probably not worth trying to repro this until #31330 (comment) is fixed. Also, when reproing, merges should be turned off because of a known bug at the time of writing (hopefully fixed within a few days): #31409 (comment).
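As an aside, here is a minimal sketch of how one might turn merges off from a client for the repro, assuming kv.range_merge.queue_enabled is the cluster setting that gates the merge queue (the connection string and driver choice are placeholders):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Placeholder connection string; point it at any node in the cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumption: kv.range_merge.queue_enabled is the setting that disables the
	// range merge queue; flip it back to true once the repro is done.
	if _, err := db.Exec("SET CLUSTER SETTING kv.range_merge.queue_enabled = false"); err != nil {
		log.Fatal(err)
	}
	log.Println("range merges disabled for the duration of the repro")
}
```

The same statement can also be run from a cockroach sql shell against any node.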
Yes, a stuck (or very slow) restore with a lot of these errors.
@asubiotto do you still see this? As far as I can tell, we've fixed everything mentioned in this thread and I haven't been seeing this particular problem in my testing of #31409 (where before the fixes I would).
Haven't tried again but planning to soon. Will update this issue when I do.
Just restored TPC-C 5k on a multi-region cluster with no issues, so I'm closing this one; my understanding is that everything mentioned here has been fixed. Feel free to reopen if this is not the case.
Describe the problem
I ran a 9-node multi-region TPC-C deployment using the tpccbench roachtest and roachprod. The restore of TPC-C 5k took 12+ hours (still in progress).
To Reproduce
bin/roachtest bench '^tpccbench/nodes=9/cpu=16/multi-region$' --wipe=false --user=andy
Expected behavior
Faster restore of TPC-C.
Additional data / screenshots
Environment: