
Long restore of geo-distributed tpc-c 5k with storage queue errors #31172

Closed
awoods187 opened this issue Oct 10, 2018 · 18 comments
Labels: C-investigation (Further steps needed to qualify. C-label will change.) · S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)

Comments

@awoods187
Contributor

awoods187 commented Oct 10, 2018

Describe the problem

I ran a 9-node multi-region tpc-c deployment using tpccbench and roachprod. The restore of tpc-c 5k has taken 12+ hours and is still in progress.

To Reproduce

  1. Ran bin/roachtest bench '^tpccbench/nodes=9/cpu=16/multi-region$$' --wipe=false --user=andy

Expected behavior
Faster restore of tpc-c

Additional data / screenshots

E181010 00:18:26.064789 1654046 storage/queue.go:788  [raftsnapshot,n1,s1,r17845/1:/Table/57/1/2125/37396{-/0}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 26a3d630 at applied index 18 with error: [n3,s3],r17845: cannot apply snapshot: snapshot intersects existing range [n3,s3,r3484/2:/Table/5{7/1/994/…-8}]
E181010 00:18:26.132189 1654073 storage/queue.go:788  [raftsnapshot,n1,s1,r17941/1:/Table/57/1/1829/67093{-/0}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 7e83cfbb at applied index 18 with error: [n3,s3],r17941: cannot apply snapshot: snapshot intersects existing range [n3,s3,r3484/2:/Table/5{7/1/994/…-8}]
E181010 00:18:26.199505 1653933 storage/queue.go:788  [raftsnapshot,n1,s1,r18000/1:/Table/57/1/2127/45317{-/0}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 042d4d5b at applied index 18 with error: [n3,s3],r18000: cannot apply snapshot: snapshot intersects existing range [n3,s3,r3484/2:/Table/5{7/1/994/…-8}]
E181010 00:18:26.266851 1654094 storage/queue.go:788  [raftsnapshot,n1,s1,r17773/1:/Table/57/1/187{7/248…-8/288…}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 27a33942 at applied index 42 with error: [n3,s3],r17773: cannot apply snapshot: snapshot intersects existing range [n3,s3,r3484/2:/Table/5{7/1/994/…-8}]
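
For context on the error text: the receiving store refuses an incoming Raft snapshot whose key span intersects a range it already holds a replica of, which is what the messages above report against r3484 on n3. Below is a minimal, self-contained Go sketch of that kind of overlap check; the types, names, and single-letter keys are simplifications for illustration, not CockroachDB's actual API.

package main

import (
	"bytes"
	"fmt"
)

// span is a simplified half-open key interval [start, end), standing in for a
// range descriptor's bounds. Real keys are encoded table keys; the one-letter
// keys below are illustrative only.
type span struct {
	rangeID    int
	start, end []byte
}

// overlaps reports whether two half-open key spans intersect.
func overlaps(a, b span) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// canApplySnapshot mirrors the shape of the check behind the error above: a
// snapshot is refused if its span intersects a different range already present
// on the store.
func canApplySnapshot(existing []span, snap span) error {
	for _, e := range existing {
		if e.rangeID != snap.rangeID && overlaps(e, snap) {
			return fmt.Errorf("cannot apply snapshot: snapshot intersects existing range r%d", e.rangeID)
		}
	}
	return nil
}

func main() {
	// Stand-in for r3484, which still covers a wide span on the receiving store
	// that the incoming snapshot for r17845 falls inside.
	existing := []span{{rangeID: 3484, start: []byte("c"), end: []byte("t")}}
	snap := span{rangeID: 17845, start: []byte("k"), end: []byte("l")}
	fmt.Println(canApplySnapshot(existing, snap))
}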

[screenshot]

Environment:

  • CockroachDB 2.1 beta from 9/17
@benesch
Contributor

benesch commented Oct 10, 2018

This is reminiscent of a stress failure in TestStoreRangeMergeWatcher (that I don’t actually suspect is related to merges). I can’t pull the issue up right now; on mobile.

@tbg
Member

tbg commented Oct 10, 2018

@petermattis or @nvanbenschoten do you have bandwidth to take a look today? I suspect this repros quickly, but we might as well look at it while it's there.

@petermattis
Collaborator

I don't have time to look at this today, but I should have time on Friday.

@awoods187
Contributor Author

awoods187 commented Oct 10, 2018

I tried to re-run this on the 2.1 beta from 10/08 and it didn't finish restoring before it was killed by roachprod after 15 hours. I can't get the logs because the cluster was already destroyed, but I have this:

[screenshot]

@petermattis
Collaborator

How long do geo-distributed restores usually take? Is this a regression, or expected?

@bdarnell
Contributor

A more complete screenshot:

[screenshot]

So we have two healthy replicas and one that needs a raft snapshot (the data it needs includes a split, which is causing other ranges to get their snapshots blocked, but doesn't appear to be the root cause). The raft log is large enough to prevent sending the snapshot, but not ridiculous (18MB).

The two healthy replicas should be able to commit a raft log truncation to resolve this problem, but they're not. The leader's proposal quota is exhausted. It looks like there's a kind of deadlock in updateProposalQuotaRaftMuLocked, in which this node is live enough to keep the proposal quota tied up, but not live enough to allow the log to be truncated.
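
To make the quota angle concrete, here is a toy Go sketch of the failure mode described above, under loudly stated assumptions: a fixed-size pool, quota acquired per proposal, and quota returned only once every follower has acknowledged. This is an illustration of the deadlock shape, not CockroachDB's quota pool. One follower that is tracked as live but never catches up keeps its quota forever; once the pool drains, no further proposals can be submitted, including the log truncation that would let it catch up via snapshot.

package main

import (
	"context"
	"fmt"
	"time"
)

// quotaPool is a toy fixed-size pool: each proposal takes a token, and tokens
// only come back once every follower has acknowledged that proposal.
type quotaPool struct{ tokens chan struct{} }

func newQuotaPool(n int) *quotaPool {
	p := &quotaPool{tokens: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		p.tokens <- struct{}{}
	}
	return p
}

func (p *quotaPool) acquire(ctx context.Context) error {
	select {
	case <-p.tokens:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (p *quotaPool) release() { p.tokens <- struct{}{} }

func main() {
	pool := newQuotaPool(3) // tiny pool so the stall shows up quickly

	// allFollowersAcked stands in for the slow replica: it is considered live,
	// but it never acknowledges, so quota is never released.
	allFollowersAcked := func(proposal int) bool { return false }

	for i := 1; ; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		err := pool.acquire(ctx)
		cancel()
		if err != nil {
			// With the pool drained and nothing releasing quota, even a
			// log-truncation proposal cannot be submitted: a deadlock.
			fmt.Printf("proposal %d blocked: quota exhausted (%v)\n", i, err)
			return
		}
		fmt.Printf("proposal %d submitted\n", i)
		if allFollowersAcked(i) {
			pool.release()
		}
	}
}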

@awoods187
Contributor Author

Do you want me to keep this cluster alive? It's set to expire this evening.

@bdarnell
Contributor

I'm OK with letting this cluster go. But try again to see if it's repeatable.

@awoods187
Contributor Author

Alright, I'm killing it now. I'll re-run it tomorrow morning.

@awoods187
Contributor Author

awoods187 commented Oct 11, 2018

I'm re-running this same setup now and hit another bug: #31260

But more importantly, I also ran into a bunch of errors:

TIME SEVERITY MESSAGE FILE:LINE
2018-10-11 12:41:12 ERROR [n1] failed attempt to acquire migration lease: lease /System/"system-version/lease" is not available until at least 1539261732.606390643,0
2018-10-11 12:41:24 ERROR [n1,raftsnapshot,s1,r436/1:/Table/55/1/9/8/2871{-/0}] snapshot failed: (n8,s8):2: remote couldn't accept Raft snapshot de5eca15 at applied index 16 with error: [n8,s8],r436: cannot apply snapshot: snapshot intersects existing range [n8,s8,r330/2:/Table/55/1/{7/9/11…-69/3/2…}]

@a-robinson
Contributor

a-robinson commented Oct 11, 2018

The first error is fixed by #31270

The second is being worked on in #30064

@petermattis added the C-investigation label on Oct 14, 2018
@asubiotto
Contributor

asubiotto commented Oct 17, 2018

I've also been running into the same issue with a similar setup (multi-region 12 nodes).

@tbg
Member

tbg commented Oct 17, 2018

@asubiotto which of the issues? The stuck restore?

@tbg
Member

tbg commented Oct 17, 2018

Probably not worth trying to repro this until #31330 (comment) is fixed. Also, when reproing, merges should be turned off because of a known bug at the time of writing (hopefully fixed in a few days at most): #31409 (comment)
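
For anyone reproducing in the meantime, a hedged sketch of turning the merge queue off programmatically over the SQL wire protocol; the connection string is illustrative and the setting name kv.range_merge.queue_enabled is assumed from the 2.1-era docs, so verify it against your build. The same statement can also just be run from a SQL shell.

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

func main() {
	// Illustrative connection string; point it at any node of the cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed 2.1-era setting name for the merge queue; check your version's docs.
	if _, err := db.Exec(`SET CLUSTER SETTING kv.range_merge.queue_enabled = false`); err != nil {
		log.Fatal(err)
	}
	log.Println("range merge queue disabled")
}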

@tbg added the S-1-stability label on Oct 17, 2018
@asubiotto
Contributor

asubiotto commented Oct 17, 2018

Yes, a stuck (or very slow) restore with a lot of these "snapshot intersects existing range" messages. Is there some way I can work around it?

@tbg
Member

tbg commented Oct 23, 2018

@asubiotto do you still see this? As far as I can tell, we've fixed everything mentioned in this thread and I haven't been seeing this particular problem in my testing of #31409 (where before the fixes I would).

@asubiotto
Contributor

Haven't tried again but planning to soon. Will update this issue when I do.

@asubiotto
Contributor

Just restored tpcc 5k on a multi-region cluster with no issues, so I'm closing this one; my understanding is that everything mentioned here has been fixed. Feel free to reopen if that's not the case.
