roachtest: import/tpch/nodes=8 failed [release-2.1] #32898
release-2.1. That's still expected to fail there, though I'm going to have to backport some subset of changes that fixes this. |
…snapshots

Backports cockroachdb#32594. This didn't apply cleanly, but only because I never cherry-picked cockroachdb#32594 and cockroachdb#32594 refers to a variable introduced within.

Fixes cockroachdb#32898.
Fixes cockroachdb#32900.
Fixes cockroachdb#32895.

/cc @cockroachdb/release

Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1050805&tab=buildLog
|
Well, this sucks. On the plus side, there are no warnings about lots of snapshots, but there are a few stuck ranges that are usually caused by pending snapshots. Going to have to dig into the logs. |
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052892&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052961&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1054703&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1055192&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1056651&tab=buildLog
|
This one is a real failure with stuck Raft commands. I'm not seeing the usual snapshot problems (which is good because I backported fixes for them). In fact, it doesn't seem that any Raft snapshots were (attempted to be) sent in the last two hours of the test. And yet, there are four Raft groups with stuck commands:
|
r3152 logs this a few times:
This suggests that something's supposed to be done with this range. There is no other chatter from the replicate queue, which suggests that it gets stuck really early. r3053 has no activity after the "have been waiting.." msg. r3167 has a stuck request and then does a replication change (but the request seems to remain stuck?)
The test timed out at 3:14, so a lot later than many of these events. |
Looking at a repro. This is one of the five ranges that have stuck proposals:
What's weird is that many ranges seem to have four replicas (the alloc simulation wants to remove one, as you'd expect). But other than that, this status seems "healthy" except for the pending command. I'm going to logspy in. |
Ah, that was easy:
I thought I had backported all of these fixes. Ah, the most recent Raft vendor bump picks up only etcd-io/etcd#10167. I bet it doesn't pick up the fix for that PR, etcd-io/etcd#10199. Yeah, it doesn't. |
Fixed by #33228. |
SHA: https://github.com/cockroachdb/cockroach/commits/1146a03cc217cb57bdddd795e2d2fe2806c64985
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1042719&tab=buildLog