storage: TestReplicateQueueDownReplicate failed under stress #28368
Parameters:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=831873&tab=buildLog |
Parameters:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=834674&tab=buildLog |
Seeing "range not found on store {5 5}" repeatedly from this code, presumably while handling the server whose nodeID is 5:

    testutils.SucceedsSoon(t, func() error {
        _, err := tc.AddReplicas(testKey, roachpb.ReplicationTarget{
            NodeID:  nodeID,
            StoreID: server.GetFirstStoreID(),
        })
        if testutils.IsError(err, allowedErrs) {
            return nil
        }
        return err
    })

Looking at AddReplicas this makes sense: that code upreplicates and then waits for the replica to appear. But it might've been GC'ed in the meantime, for example here:
So this looks like a problem in AddReplicas.
Parameters:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=837548&tab=buildLog |
Parameters:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=843778&tab=buildLog |
Parameters:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=848830&tab=buildLog |
It is possible this was fixed by #28877. I'm going to try reproducing before and after that change. |
Nope, this still happens on current master with |
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=861628&tab=buildLog |
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=862681&tab=buildLog |
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=868416&tab=buildLog |
AddReplicas was verifying that a replica had indeed been added, but there's no guarantee that the replicate queue wouldn't have removed it in the meantime. Attempt to work around this somewhat. The real solution is not to provide that guarantee, but some tests likely rely on it (and the failure is extremely rare, i.e. the new for loop basically never runs). Observed in cockroachdb#28368. Release note: None
Hmm, reproing my earlier observation is difficult as this test likes to just completely slam the CPUs and times out under stressrace. I opened #30455, but will close this as it doesn't appear to repro in nightlies any more. |
30405: roachtest: mark acceptance as stable r=petermattis a=tschottdorf

All of its subtests are already stable, but in running a test locally I noticed that the top-level test was marked as passing as unstable. I'm not sure, but this might mean that the top-level test would actually not fail? Either way, better to mark it as stable explicitly. We should also spend some thought on how diverging notions of Stable in sub vs top level test are treated, not sure that this is well-defined.

Release note: None

30446: opt: fix panic when srf used with GROUP BY r=rytaft a=rytaft

Instead of panicking, we now throw an appropriate error.

Fixes #30412

Release note (bug fix): Fixed a panic that occurred when a generator function such as unnest was used in the SELECT list in the presence of GROUP BY.

30450: roachtest: remove now-unnecessary hack r=petermattis a=tschottdorf

Closes #27717.

Release note: None

30451: storage: give TestReplicateRemovedNodeDisruptiveElection more time r=petermattis a=tschottdorf

Perhaps: Fixes #27253.

Release note: None

30452: storage: de-flake TestReplicaIDChangePending r=petermattis a=tschottdorf

setReplicaID refreshes the proposal and was thus synchronously writing to the commandProposed chan. This channel could have filled up due to an earlier reproposal already, deadlocking the test.

Fixes #28132.

Release note: None

30455: testcluster: improve AddReplicas check r=petermattis a=tschottdorf

AddReplicas was verifying that a replica had indeed been added, but there's no guarantee that the replicate queue wouldn't have removed it in the meantime. Attempt to work around this somewhat. The real solution is not to provide that guarantee, but some tests likely rely on it (and the failure is extremely rare, i.e. the new for loop basically never runs).

Observed in #28368.

Release note: None

30456: storage: unskip TestClosedTimestampCanServe for non-race r=petermattis a=tschottdorf

Fixes #28607.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
SHA: https://github.com/cockroachdb/cockroach/commits/bf76db84cb64dc90f65d8b2e129c75028127cda2
Parameters:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=823111&tab=buildLog