[reparentutil / ERS] confirm at least one replica succeeded to `SetMaster`, or fail
#7486
Conversation
Okay, I seem to have made those tests flaky... taking a look. Hilariously they passed the few times I ran them locally, but if I run them enough times in a row I can get them to fail in a few ways.
Force-pushed from 93fb928 to 35d1db6
This introduces two background goroutines and contexts derived from `context.Background()` to poll and signal for (1) at least one replica finished successfully and (2) all replicas finished, regardless of status, respectively. The overall `promotePrimary` function returns error conditions based on which of those contexts gets signaled first. We still fail if `PromoteReplica` fails, which gets checked first, but this covers the case where we're not running with semi-sync, and none of the replicas are able to reparent even if the new primary is able to populate its reparent journal (in the semi-sync case, if 0 replicas succeed to `SetMaster`, then the primary will fail to `PromoteReplica`, so this was already covered there). Fixes vitessio#7480 Signed-off-by: Andrew Mason <amason@slack-corp.com>
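To make the mechanism described above easier to follow, here is a minimal, self-contained sketch of the two signal contexts and the select over them. The helper names (`setMaster`, `waitReplicasTimeout`) and the replica list are assumptions for illustration only, not the actual Vitess code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

func main() {
	replicas := []string{"replica-1", "replica-2", "replica-3"}
	waitReplicasTimeout := 5 * time.Second

	// Canceled as soon as any single replica reparents successfully.
	replSuccessCtx, replSuccessCancel := context.WithCancel(context.Background())
	// Canceled once every replica goroutine has finished, regardless of outcome.
	allReplicasDoneCtx, allReplicasDoneCancel := context.WithCancel(context.Background())

	// setMaster stands in for the real tablet-manager RPC in this sketch.
	setMaster := func(ctx context.Context, replica string) error {
		select {
		case <-time.After(10 * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	var (
		mu   sync.Mutex
		errs []error
		wg   sync.WaitGroup
	)

	for _, replica := range replicas {
		wg.Add(1)
		go func(replica string) {
			defer wg.Done()

			// Each call is bounded so the goroutine always finishes.
			ctx, cancel := context.WithTimeout(context.Background(), waitReplicasTimeout)
			defer cancel()

			if err := setMaster(ctx, replica); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
				return
			}

			replSuccessCancel() // signal: at least one replica succeeded
		}(replica)
	}

	// Signal "all done" once every replica goroutine has returned.
	go func() {
		wg.Wait()
		allReplicasDoneCancel()
	}()

	select {
	case <-replSuccessCtx.Done():
		fmt.Println("at least one replica was able to SetMaster; ERS can proceed")
	case <-allReplicasDoneCtx.Done():
		mu.Lock()
		defer mu.Unlock()
		if len(errs) == len(replicas) {
			fmt.Println("all replicas failed:", errors.Join(errs...))
			return
		}
		// Both contexts may already be canceled when we get here, so re-check
		// the error count instead of assuming this branch means total failure.
		fmt.Println("at least one replica succeeded before all goroutines finished")
	}
}
```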
…l case Signed-off-by: Andrew Mason <amason@slack-corp.com>
Force-pushed from 35d1db6 to 1ee9045
…r` and context cancelling Also add a duplicated check on whether at least one replica succeeded, to guard against racing goroutines between the `replSuccessCtx` and `allReplicasDoneCtx` cancellations. Signed-off-by: Andrew Mason <amason@slack-corp.com>
This makes sense to me. The comments + tests are great, as usual. ✨
I'm not yet confident enough in my Vitess or Go concurrency skillz to give you an official approval on this, though, since ERS is... rather important! 😎
@deepthi or @setassociative would you mind reviewing this one, too, when you get a chance? 🙇♀️
I'm a little concerned with encouraging use of EmergencyReparent in cases where a user isn't using semi-sync, because you would very likely experience data loss. Certainly I think it's questionable that we could select an appropriate replica on behalf of the user in those cases. We could have a

I realize that it's actually not related to what this PR actually does, and is more replying to what you wrote in your description and related issue. We should think about restricting the mode where we find the best candidate to only semi-sync use cases.
	// At least one replica was able to SetMaster successfully
	return nil
case <-allReplicasDoneCtx.Done():
	if len(rec.Errors) >= numReplicas {
Why would the summed-up errors exceed the number of replicas? If it should never be greater than that, this is an odd thing to write in code.
My thinking was that being more permissive here means the code still works if `handleReplica` is ever updated to record multiple errors per replica.
I think if that's the case it should get updated when that happens. This code path implies this is a valid state of existence. (I'm required as a reader to read into the comment to understand that this is simply defensive programming, and for a reality we don't live in yet)
I guess `if len(rec.Errors) == numReplicas` feels weirdly specific, but I agree with @PrismaPhonic that if we ever record more than one error from each goroutine, this will need to change anyway. So it is probably best to check for the specific condition.
> This code path implies this is a valid state of existence.
The code directly states that this is an error state and not a valid state.
If I change this to `len(rec.Errors) == numReplicas`, then we can return success in places where we definitely shouldn't. Open to suggestions on how to make the comments more clear, though!
How about:
switch {
case len(rec.Errors) == numReplicas:
	// original error case in question
case len(rec.Errors) > numReplicas:
	return vterrors.Wrapf(rec.Error(), "received more errors (= %d) than replicas (= %d), which should be impossible", len(rec.Errors), numReplicas)
default:
	return nil
}
This lgtm.
I still am not seeing how we would return success in cases when we shouldn't, if we truly think that errors will never exceed the count of num replicas. This does seem overly defensive to me, but I do think this newer version is better.
@@ -142,6 +142,10 @@ func FindValidEmergencyReparentCandidates(
// ReplicaWasRunning returns true if a StopReplicationStatus indicates that the
// replica had running replication threads before being stopped.
func ReplicaWasRunning(stopStatus *replicationdatapb.StopReplicationStatus) bool {
	if stopStatus.Before == nil {
I'm not sure false is the right thing to return here. If before is nil we can't know if it was running or not. Not getting `Before` at all, IMO, should be a panic case.
I think we shouldn't have code in the critical path of an ERS that can panic, because we'd either need to add `recover` defers and figure out how to recover from ... any point where the process can fail, or risk leaving the cluster in a worse state than when we started.
How about:
func ReplicaWasRunning(stopStatus *replicationdatapb.StopReplicationStatus) (bool, error) {
	if stopStatus.Before == nil {
		return false, errors.New("...")
	}

	return stopStatus.Before.IoThreadRunning || stopStatus.Before.SqlThreadRunning, nil
}
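For illustration only, here is a self-contained sketch of how a caller might handle that `(bool, error)` signature. The stand-in types and names below are assumptions for the example, not the Vitess proto types or real call sites.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the replicationdata proto types, just for this sketch.
type replicaStatus struct {
	IoThreadRunning  bool
	SqlThreadRunning bool
}

type stopReplicationStatus struct {
	Before *replicaStatus
	After  *replicaStatus
}

// replicaWasRunning mirrors the suggested (bool, error) signature.
func replicaWasRunning(stopStatus *stopReplicationStatus) (bool, error) {
	if stopStatus == nil || stopStatus.Before == nil {
		return false, errors.New("could not determine Before state of StopReplicationStatus")
	}
	return stopStatus.Before.IoThreadRunning || stopStatus.Before.SqlThreadRunning, nil
}

func main() {
	statuses := map[string]*stopReplicationStatus{
		"replica-1": {Before: &replicaStatus{IoThreadRunning: true}},
		"replica-2": {Before: nil}, // the case the panic-vs-error discussion is about
	}

	for alias, status := range statuses {
		wasRunning, err := replicaWasRunning(status)
		if err != nil {
			// The caller now decides what to do: here we skip the replica rather
			// than guessing (or panicking) about its replication state.
			fmt.Printf("%s: cannot determine replication state: %v\n", alias, err)
			continue
		}
		fmt.Printf("%s: was running replication: %v\n", alias, wasRunning)
	}
}
```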
Better error than panic :) This lgtm.
This looks better
LGTM except for the comment on the comment.
// to signal when all replica goroutines have finished. In the case where at
// least one replica succeeds, replSuccessCtx will be canceled first, while
// allReplicasDoneCtx is guaranteed to be canceled within
// opts.WaitReplicasTimeout plus some jitter.
Very nice. IIUC, the calls to `SetMaster` are bounded by `WaitReplicasTimeout`, which guarantees that `allReplicasDoneCancel` is eventually called.
Precisely!
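To illustrate why that guarantee holds, here is a small hypothetical standalone example (not the Vitess code itself) in which a replica call hangs: because each call runs under a deadline in the spirit of `WaitReplicasTimeout`, every goroutine still returns, so the all-done cancel always fires.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// slowSetMaster simulates a replica RPC that would block far longer than we
// are willing to wait.
func slowSetMaster(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Second): // pretend the replica never answers in time
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	waitReplicasTimeout := 200 * time.Millisecond

	allReplicasDoneCtx, allReplicasDoneCancel := context.WithCancel(context.Background())

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			// Every call is bounded by the timeout, so this goroutine cannot
			// outlive the deadline even if the replica hangs.
			ctx, cancel := context.WithTimeout(context.Background(), waitReplicasTimeout)
			defer cancel()

			_ = slowSetMaster(ctx)
		}()
	}

	go func() {
		wg.Wait()               // returns once every bounded goroutine has finished...
		allReplicasDoneCancel() // ...which is why this cancel is guaranteed to fire
	}()

	<-allReplicasDoneCtx.Done()
	fmt.Println("all replica goroutines finished within the timeout bound")
}
```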
// finished. If one replica is slow, but another finishes quickly, the main
// thread of execution in this function while this goroutine will run until
// the parent context times out, without slowing down the flow of ERS.
Can you rephrase this sentence? I had difficulty parsing it.
Yeah, let me take a shot at editing that.
Let me know if the newest version makes more sense, or if you have other suggestions!
Much better.
I don't see that we are encouraging it. Can you elaborate?
@deepthi Just saw your reply. I edited my comment to clarify that I was referring to the author's comments in their description and related issue.
@PrismaPhonic Re semi-sync, I think we should also fix #7441
Re semi-sync, happy to talk more about that outside the PR, but I don't think it has any bearing here. This doesn't change the behavior in that case; it only fixes the case where, if you were not running with semi-sync and all your replicas failed to `SetMaster`, the ERS would previously still have appeared to succeed.

For the actual safety/integrity of your vttablet components (not running semi-sync), that is unchanged as a result of this PR.
Signed-off-by: Andrew Mason <amason@slack-corp.com>
…and callers Signed-off-by: Andrew Mason <amason@slack-corp.com>
… cases Signed-off-by: Andrew Mason <amason@slack-corp.com>
Force-pushed from eba839c to 8f2e1fc
This takes the core of the change from vitessio#7486 and backports it into 8.0. Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
[reparentutil / ERS] confirm at least one replica succeeded to `SetMaster`, or fail Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
Backport some panic protection during ERS This takes the core of the change from vitessio#7486 and backports it into 8.0. Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
Description
This introduces two background goroutines and contexts derived from `context.Background()` to poll and signal for (1) at least one replica finished successfully and (2) all replicas finished, regardless of status, respectively.

The overall `promotePrimary` function returns error conditions based on which of those contexts gets signaled first. We still fail if `PromoteReplica` fails, which gets checked first, but this covers the case where we're not running with semi-sync, and none of the replicas are able to reparent even if the new primary is able to populate its reparent journal (in the semi-sync case, if 0 replicas succeed to `SetMaster`, then the primary will fail to `PromoteReplica`, so this was already covered there).

Fixes #7480

Signed-off-by: Andrew Mason amason@slack-corp.com
Related Issue(s)
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: