Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix error checking in ERS #9330

Merged
merged 14 commits into from
Dec 16, 2021
Merged

Fix error checking in ERS #9330

merged 14 commits into from
Dec 16, 2021

Conversation

GuptaManan100
Copy link
Member

Description

The StopReplicationAndBuildStatusMaps function calls the StopReplicationAndGetStatus rpc and proceeds to check whether the received error is ErrNotReplica. We use this information to know that the tablet in question considered itself to be the primary or not.

However, we should not directly compare the error pointers since we receive an RPC error and the original error we sent from the vttablet. This PR changes one of the test to send the correct error message and fixes the underlying problem.

Also, as part of this PR, the two constant errors have been moved to the vterrors package from the mysql package.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@GuptaManan100
Copy link
Member Author

GuptaManan100 commented Dec 7, 2021

@harshit-gangal @deepthi To fix the issue currently we have changed the type of ErrNoReplica from a normal error to a vterrors error, we are sending it as a grpc error and then converting back to a vterror to check its type. But since type check is insufficient, we also need to do a substring match for the exact error message.

There is another approach that we can take. For now in the mysql package we have the constants for the mysql error numbers. We can add a new error number representing ErrNoReplica which does not collide with MySQL error numbers. Then we can send an error with that error message and use that to check the type. There are a couple of pros and cons associated with this approach

Pro - Approach is easier to understand by looking at the code and would eliminate the substring comparison that we have.
Con - Blurs the boundary between Vitess and MySQL errors. We would now have Vitess-specific MySQL errors apart from vterrors.

WDYT?

go/test/endtoend/reparent/emergencyreparent/ers_test.go Outdated Show resolved Hide resolved
go/vt/vterrors/constants.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/replication.go Outdated Show resolved Hide resolved
@vmg
Copy link
Collaborator

vmg commented Dec 9, 2021

@GuptaManan100 upon further thought, I think this is an important enough error case that we should add a new error code (while ensuring that it doesn't collide with MySQL's error codes). There's enough gaps in the MySQL error codes space to add a section of Vitess-specific errors -- keeping them close together will allow us to not blur the difference between MySQL errors and ours.

@GuptaManan100 GuptaManan100 marked this pull request as draft December 9, 2021 18:33
@GuptaManan100
Copy link
Member Author

Fixing the bug in ERS has led to a new issue. Consider the following scenario -

  1. There are 4 tablets, let's call them tab1, tab2, tab3 and tab4.
  2. Let tab1 be the primary originally.
  3. The cluster is working correctly.
  4. The user executes ERS requesting tab4 to be the new primary
  5. ERS calls StopReplicationAndBuildStatusMap. We run the fillStatus function for all the 4 tablets simultaneously.
  6. tab1, tab2 and tab3 return their results before tab4, maybe due to momentary congestion in the network.
  7. We cancel the context as soon as we receive 3 successful tablets. (n-1 check in the wait group).
  8. As a result, tab4 is not added to the list of candidates for promotion!

There doesn't seem to be a way to fix this problem without further deliberation, since we can't wait on all the n tablets too, if we did then if the primary was actually down, we would exhaust all our time waiting for it to return!

@GuptaManan100
Copy link
Member Author

Solution for the above problem is to always wait for the newPrimary that the user has asked for and any other n-2 tablets. Proceeding with this change

Signed-off-by: Manan Gupta <manan@planetscale.com>
…hich have no replication status

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…ewly introduced sql error number

Signed-off-by: Manan Gupta <manan@planetscale.com>
…loser to as though it failed

Signed-off-by: Manan Gupta <manan@planetscale.com>
…e status map

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 marked this pull request as ready for review December 14, 2021 16:00
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine overall, but maybe a few small changes needed.
Good work 💯

go/mysql/flavor.go Show resolved Hide resolved
go/vt/concurrency/error_group.go Show resolved Hide resolved
go/vt/vtctl/reparentutil/replication.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/replication.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/replication.go Show resolved Hide resolved
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@deepthi deepthi merged commit fa98041 into vitessio:main Dec 16, 2021
@deepthi deepthi deleted the ers-bug branch December 16, 2021 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in ERS checking the pointer of the error received
3 participants