mysqld: check for correct return value from WAIT_FOR_EXECUTED_GTID_SET #14739
Signed-off-by: deepthi <deepthi@planetscale.com>
For anyone who is interested, here's how I simulated this
@mattlord one thing to check during review is where we still use WaitForPosition etc. in vreplication imports and whether this breaks those for MariaDB / FilePos.
For now I've marked this for backports, but we also have the option of reverting the change on release branches while we fix forward on main.
Both the MariaDB and FilePos wait-for-position commands return -1 on timeout. We could enhance the flavor interface to report the value each flavor returns on timeout and use that rather than hardcoding it here.
@@ -377,7 +377,7 @@ func (mysqld *Mysqld) WaitSourcePos(ctx context.Context, targetPos replication.P
 	if result.IsNull() {
 		return fmt.Errorf("%v(%v) failed: replication is probably stopped", waitCommandName, query)
 	}
-	if result.ToString() == "-1" {
+	if result.ToString() == "1" {
I think we should invert the check here and validate success.
For any function that returns a C-style integer where one value specifically indicates success, we should always check against that value. I've seen checks against failure values cause problems too many times.
So either here do a != “0” or switch the return and fallback and do == “0”.
Yeah, this way would work with 8.0 and 8.2 (and MariaDB). 0 is success, non-zero value is failure. And I think that we should add a const for it so that it's more obvious.
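A minimal sketch of the "validate success" check suggested above, assuming 0 is the only success value for the wait function in use. The names waitSuccess and checkWaitResult are hypothetical and not taken from the Vitess codebase; the isNull and value parameters stand in for result.IsNull() and result.ToString() from the diff.

```go
package main

import "fmt"

// waitSuccess is a hypothetical constant naming the one return value
// that this sketch treats as success.
const waitSuccess = "0"

// checkWaitResult is a hypothetical helper. It returns nil only when the
// wait function reported success, and an error for NULL or any other value.
func checkWaitResult(waitCommandName, query string, isNull bool, value string) error {
	if isNull {
		return fmt.Errorf("%v(%v) failed: replication is probably stopped", waitCommandName, query)
	}
	// Compare against the success value instead of a specific failure value,
	// so a timeout reported as "1" or as "-1" is rejected either way.
	if value != waitSuccess {
		return fmt.Errorf("%v(%v) did not succeed: returned %v", waitCommandName, query, value)
	}
	return nil
}
```

This only holds for functions where 0 is the sole success value. A wait function that reports some other non-negative value on success would need a different check.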
Good points. I'll push a change.
See later comments from Rohit.
Unfortunately for We should delegate the return value interpretation to the flavor interface and let that interface return SUCCESS, TIMEOUT or ERROR?
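One way the delegation to the flavor could look, as a rough sketch rather than a proposal for the actual interface. WaitOutcome, waitResultInterpreter and both implementing types are hypothetical names. The timeout values (1 for WAIT_FOR_EXECUTED_GTID_SET, -1 for the MariaDB and FilePos commands) come from this thread; treating every other non-NULL value as success in the -1 case is an assumption.

```go
package main

// WaitOutcome is a hypothetical tri-state result for a wait-for-position call.
type WaitOutcome int

const (
	WaitSuccess WaitOutcome = iota
	WaitTimeout
	WaitError
)

// waitResultInterpreter is a hypothetical stand-in for a method that each
// flavor would implement to interpret its own wait function's return value.
type waitResultInterpreter interface {
	interpretWaitResult(isNull bool, value string) WaitOutcome
}

// gtidSetWait models WAIT_FOR_EXECUTED_GTID_SET: 0 is success, 1 is timeout.
type gtidSetWait struct{}

func (gtidSetWait) interpretWaitResult(isNull bool, value string) WaitOutcome {
	switch {
	case isNull:
		return WaitError
	case value == "0":
		return WaitSuccess
	case value == "1":
		return WaitTimeout
	default:
		return WaitError
	}
}

// minusOneTimeoutWait models the commands that return -1 on timeout.
// Assumption: any other non-NULL value counts as success.
type minusOneTimeoutWait struct{}

func (minusOneTimeoutWait) interpretWaitResult(isNull bool, value string) WaitOutcome {
	switch {
	case isNull:
		return WaitError
	case value == "-1":
		return WaitTimeout
	default:
		return WaitSuccess
	}
}
```

The caller would then switch on the WaitOutcome and never compare against a raw return value, so the flavor-specific codes stay in one place.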
Good catch. Trying to understand the logic for backporting / not backporting the fix, wouldn't this be needed in earlier Vitess versions that might need to work with later MySQL versions?
Closing in favor of #14745
We have talked about this before. There is no guarantee that an earlier Vitess version will work with a MySQL version that comes along later. In fact, we only guarantee that it works with the versions that we are providing in the docker images. We probably need to write this down somewhere to avoid confusion / incorrect assumptions.
Got it! I'll keep that in mind.
Description
In #14612, we changed how we wait for a primary candidate to catch up to a specific replication position. Specifically for the mysql flavor, instead of using WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS we now use WAIT_FOR_EXECUTED_GTID_SET.
However, the two functions return different values on timeout. The old one returns -1, whereas the new one returns 1.
This was causing us to assume that WaitForSourcePos was successful when in fact it had timed out, leading to the situation described in #14738.
I have ignored the MariaDB flavor since we dropped support in v16. In any case, that flavor is executing this command with no timeout specified, which means it will wait indefinitely and return 0.
Right now the only "affected" version is main. However, because #14612 was back ported to all release branches, we plan to revert it in all the release branches. Luckily, we haven't actually made any new patch releases after #14612 was merged.
I spent some time trying to come up with a way to unit test this, but ran into the old problem of the complexity of mocking a real mysql. I'll probably spend more time on that later on, but right now we need to get this fixed ASAP.
Proof of the correctness of this change (MySQL only):
Related Issue(s)
WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS #14611
Checklist
Deployment Notes
EDIT: instead of back porting this PR, we are reverting the original PR on all release branches. That gives us time to debate the code changes and get them right on main for v19.