-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resuming ghostferry may fail due to inactive verifier #184
Resuming ghostferry may fail due to inactive verifier #184
Comments
If the inline-verifier is not enabled (as is the case for various production uses of ghostferry), its binlog position can grow stale. In some cases it points to such an old position that resuming from it fails (if the source has already deleted such old replication logs). This commit fixes this by relying solely on the binlog writer resume position if the inline verifier is not enabled. We still fail if the inline verifier *is* enabled and the position is stale, but there is nothing one can do about that. If verification is enabled, one must ensure that it's able to keep up with migration. This fixes Shopify#184 Change-Id: I24cef00bb78c266705107b2b8b4008171186940d
If the inline-verifier is not enabled (as is the case for various production uses of ghostferry), its binlog position can grow stale. In some cases it points to such an old position that resuming from it fails (if the source has already deleted such old replication logs). This commit fixes this by relying solely on the binlog writer resume position if the inline verifier is not enabled. We still fail if the inline verifier *is* enabled and the position is stale, but there is nothing one can do about that. If verification is enabled, one must ensure that it's able to keep up with migration. This fixes Shopify#184 Change-Id: Iec689d7651e533772642f366a0ff80c891d28157
* Fix resuming state without inline verifier If the inline-verifier is not enabled (as is the case for various production uses of ghostferry), its binlog position can grow stale. In some cases it points to such an old position that resuming from it fails (if the source has already deleted such old replication logs). This commit fixes this by relying solely on the binlog writer resume position if the inline verifier is not enabled. We still fail if the inline verifier *is* enabled and the position is stale, but there is nothing one can do about that. If verification is enabled, one must ensure that it's able to keep up with migration. This fixes Shopify#184
This would be a bug in the inline verifier then. The only way I can see this happening based on the code: Lines 558 to 577 in 04dda85
is if the mutex that is protecting the Have you observed something like this or were you just speculating on that point? |
maybe I did a bad job of explaining this. Sorry about that It's not that the verifier is stuck, it's rather that it's not configured to run in the first place. That is, one could have it running for a first phase of a migration. Then, for some reason (for example after we feel confident that there is no data corruption and want to disable it for some reasons such as performance) we decide to disable the verifier.
I have observed this on 2 installations, but I'm running my fork'ed version. Checking the code, it seems that the upstream (your) version has the same code. It would take me a bit to verify this on the upstream code, but I think conceptually the same problem exists there (it's not a "bug", it's really a conceptual issue). If your version somehow prevents the verifier state to be discarded if the verifier is disabled, you would work around the problem (which is essentially what my patch does), but I didn't see code for doing that please also note the discussion on the original issue I opened: |
The inline verifier uses a "resume position" for keeping track of the last event that it has seen. Similar to the binlog writer's "resume position",
ghostferry
uses this verifier resume position to continue its job after a restart of the tool.ghostferry
uses the smaller value of inline verifier and binlog writer resume position to decide what's the position on the source DB from which to stream binlogs.If the inline-verifier is not enabled enabled at all, its resume position can grow stale. In some cases it points to such an old position that resuming from it fails (if the source has already deleted such old replication logs).
While having a inline verifier that cannot keep up with the event frequency is a problem that can't really be solved. But we can tackle the trivial case where the verifier is not enabled at all (and we inevitably will get to a point where the initial position will become invalid)
The text was updated successfully, but these errors were encountered: