Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resuming ghostferry may fail due to inactive verifier #184

Closed
kolbitsch-lastline opened this issue May 12, 2020 · 2 comments · Fixed by Lastline-Inc/ghostferry#30 · May be fixed by #185
Closed

Resuming ghostferry may fail due to inactive verifier #184

kolbitsch-lastline opened this issue May 12, 2020 · 2 comments · Fixed by Lastline-Inc/ghostferry#30 · May be fixed by #185

Comments

@kolbitsch-lastline
Copy link

The inline verifier uses a "resume position" for keeping track of the last event that it has seen. Similar to the binlog writer's "resume position", ghostferry uses this verifier resume position to continue its job after a restart of the tool.

ghostferry uses the smaller value of inline verifier and binlog writer resume position to decide what's the position on the source DB from which to stream binlogs.

If the inline-verifier is not enabled enabled at all, its resume position can grow stale. In some cases it points to such an old position that resuming from it fails (if the source has already deleted such old replication logs).

While having a inline verifier that cannot keep up with the event frequency is a problem that can't really be solved. But we can tackle the trivial case where the verifier is not enabled at all (and we inevitably will get to a point where the initial position will become invalid)

kolbitsch-lastline pushed a commit to Lastline-Inc/ghostferry that referenced this issue May 12, 2020
If the inline-verifier is not enabled (as is the case for various
production uses of ghostferry), its binlog position can grow stale. In
some cases it points to such an old position that resuming from it
fails (if the source has already deleted such old replication logs).

This commit fixes this by relying solely on the binlog writer resume
position if the inline verifier is not enabled.

We still fail if the inline verifier *is* enabled and the position is
stale, but there is nothing one can do about that. If verification is
enabled, one must ensure that it's able to keep up with migration.

This fixes Shopify#184

Change-Id: I24cef00bb78c266705107b2b8b4008171186940d
kolbitsch-lastline pushed a commit to Lastline-Inc/ghostferry that referenced this issue May 13, 2020
If the inline-verifier is not enabled (as is the case for various
production uses of ghostferry), its binlog position can grow stale. In
some cases it points to such an old position that resuming from it
fails (if the source has already deleted such old replication logs).

This commit fixes this by relying solely on the binlog writer resume
position if the inline verifier is not enabled.

We still fail if the inline verifier *is* enabled and the position is
stale, but there is nothing one can do about that. If verification is
enabled, one must ensure that it's able to keep up with migration.

This fixes Shopify#184

Change-Id: Iec689d7651e533772642f366a0ff80c891d28157
kolbitsch-lastline added a commit to Lastline-Inc/ghostferry that referenced this issue May 13, 2020
* Fix resuming state without inline verifier

If the inline-verifier is not enabled (as is the case for various
production uses of ghostferry), its binlog position can grow stale. In
some cases it points to such an old position that resuming from it
fails (if the source has already deleted such old replication logs).

This commit fixes this by relying solely on the binlog writer resume
position if the inline verifier is not enabled.

We still fail if the inline verifier *is* enabled and the position is
stale, but there is nothing one can do about that. If verification is
enabled, one must ensure that it's able to keep up with migration.

This fixes Shopify#184
@shuhaowu
Copy link
Contributor

While having a inline verifier that cannot keep up with the event frequency is a problem that can't really be solved.

This would be a bug in the inline verifier then. The only way I can see this happening based on the code:

func (v *InlineVerifier) binlogEventListener(evs []DMLEvent) error {
if v.verifyDuringCutoverStarted.Get() {
return fmt.Errorf("cutover has started but received binlog event!")
}
for _, ev := range evs {
paginationKey, err := ev.PaginationKey()
if err != nil {
return err
}
v.reverifyStore.Add(ev.TableSchema(), paginationKey)
}
if v.StateTracker != nil {
v.StateTracker.UpdateLastResumableSourceBinlogPositionForInlineVerifier(evs[len(evs)-1].ResumableBinlogPosition())
}
return nil
}

is if the mutex that is protecting the reverifyStore.Add method causes a deadlock, as all other operations should be very fast. If that's the case, we should fix the InlineVerifier.

Have you observed something like this or were you just speculating on that point?

@kolbitsch-lastline
Copy link
Author

is if the mutex that is protecting the reverifyStore.Add method causes a deadlock, as all other operations should be very fast. If that's the case, we should fix the InlineVerifier.

maybe I did a bad job of explaining this. Sorry about that

It's not that the verifier is stuck, it's rather that it's not configured to run in the first place. That is, one could have it running for a first phase of a migration. Then, for some reason (for example after we feel confident that there is no data corruption and want to disable it for some reasons such as performance) we decide to disable the verifier.
At this point the position is "stuck" where the verifier left off, but it's now part of the state that is propagated across ghostferry invocations

Have you observed something like this or were you just speculating on that point?

I have observed this on 2 installations, but I'm running my fork'ed version. Checking the code, it seems that the upstream (your) version has the same code.

It would take me a bit to verify this on the upstream code, but I think conceptually the same problem exists there (it's not a "bug", it's really a conceptual issue). If your version somehow prevents the verifier state to be discarded if the verifier is disabled, you would work around the problem (which is essentially what my patch does), but I didn't see code for doing that

please also note the discussion on the original issue I opened:

#185

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants