dm validation binlog position is not accurate and not updated frequently enough #8463
The binlog position printed in the validator status is the position the validator has processed up to, so if it is >= a given position, that data has been processed. The validator checkpoint includes not only the position but also pending data that may need retry; since the pending data can be quite large depending on how data changes on the upstream, flushing the checkpoint too frequently may slow the validator down and add more pressure to the downstream, so the interval is set to a fairly large value (5 minutes).
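To make the trade-off above concrete, here is a minimal illustrative sketch of an interval-gated checkpoint flush; the names (validator, checkpointFlushInterval, maybeFlush) are hypothetical and not the actual tiflow code:

```go
// Illustrative sketch only; these names are made up, not the real validator.
package main

import (
	"fmt"
	"time"
)

type validator struct {
	checkpointFlushInterval time.Duration // e.g. 5 * time.Minute per the current default
	lastFlushTime           time.Time
}

// maybeFlush flushes the checkpoint (position + pending rows) only when the
// interval has elapsed, trading status freshness for less downstream pressure.
func (v *validator) maybeFlush() {
	if time.Since(v.lastFlushTime) < v.checkpointFlushInterval {
		return // skip: flushing pending data too often would slow validation down
	}
	v.lastFlushTime = time.Now()
	fmt.Println("flushing validator checkpoint and pending rows")
}

func main() {
	v := &validator{checkpointFlushInterval: 5 * time.Minute, lastFlushTime: time.Now()}
	v.maybeFlush() // no-op until the interval has elapsed
}
```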
Thanks for the reply! Could you help me understand this better? I agree the checkpoint should not be flushed frequently; however, does the frequency of updating the in-memory binlog position have to be the same as that of flushing checkpoints? All other fields in the status are updated in real time, like the pending count, but not the binlog position.
tiflow/dm/syncer/data_validator.go, line 203 in 5d9f185.
Before the flush we dispatch (tiflow/dm/syncer/data_validator.go, line 972 in 5d9f185), and this position is stored in the validator loop (tiflow/dm/syncer/data_validator.go, lines 671 to 672 in 5d9f185).
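Taken together, the references above describe roughly the following shape, sketched here with stand-in names (validatorSketch, flushedLoc, onEvent, onFlush) rather than the real data_validator.go code: the position reported in status is only advanced on the flush path, not when an event is dispatched.

```go
// Rough sketch of the ordering described above; names are stand-ins only.
package main

import (
	"fmt"
	"sync"
)

type location struct {
	file string
	pos  uint32
}

type validatorSketch struct {
	mu         sync.Mutex
	flushedLoc location // only advanced on the (infrequent) checkpoint-flush path
}

// onEvent hands the event to the workers; it does NOT advance flushedLoc,
// which is why the position shown in status lags behind what was dispatched.
func (v *validatorSketch) onEvent(loc location) {
	_ = loc // dispatch to worker channels here
}

// onFlush runs on the checkpoint-flush interval and records the position
// that a later status query reports.
func (v *validatorSketch) onFlush(loc location) {
	v.mu.Lock()
	v.flushedLoc = loc
	v.mu.Unlock()
}

func main() {
	v := &validatorSketch{}
	v.onEvent(location{file: "mysql-bin.000001", pos: 4})
	fmt.Printf("status still shows %+v until the next flush\n", v.flushedLoc)
	v.onFlush(location{file: "mysql-bin.000001", pos: 4})
}
```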
Oh, I see, thanks, that makes sense! Then I think the question is whether it is possible to update the binlog position closer to real time. This is critical for us because during the Aurora -> TiDB migration we will cut all Aurora connections, wait for TiDB to catch up (binlog position + pending event check in validation), then route traffic to TiDB. So if updating the binlog position relies on flushing the checkpoint, it means very long downtime in the failover process. Do you think we could improve this? Thanks!
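For illustration, the wait step of the cutover described above might look roughly like the sketch below; fetchValidatorStatus, compareBinlogPos, and the hard-coded values are hypothetical stand-ins for however the migration tool queries DM, and the point is just that the loop is only as fresh as the position the validator reports.

```go
// Hypothetical cutover wait loop; the helpers below are made up for the sketch.
package main

import (
	"fmt"
	"time"
)

type validatorStatus struct {
	binlogPos    string // position the validator reports having processed
	pendingCount int    // rows still waiting for (re)validation
}

// fetchValidatorStatus would call dmctl / the DM master API in a real tool.
func fetchValidatorStatus() validatorStatus {
	return validatorStatus{binlogPos: "mysql-bin.000003:1024", pendingCount: 0}
}

// compareBinlogPos returns true when got is at or beyond want (details elided).
func compareBinlogPos(got, want string) bool { return got >= want }

func waitForCatchUp(targetPos string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		s := fetchValidatorStatus()
		// This check is only useful if binlogPos is updated in (near) real time;
		// if it only advances on checkpoint flush, the loop may block for minutes.
		if compareBinlogPos(s.binlogPos, targetPos) && s.pendingCount == 0 {
			return nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("downstream did not catch up within %s", timeout)
}

func main() {
	if err := waitForCatchUp("mysql-bin.000003:1024", 10*time.Second); err != nil {
		fmt.Println(err)
	}
}
```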
Does the switch process go like this:
And what's the max allowed downtime in your case? It can be done; the question is just how to do it in a good way.
Yes, that is pretty much the flow at a high level. One more thing in our setup is that we have a proxy in front, so we can kill connections and switch backends in it at runtime. Currently our Aurora -> Aurora failover has 2-3 seconds of downtime, so it would be hard for us if the downtime of Aurora -> TiDB were longer than perhaps 10 seconds. So right now we are shooting for less than 10 seconds on the happy path. In our testing, we set
One way to achieve this might be:
Yes, I think that could work as well, thanks! Also, it would be great if the cutover API were part of https://github.com/pingcap/tiflow/blob/master/dm/pb/dmmaster.pb.go#L3756, which we use programmatically in our migration tools.
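Since no such RPC exists yet, the following is a purely hypothetical sketch of what a cutover-oriented call on the master service could look like for programmatic callers; every type and method name here (ValidationClient, GetValidationPosition, and so on) is invented for illustration.

```go
// Purely hypothetical sketch; none of these types or methods exist today.
package main

import (
	"context"
	"fmt"
)

type GetValidationPositionRequest struct {
	TaskName string
}

type GetValidationPositionResponse struct {
	ProcessedBinlogPos string
	PendingRowCount    int64
}

// ValidationClient stands in for a client that would be generated from
// dmmaster.proto if the feature requested in this issue were implemented.
type ValidationClient interface {
	GetValidationPosition(ctx context.Context, req *GetValidationPositionRequest) (*GetValidationPositionResponse, error)
}

// readyToCutOver is the check a migration tool would run right before
// switching traffic to TiDB.
func readyToCutOver(ctx context.Context, c ValidationClient, task, targetPos string) (bool, error) {
	resp, err := c.GetValidationPosition(ctx, &GetValidationPositionRequest{TaskName: task})
	if err != nil {
		return false, err
	}
	return resp.ProcessedBinlogPos >= targetPos && resp.PendingRowCount == 0, nil
}

// stubClient fakes the RPC so the sketch is runnable on its own.
type stubClient struct{}

func (stubClient) GetValidationPosition(ctx context.Context, req *GetValidationPositionRequest) (*GetValidationPositionResponse, error) {
	return &GetValidationPositionResponse{ProcessedBinlogPos: "mysql-bin.000003:1024"}, nil
}

func main() {
	ok, err := readyToCutOver(context.Background(), stubClient{}, "my-task", "mysql-bin.000003:1024")
	fmt.Println(ok, err)
}
```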
If there are still problems, feel free to reopen this issue.
@okJiang could you help link the PR implementing this request, perhaps with examples of how to use it? Thanks!
I carefully browsed the previous Q&A and found that the answer to your question has already been given. But you still want to be able to use the API directly, which I think is still a feature request that has yet to be implemented, so I am reopening this issue. Sorry for accidentally closing it, @hihihuhu.
Is your feature request related to a problem?
The current flushed binlog position for validation is recorded in the validator after it dispatches the event (https://github.com/pingcap/tiflow/blob/master/dm/syncer/data_validator.go#L989). However, that doesn't mean the worker has picked the event up and updated the pending count, since that happens in another thread. Thus
binlog position in validator >= a specific binlog position and pending count = 0
doesn't guarantee the downstream has fully caught up with the upstream. Also, the frequency of updating the position is currently tied to the meta flush interval, which is too infrequent. In order to use the validation information to determine whether the downstream has caught up with the upstream during the failover process, the binlog position needs to be updated in real time to avoid extended downtime, which should be entirely possible because it is just in-memory metadata.
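A minimal sketch of the race described in point 1, using a hypothetical structure rather than the real validator: the dispatcher records the position as soon as it hands a row to the worker channel, while the pending counter is only bumped later by the worker goroutine, so a status snapshot taken in between can report an advanced position with a pending count of zero.

```go
// Minimal race illustration; the structure is hypothetical, not the real code.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var (
		processedPos atomic.Int64 // updated by the dispatcher
		pendingRows  atomic.Int64 // updated by the worker, in another goroutine
		wg           sync.WaitGroup
	)
	rows := make(chan int64, 16)

	// Worker goroutine: a row only counts as pending once it has been picked up.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for pos := range rows {
			pendingRows.Add(1)
			_ = pos // validate against the downstream here, then mark it done
			pendingRows.Add(-1)
		}
	}()

	// Dispatcher: hands the row off and records the position immediately.
	rows <- 100
	processedPos.Store(100)

	// A status snapshot taken here can pass even though row 100 may still be
	// sitting unprocessed in the channel.
	if processedPos.Load() >= 100 && pendingRows.Load() == 0 {
		fmt.Println("reported as caught up, but row 100 may not be validated yet")
	}

	close(rows)
	wg.Wait()
}
```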
Describe the feature you'd like
For 1, increase the pending event counter in the data validator instead of in the worker thread.
For 2, rename flushedLoc to currentLoc and update it every time an event is dispatched.
Describe alternatives you've considered
No response
Teachability, Documentation, Adoption, Migration Strategy
No response