dm validation binlog position is not accurate and not updated frequently enough #8463

Open
hihihuhu opened this issue Mar 7, 2023 · 13 comments
Labels
type/feature Issues about a new feature

Comments

@hihihuhu
Contributor

hihihuhu commented Mar 7, 2023

Is your feature request related to a problem?

  1. The current flushed binlog position for validation is recorded in the validator after it dispatches the event (https://github.com/pingcap/tiflow/blob/master/dm/syncer/data_validator.go#L989). However, that doesn't mean the worker has picked the event up and updated the pending count, since that happens in another thread. So "binlog position in the validator >= a specific binlog position and pending count = 0" doesn't guarantee the downstream is fully caught up with the upstream (see the sketch below this list).

  2. Also, the frequency at which the position is updated is currently tied to the meta flush interval, which is too infrequent. To use the validation information to determine whether the downstream has caught up with the upstream during a failover, the binlog position needs to be updated in near real time to avoid extended downtime. This should be entirely possible, since it is just in-memory metadata.
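
To make item 1 concrete, here is a minimal, self-contained Go sketch of the dispatch pattern described there (hypothetical names and a simplified model, not the actual DM code): the dispatcher records the position as soon as it hands the event off, so a "position reached and pending == 0" check can pass before the worker has even counted the event.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Simplified, hypothetical model of the validator/worker split described
// above; not the actual DM code.
type validator struct {
	recordedPos int64 // position recorded by the dispatching goroutine
	pending     int64 // counter maintained by the worker goroutine
	jobs        chan int64
}

func (v *validator) dispatch(pos int64) {
	v.jobs <- pos                          // hand the event to the worker
	atomic.StoreInt64(&v.recordedPos, pos) // position moves forward immediately
}

func (v *validator) worker() {
	for pos := range v.jobs {
		atomic.AddInt64(&v.pending, 1) // counted only once the worker picks it up
		validate(pos)
		atomic.AddInt64(&v.pending, -1)
	}
}

// validate stands in for checking the row change against the downstream.
func validate(int64) { time.Sleep(10 * time.Millisecond) }

func main() {
	v := &validator{jobs: make(chan int64, 1024)}
	go v.worker()

	v.dispatch(100)
	// This check can pass while the event for pos=100 still sits in the
	// channel, i.e. before the worker has counted it as pending, let alone
	// validated it against the downstream.
	if atomic.LoadInt64(&v.recordedPos) >= 100 && atomic.LoadInt64(&v.pending) == 0 {
		fmt.Println("looks caught up, but pos=100 may not be validated yet")
	}
}
```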

Describe the feature you'd like

For 1, increment the pending event counter in the data validator instead of in the worker thread.
For 2, rename flushedLoc to currentLoc and update it every time an event is dispatched.
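
A rough sketch of what this could look like, reusing the same hypothetical model as above (currentLoc and the other names are illustrative, not the actual DM fields):

```go
package main

import "sync/atomic"

// Hypothetical sketch of the suggested change (not the actual DM code): count
// the event as pending and advance an in-memory currentLoc at dispatch time,
// so "currentLoc >= X && pending == 0" only holds once every dispatched event
// has also been validated. flushedLoc would keep its checkpoint meaning.
type validator struct {
	currentLoc int64 // advanced on every dispatched event
	pending    int64 // incremented by the dispatcher, decremented by workers
	jobs       chan int64
}

func (v *validator) dispatch(pos int64) {
	atomic.AddInt64(&v.pending, 1)        // point 1: count in the dispatcher
	atomic.StoreInt64(&v.currentLoc, pos) // point 2: real-time in-memory position
	v.jobs <- pos
}

func (v *validator) worker() {
	for range v.jobs {
		// ... validate the row change against the downstream ...
		atomic.AddInt64(&v.pending, -1)
	}
}

func main() {
	v := &validator{jobs: make(chan int64, 1024)}
	go v.worker()
	v.dispatch(100)
}
```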

Describe alternatives you've considered

No response

Teachability, Documentation, Adoption, Migration Strategy

No response

@hihihuhu hihihuhu added the type/feature Issues about a new feature label Mar 7, 2023
@lance6716
Contributor

/cc @D3Hunter @buchuitoudegou

@D3Hunter
Copy link
Contributor

D3Hunter commented Mar 8, 2023

The binlog position printed in the validator status is the position the validator has processed up to, so if it is >= a specific binlog position A and the pending count = 0, it DOES guarantee the downstream is fully caught up with the upstream at position A.

The validator checkpoint includes not only the position but also pending data that may need to be retried. Since the pending data can be quite large, depending on how data changes on the upstream, flushing the checkpoint too frequently may slow the validator down and put more pressure on the downstream, so the interval is set to a fairly large value (5 minutes).
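
For readers following along, a conceptual sketch of what such a checkpoint carries (illustrative only; the real DM schema and structs differ), which is why flushing it often can be expensive:

```go
package main

import "fmt"

// pendingRow is an unvalidated or failed row that must be retried after a
// restart; the pending set can grow large under heavy upstream writes.
type pendingRow struct {
	Schema, Table string
	Key           string // primary-key value identifying the row
	Failures      int
}

// validatorCheckpoint is a conceptual sketch, not the actual DM structure.
type validatorCheckpoint struct {
	BinlogFile  string // position the validator has processed up to
	BinlogPos   uint32
	PendingRows []pendingRow
}

func main() {
	cp := validatorCheckpoint{
		BinlogFile:  "mysql-bin.000042",
		BinlogPos:   19420,
		PendingRows: []pendingRow{{Schema: "shop", Table: "orders", Key: "1001", Failures: 2}},
	}
	fmt.Printf("%+v\n", cp)
}
```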

@hihihuhu
Contributor Author

hihihuhu commented Mar 8, 2023

Thanks for the reply! Could you help me understand this better? My understanding is that processRowsEvent (https://github.com/pingcap/tiflow/blob/master/dm/syncer/data_validator.go#L694) dispatches the event, then the worker thread consumes the mutation asynchronously (https://github.com/pingcap/tiflow/blob/master/dm/syncer/validate_worker.go#L118) and increases the pending event count. So isn't it possible that the data validator goes ahead and records the binlog position while the validation worker is very slow and hasn't picked up the mutation yet, which would break the assumption you mentioned?

I agree the checkpoint should not be flushed frequently. However, the frequency of updating the in-memory binlog position doesn't have to be the same as that of flushing the checkpoint, does it? All other fields in the status are updated in real time, like the pending count, but not the binlog position.

@D3Hunter
Contributor

D3Hunter commented Mar 9, 2023

validator.flushedLoc is the location we saved to the checkpoint last time; all events before it have either been validated or are in the checkpoint's pending data waiting to be retried.

```go
flushedLoc *binlog.Location
```

> while the validation worker is very slow and hasn't picked up the mutation yet

Before flushing, we dispatch a flush job to all workers and wait until every worker has processed all jobs before that location.

> data validator goes ahead and record the binlog position

This position is stored in the validator loop. currEndLoc moves forward every time we receive an event; locationForFlush changes only when we hit a transaction boundary. Those two variables reflect the location we have received or dispatched, not the location the workers have processed.

```go
currEndLoc := v.streamerController.GetCurEndLocation()
locationForFlush = v.streamerController.GetTxnEndLocation()
```
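
A minimal sketch of the flush-barrier idea described above, with hypothetical names (the real logic lives in data_validator.go and validate_worker.go): a marker job is sent to every worker, and the location is persisted as flushedLoc only after all of them have drained everything dispatched before the marker.

```go
package main

import (
	"fmt"
	"sync"
)

// job is either a row change to validate or a flush marker.
type job struct {
	flush bool
	pos   int64
	done  *sync.WaitGroup // signalled by the worker when the flush marker is reached
}

func worker(jobs <-chan job) {
	for j := range jobs {
		if j.flush {
			// Everything dispatched to this worker before the marker has
			// already been handled here, because the channel is FIFO.
			j.done.Done()
			continue
		}
		// ... validate the row change at j.pos against the downstream ...
	}
}

func main() {
	const workers = 4
	chans := make([]chan job, workers)
	for i := range chans {
		chans[i] = make(chan job, 1024)
		go worker(chans[i])
	}

	// Dispatch some row-change jobs (round-robin for the sketch).
	for pos := int64(1); pos <= 100; pos++ {
		chans[pos%workers] <- job{pos: pos}
	}

	// Flush barrier: send a marker to every worker and wait for all of them.
	var wg sync.WaitGroup
	wg.Add(workers)
	for _, c := range chans {
		c <- job{flush: true, done: &wg}
	}
	wg.Wait()

	// Only now is it safe to persist flushedLoc = last dispatched location.
	fmt.Println("all workers drained; persist flushedLoc here")
}
```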

@hihihuhu
Contributor Author

hihihuhu commented Mar 9, 2023

Oh, I see, thanks, that makes sense!

Then I think the question is whether it is possible to update the binlog position closer to real time. This is critical for us because during the Aurora -> TiDB migration we will cut all Aurora connections, wait for TiDB to catch up (binlog position + pending event check in validation), and then route traffic to TiDB. So if updating the binlog position relies on flushing the checkpoint, that means very long downtime during the failover process.
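
To make that concrete, this is the kind of wait loop our migration tooling would run during cutover; getValidatorStatus is a placeholder for however the status is fetched (query-status / OpenAPI), not an existing DM call, and the loop only gives short downtime if the reported position is near real time.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical status shape; in reality the values would come from
// query-status or the DM OpenAPI, which this sketch does not model.
type validatorStatus struct {
	BinlogFile string
	BinlogPos  uint32
	Pending    int
}

// getValidatorStatus is a placeholder for however the tool fetches status.
var getValidatorStatus func(ctx context.Context) (validatorStatus, error)

// reached reports whether status s has passed the target (file, pos).
// Simplified: assumes binlog file names share one base name and sort lexically.
func reached(s validatorStatus, file string, pos uint32) bool {
	if s.BinlogFile != file {
		return s.BinlogFile > file
	}
	return s.BinlogPos >= pos
}

// waitCaughtUp polls until the validator has passed the target position with
// no pending rows, or the context deadline is hit.
func waitCaughtUp(ctx context.Context, file string, pos uint32) error {
	tick := time.NewTicker(200 * time.Millisecond)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-tick.C:
			s, err := getValidatorStatus(ctx)
			if err != nil {
				return err
			}
			if reached(s, file, pos) && s.Pending == 0 {
				return nil
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	getValidatorStatus = func(context.Context) (validatorStatus, error) {
		return validatorStatus{}, errors.New("wire this up to query-status / OpenAPI")
	}
	fmt.Println(waitCaughtUp(ctx, "mysql-bin.000042", 4))
}
```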

do you think we could improve this? thanks!

@D3Hunter
Contributor

Does the switch process look like this:

  • Stop writes to Aurora and get the binlog position
  • Wait for DM to sync data to that position (about 30s with the default config, to flush the syncer checkpoint). Forcibly killing connections won't work since the syncer relies on a transaction boundary or heartbeat event to trigger the checkpoint flush
    • and since the DM validator never validates ahead of the syncer's progress, it cannot reach that position before the syncer does
  • Wait until the DM validator has reached the position and there is no pending data
  • Switch to TiDB

and what's the max allowed downtime in your case?

It can be done; the question is just how to do it in a good way.

@hihihuhu
Contributor Author

hihihuhu commented Mar 10, 2023

Yes, that is pretty much the flow at a high level. One more thing in our system is that we have a proxy in front, so we can kill connections and switch backends in it at runtime. Currently our Aurora -> Aurora failover has 2-3 seconds of downtime, so it would be hard for us if the downtime of Aurora -> TiDB were longer than perhaps 10 seconds. So right now we are shooting for less than 10 seconds on the happy path.

In our testing we set checkpoint-flush-interval to 1s, but we cannot reduce meta-flush-interval to less than 1 hour, especially when the job has just started, because there would be many mutations to flush out for validation.

@D3Hunter
Contributor

One way to achieve this might be (a rough sketch follows the list):

  • After the upstream stops writes, send a cutover command to the DM job
  • Then DM gets the current master binlog position, and the syncer (since the validator never validates ahead of the syncer's progress, we need it to reach the master position too) and the validator compare their current progress against the master position at some small interval (maybe in another goroutine)
  • When the syncer/validator progress >= the master position, flush the checkpoint and exit this mode.
  • Then query-status will show that the validator progress has reached the master position
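
A rough sketch of such a cutover mode, where every name (position, runCutover, flushCheckpoint) is a placeholder rather than existing DM code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// position is a simplified (file, offset) binlog position for this sketch.
type position struct {
	file string
	pos  uint32
}

func (p position) less(o position) bool {
	if p.file != o.file {
		return p.file < o.file // same base name assumed; simplified comparison
	}
	return p.pos < o.pos
}

// runCutover polls the component's progress at a small interval and, once it
// has reached the captured master position, flushes the checkpoint and exits
// cutover mode so that query-status reflects the final position.
func runCutover(ctx context.Context, masterPos position,
	progress func() position, flushCheckpoint func(context.Context) error) error {
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-tick.C:
			if !progress().less(masterPos) {
				return flushCheckpoint(ctx)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	master := position{file: "mysql-bin.000042", pos: 19420}
	cur := position{file: "mysql-bin.000042", pos: 0}
	err := runCutover(ctx, master,
		func() position { cur.pos += 10000; return cur }, // stand-in for real progress
		func(context.Context) error { fmt.Println("flush checkpoint"); return nil })
	if err != nil && !errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("cutover error:", err)
	}
}
```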

@hihihuhu
Contributor Author

hihihuhu commented Apr 14, 2023

Yes, I think that could work as well, thanks! Also, it would be great if the cutover API were part of https://github.com/pingcap/tiflow/blob/master/dm/pb/dmmaster.pb.go#L3756, which we use programmatically in our migration tools.

@okJiang
Member

okJiang commented Sep 26, 2023

If there are still problems, you can reopen this issue.

@okJiang okJiang closed this as completed Sep 26, 2023
@hihihuhu
Contributor Author

@okJiang could you help link the PR implementing this request, and perhaps add examples of how to use it? Thanks!

@okJiang
Member

okJiang commented Sep 27, 2023

I carefully reviewed the previous Q&A and found that the answers already address the original questions, but you still want to be able to use the API directly. I think this is still a feature request that has yet to be implemented, so I am reopening this issue. Sorry for accidentally closing it @hihihuhu

@okJiang okJiang reopened this Sep 27, 2023
@Frank945946

@hihihuhu It has been released in v7.5; see the user doc.
