Closed
Description
Result of troubleshooting a stuck migration with the following symptoms:
- running, copy at 100%
- heartbeat lag growing without bound
- no errors or crashes
The gh-ost logs showed the following between the last "healthy" progress log and the one where heartbeat lag started to grow:
[gh-ost] : 2025-10-31 21:00:37 INFO rotate to next log from binlog.1000000:0 to binlog.1000000
[gh-ost] : 2025-10-31 21:00:37 INFO rotate to next log from binlog.1000000:104865545 to binlog.1000000
[gh-ost] : 2025-10-31 20:59:52 INFO rotate to next log from binlog.999999:0 to binlog.999999
[gh-ost] : 2025-10-31 20:59:52 INFO rotate to next log from binlog.999999:0 to binlog.999999
...
[gh-ost] : 2025-10-31 20:59:52 INFO rotate to next log from binlog.999999:0 to binlog.999999
[gh-ost] : 2025-10-31 20:59:52 INFO rotate to next log from binlog.999999:104997866 to binlog.999999
[gh-ost] : 2025-10-31 20:59:52 INFO rotate to next log from binlog.999999:0 to binlog.999999
[gh-ost] : [2025/10/31 20:59:52] [info] binlogsyncer.go:868 rotate to (binlog.999999, 4)
This confirmed my hypothesis that the streamer erroneously started dropping all new events: the current SmallerThan implementation compares binlog filenames lexicographically, so binlog.999999 is treated as greater than binlog.1000000:
gh-ost/go/binlog/gomysql_reader.go
Lines 85 to 88 in 48b34bc
    if this.currentCoordinates.SmallerThanOrEquals(&this.LastAppliedRowsEventHint) {
        this.migrationContext.Log.Debugf("Skipping handled query at %+v", this.currentCoordinates)
        return nil
    }
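To illustrate the ordering problem, here is a minimal, self-contained sketch. It is not gh-ost's code and not necessarily what the PR does: `binlogSeq` and `fileSmallerThan` are hypothetical helpers, and the sketch assumes binlog file names end in a numeric sequence after the last dot and share the same base name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// binlogSeq is a hypothetical helper: it extracts the numeric suffix of a
// binlog file name such as "binlog.999999".
func binlogSeq(name string) (int64, error) {
	idx := strings.LastIndex(name, ".")
	if idx < 0 {
		return 0, fmt.Errorf("no sequence suffix in %q", name)
	}
	return strconv.ParseInt(name[idx+1:], 10, 64)
}

// fileSmallerThan is a hypothetical comparison that orders binlog files by
// their numeric sequence. It falls back to plain string comparison when a
// suffix cannot be parsed. It assumes both names share the same base name.
func fileSmallerThan(a, b string) bool {
	seqA, errA := binlogSeq(a)
	seqB, errB := binlogSeq(b)
	if errA != nil || errB != nil {
		return a < b
	}
	return seqA < seqB
}

func main() {
	oldFile, newFile := "binlog.999999", "binlog.1000000"

	// Lexicographic comparison: '1' sorts before '9', so the newer file
	// looks "smaller" and its events would be skipped as already handled.
	fmt.Println("lexicographic, new < old:", newFile < oldFile)

	// Numeric comparison of the suffix restores the expected order.
	fmt.Println("numeric, old < new:", fileSmallerThan(oldFile, newFile))
}
```

With a string comparison, every coordinate in binlog.1000000 compares as smaller than the last applied hint in binlog.999999, which matches the symptom of all new events being skipped after the rotation.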
Please note that BinlogFile was changed recently (005043d#diff-0b91aa3798ba83a920a77a09b6adf3bfdffbf3cf5f22e323b66753f7affb8ebd), but I think it still makes sense to apply the proposed fix.
Opening a PR shortly ⌛