drop index issued right after add index finished on the primary cluster was retried repeatedly by TiCDC because add index was still running on the secondary; after a while the changefeed status became failed #10682
Comments
/remove-area dm
/assign sdojjy
From the TiCDC point of view, this is not a bug. In this case, we found that the user can increase the read timeout value in the changefeed config to work around this issue.
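For reference, on a MySQL/TiDB sink the read timeout is usually controlled through the read-timeout parameter of the changefeed's sink URI. The sketch below uses placeholder hosts, credentials, changefeed id and a 5m value, and the exact flags may differ by version:

```sh
# Sketch only, assuming a MySQL/TiDB sink: pause the changefeed, raise the
# read/write timeouts in the sink URI, then resume it. All values are placeholders.
cdc cli changefeed pause  --server=http://127.0.0.1:8300 -c <changefeed-id>
cdc cli changefeed update --server=http://127.0.0.1:8300 -c <changefeed-id> \
  --sink-uri="mysql://root:@<secondary-host>:4000/?read-timeout=5m&write-timeout=5m"
cdc cli changefeed resume --server=http://127.0.0.1:8300 -c <changefeed-id>
```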
This issue happens in the following scenario:
This might result in changefeed failure if add index is followed by another DDL (not only drop index) and the add index is not able to finish in the downstream within 20 retries.
I suggest that, after a DDL retry has failed because of a timeout, we check whether there are any running DDL jobs in the downstream. If there are, instead of scheduling the third retry we poll until the DDL job queue is empty. Polling after a retry failure allows us to proceed after the "successful" retry of ADD INDEX, while preventing the DROP INDEX from retrying 20 times. This should be considered an optimization and does not necessarily work in all cases, including:
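As a rough, manual illustration of the downstream check this suggestion relies on (not actual TiCDC code; the host, port, credentials and sleep interval below are placeholders), one could poll the secondary's DDL job queue with ADMIN SHOW DDL JOBS and wait until nothing is running or queueing:

```sh
# Sketch only: wait until the downstream (secondary) has no DDL job in the
# "running" or "queueing" state before letting the next DROP INDEX retry go out.
# Host, port, credentials and the sleep interval are placeholders.
while mysql -h <secondary-host> -P 4000 -u root -N \
    -e "ADMIN SHOW DDL JOBS 10" | grep -Eq 'running|queueing'; do
  sleep 10
done
```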
What did you do?
1. Restore data on the primary and secondary clusters.
2. Create changefeeds and set the BDR role for the primary and secondary.
3. Run sysbench on the primary and secondary.
4. Add an index, then drop it as soon as the add index has finished on the primary (see the sketch below for the rough DDL sequence).
5. Inject a network partition between one of the TiKV pods and all other pods, lasting 3 minutes, then recover.
chaos recovery time: 2024-02-29 17:29:24
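For step 4, the DDL sequence was roughly of the following shape (table, column and index names here are placeholders rather than the exact ones used in the test):

```sh
# Placeholder sketch of the step-4 DDL sequence, run against the primary:
# DROP INDEX is issued as soon as ADD INDEX finishes on the primary, while the
# replicated ADD INDEX may still be running on the secondary.
mysql -h <primary-host> -P 4000 -u root -e "ALTER TABLE sbtest.sbtest1 ADD INDEX idx_c (c);"
# ...wait until the ADD INDEX job has finished on the primary, then:
mysql -h <primary-host> -P 4000 -u root -e "ALTER TABLE sbtest.sbtest1 DROP INDEX idx_c;"
```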
ticdc logs:
endless-ha-test-ticdc-tps-7080582-1-976-tc-ticdc-0.tar.gz
What did you expect to see?
The changefeed status stays normal.
After the fault recovers, the DDLs can be synced successfully and the changefeed lag can return to normal.
What did you see instead?
1. drop index was retried repeatedly even though it was still queueing on the secondary because add index was running.
primary:
secondary:
2. After a while, the changefeed status became failed.
changefeed status:
Versions of the cluster
./cdc version
Release Version: v8.0.0-alpha
Git Commit Hash: 25ce29c
Git Branch: heads/refs/tags/v8.0.0-alpha
UTC Build Time: 2024-02-27 11:37:29
Go Version: go version go1.21.6 linux/amd64
Failpoint Build: false
current status of DM cluster (execute query-status <task-name> in dmctl): No response